CyberSec Leaderboard
Evaluating Large Language Models on real-world cybersecurity knowledge — powered by CyberSec-Bench
- 200 Expert Questions
- 5 Categories
- EN / FR Bilingual
- 3 Models Evaluated
Model Rankings
Models are ranked by Overall Score on CyberSec-Bench. Scores are colour-coded: green ≥ 80%, yellow 60-79%, red < 60%.
1 | AYI-NEDJIMI/CyberSec-Assistant-3B | 61.1 | 40.6 | 34.4 | 33.5 | 38.1 | 42.9 | 1 | 2026-02-20 |
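As a minimal sketch of that colour rule (not the leaderboard's actual rendering code), the thresholds map to bands like this:

```python
def score_colour(score: float) -> str:
    """Map a 0-100 score to the leaderboard's colour band."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"

# Example scores, one per band.
for s in (85.0, 65.0, 42.9):
    print(s, score_colour(s))  # 85.0 green / 65.0 yellow / 42.9 red
```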
Performance Radar: radar chart of per-category scores for each evaluated model.
CyberSec-Bench
CyberSec-Bench is a bilingual (English / French) benchmark designed to evaluate the cybersecurity knowledge of Large Language Models and AI systems. It contains 200 expert-crafted questions spanning five critical domains of cybersecurity, each with detailed reference answers.
The benchmark tests real-world professional-level knowledge, covering international standards, attack frameworks, defensive operations, forensic investigation, and cloud security architectures.
Dataset: AYI-NEDJIMI/CyberSec-Bench
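The dataset can be loaded with the datasets library. A minimal sketch, assuming a train split and illustrative column names (question, reference_answer, category, language) that may differ from the actual schema:

```python
from datasets import load_dataset

# Load CyberSec-Bench from the Hugging Face Hub (split name is an assumption).
ds = load_dataset("AYI-NEDJIMI/CyberSec-Bench", split="train")
print(len(ds))  # expected: 200 questions

# Column names below are assumptions about the schema, not guaranteed fields.
example = ds[0]
print(example.get("category"), example.get("language"))
print(example.get("question"))
```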
Category Breakdown
Compliance (40 questions)
ISO 27001, GDPR/RGPD, NIS2, DORA, AI Act. Tests understanding of regulatory frameworks, risk management, and audit processes.
Offensive Security (40 questions)
MITRE ATT&CK, OWASP Top 10, penetration testing, vulnerability assessment, red team operations, and exploit development.
Defensive Security (40 questions)
SOC operations, incident response, SIEM/SOAR, threat detection, network security monitoring, and security architecture design.
Digital Forensics (40 questions)
Evidence collection & preservation, memory forensics, disk imaging, malware analysis, chain of custody, and timeline reconstruction.
Cloud Security (40 questions)
Azure, AWS, GCP security controls, Zero Trust architecture, DevSecOps, container security, IAM, and cloud compliance.
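For downstream analysis, the breakdown above can be summarised as a simple mapping; this structure is illustrative and is not shipped with the benchmark:

```python
# Five categories of 40 questions each (5 x 40 = 200).
CATEGORY_TOPICS = {
    "Compliance": ["ISO 27001", "GDPR/RGPD", "NIS2", "DORA", "AI Act"],
    "Offensive Security": ["MITRE ATT&CK", "OWASP Top 10", "penetration testing", "red team"],
    "Defensive Security": ["SOC operations", "incident response", "SIEM/SOAR", "threat detection"],
    "Digital Forensics": ["memory forensics", "disk imaging", "malware analysis", "chain of custody"],
    "Cloud Security": ["Azure", "AWS", "GCP", "Zero Trust", "DevSecOps", "IAM"],
}
QUESTIONS_PER_CATEGORY = 40
assert len(CATEGORY_TOPICS) * QUESTIONS_PER_CATEGORY == 200
```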
Difficulty Distribution
| Difficulty | Questions | Share | Description |
|---|---|---|---|
| Intermediate | 60 | 30% | Core concepts and standard procedures |
| Advanced | 80 | 40% | Complex scenarios requiring deep knowledge |
| Expert | 60 | 30% | Cutting-edge challenges and multi-domain reasoning |
Evaluation Methodology
Each model response is scored against expert reference answers using a combination of:
- Semantic similarity to the reference answer
- Key concept coverage (specific technical terms and frameworks mentioned)
- Accuracy verification (factual correctness of claims)
- Completeness assessment (depth and breadth of the response)
Scores are normalised to a 0-100% scale per category, and the Overall Score is the weighted average across all five categories.
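A minimal sketch of how such a composite score could be computed, assuming illustrative component weights, an off-the-shelf embedding model, and hypothetical helper names; this is not the benchmark team's actual pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative component weights (assumed, not the official weighting).
WEIGHTS = {"similarity": 0.4, "coverage": 0.3, "accuracy": 0.2, "completeness": 0.1}

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption

def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity between the response and the reference answer, clipped to [0, 1]."""
    emb = _embedder.encode([answer, reference], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb[0], emb[1])))

def concept_coverage(answer: str, key_concepts: list[str]) -> float:
    """Fraction of expected technical terms and frameworks mentioned in the response."""
    text = answer.lower()
    return sum(1 for c in key_concepts if c.lower() in text) / max(len(key_concepts), 1)

def question_score(answer, reference, key_concepts, accuracy, completeness) -> float:
    """Weighted 0-100 score; accuracy and completeness (0-1) are assumed to come from a separate judge."""
    parts = {
        "similarity": semantic_similarity(answer, reference),
        "coverage": concept_coverage(answer, key_concepts),
        "accuracy": accuracy,
        "completeness": completeness,
    }
    return 100.0 * sum(WEIGHTS[k] * v for k, v in parts.items())

def overall_score(category_scores: dict[str, float], category_weights: dict[str, float]) -> float:
    """Overall Score: weighted average of the five per-category scores (0-100)."""
    total = sum(category_weights.values())
    return sum(category_weights[c] * s for c, s in category_scores.items()) / total
```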
Submit Your Model for Evaluation
Submit a Hugging Face model to be evaluated on the full CyberSec-Bench benchmark.
1. Enter the Hugging Face model ID below (e.g. organization/model-name)
2. Your submission is recorded and added to our evaluation queue
3. Evaluation is performed manually by the benchmark team
4. Once evaluation is complete, your model will appear on the leaderboard
Note: Evaluation is performed manually. Submit your model and we will evaluate it within a few days. Models must be publicly accessible on Hugging Face and support text generation.
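Before submitting, you can do a rough pre-check of the accessibility and text-generation requirements with huggingface_hub; the helper below is a hypothetical convenience, not part of the official submission flow:

```python
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

def looks_submittable(model_id: str) -> bool:
    """Rough pre-check: the repo exists publicly and is tagged for text generation."""
    try:
        info = model_info(model_id)
    except (RepositoryNotFoundError, GatedRepoError):
        return False  # missing, private, or gated repos cannot be evaluated
    return info.pipeline_tag in ("text-generation", "text2text-generation")

print(looks_submittable("organization/model-name"))  # replace with your model ID
```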
Submission Guidelines
- The model must be publicly accessible on Hugging Face
- The model must support text generation (causal LM or instruction-tuned)
- We evaluate using the model's default generation parameters
- Both base models and fine-tuned models are welcome
- GGUF quantised models are supported
- Evaluation covers all 200 questions across 5 categories
- Results are typically available within 3-5 business days
About CyberSec Leaderboard
CyberSec Leaderboard is an open initiative to advance AI evaluation in cybersecurity. If you would like to contribute or have questions, reach out via Hugging Face.