CyberSec Leaderboard
Evaluating Large Language Models on real-world cybersecurity knowledge — powered by CyberSec-Bench
- 200 Expert Questions
- 5 Categories
- EN / FR Bilingual
- 3 Models Evaluated
Model Rankings
Models are ranked by Overall Score on CyberSec-Bench. Scores are colour-coded: green ≥ 80%, yellow 60-79%, red < 60%.
1 | AYI-NEDJIMI/CyberSec-Assistant-3B | 61.1 | 40.6 | 34.4 | 33.5 | 38.1 | 42.9 | 1 | 2026-02-20 |
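As a minimal sketch of that colour rule (not the leaderboard's actual rendering code), the thresholds map to bands like this:

```python
def score_colour(score: float) -> str:
    """Map a 0-100 score to the leaderboard's colour band."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"

# Example scores, one per band.
for s in (85.0, 65.0, 42.9):
    print(s, score_colour(s))  # 85.0 green / 65.0 yellow / 42.9 red
```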
Performance Radar: radar chart of per-category scores for each evaluated model.
CyberSec-Bench
CyberSec-Bench is a bilingual (English / French) benchmark designed to evaluate the cybersecurity knowledge of Large Language Models and AI systems. It contains 200 expert-crafted questions spanning five critical domains of cybersecurity, each with detailed reference answers.
The benchmark tests real-world professional-level knowledge, covering international standards, attack frameworks, defensive operations, forensic investigation, and cloud security architectures.
Dataset: AYI-NEDJIMI/CyberSec-Bench
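The dataset can be loaded with the datasets library. A minimal sketch, assuming a train split and illustrative column names (question, reference_answer, category, language) that may differ from the actual schema:

```python
from datasets import load_dataset

# Load CyberSec-Bench from the Hugging Face Hub (split name is an assumption).
ds = load_dataset("AYI-NEDJIMI/CyberSec-Bench", split="train")
print(len(ds))  # expected: 200 questions

# Column names below are assumptions about the schema, not guaranteed fields.
example = ds[0]
print(example.get("category"), example.get("language"))
print(example.get("question"))
```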
Category Breakdown
Compliance (40 questions)
ISO 27001, GDPR/RGPD, NIS2, DORA, AI Act. Tests understanding of regulatory frameworks, risk management, and audit processes.
Offensive Security (40 questions)
MITRE ATT&CK, OWASP Top 10, penetration testing, vulnerability assessment, red team operations, and exploit development.
Defensive Security (40 questions)
SOC operations, incident response, SIEM/SOAR, threat detection, network security monitoring, and security architecture design.
Digital Forensics (40 questions)
Evidence collection & preservation, memory forensics, disk imaging, malware analysis, chain of custody, and timeline reconstruction.
Cloud Security (40 questions)
Azure, AWS, GCP security controls, Zero Trust architecture, DevSecOps, container security, IAM, and cloud compliance.
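For downstream analysis, the breakdown above can be summarised as a simple mapping; this structure is illustrative and is not shipped with the benchmark:

```python
# Five categories of 40 questions each (5 x 40 = 200).
CATEGORY_TOPICS = {
    "Compliance": ["ISO 27001", "GDPR/RGPD", "NIS2", "DORA", "AI Act"],
    "Offensive Security": ["MITRE ATT&CK", "OWASP Top 10", "penetration testing", "red team"],
    "Defensive Security": ["SOC operations", "incident response", "SIEM/SOAR", "threat detection"],
    "Digital Forensics": ["memory forensics", "disk imaging", "malware analysis", "chain of custody"],
    "Cloud Security": ["Azure", "AWS", "GCP", "Zero Trust", "DevSecOps", "IAM"],
}
QUESTIONS_PER_CATEGORY = 40
assert len(CATEGORY_TOPICS) * QUESTIONS_PER_CATEGORY == 200
```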
Difficulty Distribution
| Difficulty | Questions | Share | Description |
|---|---|---|---|
| Intermediate | 60 | 30% | Core concepts and standard procedures |
| Advanced | 80 | 40% | Complex scenarios requiring deep knowledge |
| Expert | 60 | 30% | Cutting-edge challenges and multi-domain reasoning |
Evaluation Methodology
Each model response is scored against expert reference answers using a combination of:
- Semantic similarity to the reference answer
- Key concept coverage (specific technical terms and frameworks mentioned)
- Accuracy verification (factual correctness of claims)
- Completeness assessment (depth and breadth of the response)
Scores are normalised to a 0-100% scale per category, and the Overall Score is the weighted average across all five categories.
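A minimal sketch of how such a composite score could be computed, assuming illustrative component weights, an off-the-shelf embedding model, and hypothetical helper names; this is not the benchmark team's actual pipeline:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative component weights (assumed, not the official weighting).
WEIGHTS = {"similarity": 0.4, "coverage": 0.3, "accuracy": 0.2, "completeness": 0.1}

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption

def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity between the response and the reference answer, clipped to [0, 1]."""
    emb = _embedder.encode([answer, reference], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb[0], emb[1])))

def concept_coverage(answer: str, key_concepts: list[str]) -> float:
    """Fraction of expected technical terms and frameworks mentioned in the response."""
    text = answer.lower()
    return sum(1 for c in key_concepts if c.lower() in text) / max(len(key_concepts), 1)

def question_score(answer, reference, key_concepts, accuracy, completeness) -> float:
    """Weighted 0-100 score; accuracy and completeness (0-1) are assumed to come from a separate judge."""
    parts = {
        "similarity": semantic_similarity(answer, reference),
        "coverage": concept_coverage(answer, key_concepts),
        "accuracy": accuracy,
        "completeness": completeness,
    }
    return 100.0 * sum(WEIGHTS[k] * v for k, v in parts.items())

def overall_score(category_scores: dict[str, float], category_weights: dict[str, float]) -> float:
    """Overall Score: weighted average of the five per-category scores (0-100)."""
    total = sum(category_weights.values())
    return sum(category_weights[c] * s for c, s in category_scores.items()) / total
```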
Submit Your Model for Evaluation
Submit a Hugging Face model to be evaluated on the full CyberSec-Bench benchmark.
1. Enter the Hugging Face model ID below (e.g. organization/model-name)
2. Your submission is recorded and added to our evaluation queue
3. Evaluation is performed manually by the benchmark team
4. Once evaluation is complete, your model will appear on the leaderboard
Note: Evaluation is performed manually. Submit your model and we will evaluate it within a few days. Models must be publicly accessible on Hugging Face and support text generation.
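Before submitting, you can do a rough pre-check of the accessibility and text-generation requirements with huggingface_hub; the helper below is a hypothetical convenience, not part of the official submission flow:

```python
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

def looks_submittable(model_id: str) -> bool:
    """Rough pre-check: the repo exists publicly and is tagged for text generation."""
    try:
        info = model_info(model_id)
    except (RepositoryNotFoundError, GatedRepoError):
        return False  # missing, private, or gated repos cannot be evaluated
    return info.pipeline_tag in ("text-generation", "text2text-generation")

print(looks_submittable("organization/model-name"))  # replace with your model ID
```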
Submission Guidelines
- The model must be publicly accessible on Hugging Face
- The model must support text generation (causal LM or instruction-tuned)
- We evaluate using the model's default generation parameters
- Both base models and fine-tuned models are welcome
- GGUF quantised models are supported
- Evaluation covers all 200 questions across 5 categories
- Results are typically available within 3-5 business days
About CyberSec Leaderboard
CyberSec Leaderboard is an open initiative to advance AI evaluation in cybersecurity. If you would like to contribute or have questions, reach out via Hugging Face.