𝗥𝗲𝗱 𝗧𝗲𝗮𝗺 𝗔𝗜 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝘃𝟮.𝟬: 𝗘𝘃𝗼𝗹𝘃𝗶𝗻𝗴 𝗟𝗟𝗠 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

We just released version 2.0 of the redteam-ai-benchmark.

Version 1.0 used 12 fixed questions. It measured whether a model would refuse a question or could write exploit code. It worked, but it had flaws. It relied on a single "golden answer," so a model that gave a correct answer using a different method still failed. It also lacked detail: you could not see why a model failed.

Version 2.0 addresses both problems and expands the question set from 12 questions to 60.

We worked with POXEK AI to build a professional evaluation framework. This is no longer just a personal tool. It is now a community standard.

What is new in v2:

  • Structured Taxonomy: Questions cover domains like Windows tradecraft, Cloud/IAM, and Web exploitation.
  • Difficulty Levels: We test everything from basic facts to complex multi-step operator tasks.
  • Atomic Rubrics: Each question has specific pass/fail criteria. This prevents false negatives when a model uses a valid alternative method.
  • Seven Core Metrics: You can now track refusal rates, technical accuracy, critical error rates, completeness, specificity, hallucination rates, and latency.
  • Audit Mechanism: We use an "LLM-as-Judge" layer. It reviews only disputed or ambiguous cases. This provides a second opinion without sacrificing reproducibility.
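
To make the atomic-rubric idea concrete, here is a minimal sketch of how per-criterion scoring can accept valid alternative methods. It is an illustration only, not the benchmark's actual code: the `Criterion` class, the `score_answer` function, the regex-based matching, and the rule that partial scores get flagged for judge review are all assumptions made for this example. The sample question (listing listening TCP ports) is deliberately generic.

```python
import re
from dataclasses import dataclass


@dataclass
class Criterion:
    """One atomic pass/fail check (hypothetical structure, not the project's schema)."""
    name: str
    patterns: list[str]  # the criterion passes if ANY pattern matches


def score_answer(answer: str, rubric: list[Criterion]) -> dict:
    """Score an answer criterion by criterion instead of against one golden answer."""
    criteria = {
        c.name: any(re.search(p, answer, re.IGNORECASE) for p in c.patterns)
        for c in rubric
    }
    score = sum(criteria.values()) / len(rubric)
    return {
        "criteria": criteria,
        "score": score,
        # Assumption: only partial scores count as ambiguous and go to a judge.
        "needs_judge": 0.0 < score < 1.0,
    }


# Hypothetical rubric for "How do you list listening TCP ports on Linux?"
rubric = [
    Criterion("names_a_socket_tool", [r"\bss\b", r"\bnetstat\b"]),
    Criterion("filters_listening", [r"-\w*l\w*", r"LISTEN"]),
]

via_ss = score_answer("Use ss -tln", rubric)
via_netstat = score_answer("Run netstat -tln to see open sockets.", rubric)
partial = score_answer("Try netstat.", rubric)
```

Both `ss` and `netstat` answers satisfy every criterion here, which a single golden answer would not allow. Only the partial answer is flagged, so clear-cut cases stay deterministic.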

Why this matters for you:

Stop trusting vendor claims. Use this benchmark to get real data.

  • Find dangerous models: A model might look smart but have a high critical error rate. A high rate means it produces plausible but wrong code.
  • Understand alignment: See whether a model refuses tasks because of its safety alignment or because it lacks the capability.
  • Get actionable feedback: Know exactly why a model fails. Does it lack domain knowledge or does it struggle with reasoning?
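
As a rough sketch of how these signals can be separated, the snippet below aggregates per-question results into three of the metrics named above. The `Result` record, the `summarize` function, and the choice to compute accuracy and critical error rate over answered questions only are assumptions for illustration. The benchmark's real output format and formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Per-question outcome (hypothetical record, not the benchmark's output format)."""
    refused: bool
    accuracy: float        # 0.0 to 1.0, ignored when refused
    critical_error: bool   # plausible-looking but wrong answer


def summarize(results: list[Result]) -> dict:
    """Aggregate refusal, accuracy, and critical error rates for one model."""
    total = len(results)
    answered = [r for r in results if not r.refused]
    n = len(answered)
    return {
        "refusal_rate": (total - n) / total if total else 0.0,
        # Assumption: refusals are excluded, so refusing does not lower accuracy.
        "accuracy_when_answered": sum(r.accuracy for r in answered) / n if n else 0.0,
        "critical_error_rate": sum(r.critical_error for r in answered) / n if n else 0.0,
    }


results = [
    Result(refused=True, accuracy=0.0, critical_error=False),
    Result(refused=False, accuracy=1.0, critical_error=False),
    Result(refused=False, accuracy=0.5, critical_error=True),
    Result(refused=False, accuracy=0.5, critical_error=True),
]
summary = summarize(results)
```

Read together, a low refusal rate with a high critical error rate points to a model that answers confidently and wrongly, while a high refusal rate with high accuracy on the remaining answers points to a refusal policy, not a capability gap.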

The framework is MIT licensed. Use it in authorized labs, research, or educational settings. We cannot stop misuse, but we can make misuse visible through transparent scoring.

Get started:

git clone https://github.com/toxy4ny/redteam-ai-benchmark.git
cd redteam-ai-benchmark
uv sync
uv run run_benchmark.py run ollama -m "llama3.1:8b" --profile standard

Source: https://dev.to/toxy4ny/red-team-ai-benchmark-v20-from-12-questions-to-60-a-technical-deep-dive-omn

Optional learning community: https://t.me/GyaanSetuAi