Review Benchmark Harness
Reference for the vibe-dev review eval benchmark developer tool. Covers the
golden-assessment file format, scoring semantics, CLI flags, and the
internal module layout.
The harness evaluates retrieval and assessment quality against curated ground-truth records; it is not part of the runtime Review extension. See doc/architecture/review.md for the extension architecture and review-cli.md for the full CLI surface.
Location
tools/vibe_dev/benchmark.py (engine) + CLI entry point review_benchmark_command
in tools/vibe_dev/review.py. The harness is a developer tool used only
from vibe-dev review eval benchmark and its tests, so it lives under
tools/vibe_dev/ alongside the other review CLI subcommands rather than
in the vibe/review/ runtime package.
Stages
Selected via the BenchmarkStage enum (RETRIEVE, ASSESS, FULL):
- retrieve — Run hybrid search + rerank, compare retrieved parts to golden evidence parts via fuzzy similarity matching. Reports recall (exact + broad matches).
- assess — Run the full pipeline (retrieve + LLM classification), compare predicted assessment to golden assessment. Reports the compliance + evidence + combined scores described below.
- full — Both stages. Retrieval runs once and its ranked parts are reused by the assess stage (no double reranking).
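The stage selection above can be sketched roughly as follows. This is an illustration only: the enum member names come from the text, but the enum values, the `run_stages` helper, and its callback signatures are assumptions and not the harness API.

```python
from enum import Enum
from typing import Callable, Optional


class BenchmarkStage(Enum):
    """Stage selector; member names follow the text, values are assumed."""

    RETRIEVE = "retrieve"
    ASSESS = "assess"
    FULL = "full"


def run_stages(
    stage: BenchmarkStage,
    retrieve: Callable[[], list[str]],
    assess: Callable[[list[str]], dict[str, str]],
) -> tuple[Optional[list[str]], Optional[dict[str, str]]]:
    """Hypothetical orchestration helper (not part of the real module).

    Retrieval is invoked exactly once per call; the assess stage consumes
    the same ranked parts, so FULL never reranks twice.
    """
    ranked = retrieve()
    if stage is BenchmarkStage.RETRIEVE:
        return ranked, None
    predictions = assess(ranked)
    if stage is BenchmarkStage.ASSESS:
        # Only the assessment result is reported for this stage.
        return None, predictions
    return ranked, predictions
```

Passing the ranked parts explicitly into `assess` is what makes the reuse visible: there is no second code path that could trigger another rerank.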
Source formats
find_source_documents picks up .md, .pdf, and .docx files in the
golden-record directory (excluding *.golden.* sidecars). setup_session
dispatches each file to ingest_markdown, ingest_pdf_file, or
ingest_docx_file based on extension, so benchmark cases exercise the
same parsing pipeline the workbench uses.
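A minimal sketch of the extension-based dispatch described above. The three ingestion function names are taken from the text; the `INGESTERS` table, `is_sidecar`, and `pick_ingester` are hypothetical helpers, and the real `find_source_documents` and `setup_session` may be structured differently.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional

# Extension -> name of the ingestion function listed in the text.
INGESTERS = {
    ".md": "ingest_markdown",
    ".pdf": "ingest_pdf_file",
    ".docx": "ingest_docx_file",
}


def is_sidecar(path: Path) -> bool:
    """True for *.golden.* sidecar files, which are not source documents."""
    return fnmatch(path.name, "*.golden.*")


def pick_ingester(path: Path) -> Optional[str]:
    """Return the ingester name for a source document, or None if skipped."""
    if is_sidecar(path):
        return None
    return INGESTERS.get(path.suffix)
```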
Golden record format (schema 0.4)
Evidence parts are split into two per-requirement blocks that mirror the LLM's stage-2 output:
    schema_version: "0.4"
    bundle_id: contract_e_risk_saas_en
    template_id: nis2
    assumptions:
      question_answers: { ... }
    requirements:
      sakerhetsatgarder:
        assessment: partial
        primary_evidence:
          dpa_security_measures:
            document: contract_e_risk_saas_dpa_en.md
            title: "2.3 Security Measures"
            text: |
              The Processor shall implement technical and organisational measures …
        supporting_evidence:
          security_encryption:
            document: contract_e_risk_saas_security_en.md
            title: "2.1 Encryption"
            text: |
              …
        rationale: |
          …
- primary_evidence contains the part(s) that most clearly determine the verdict: usually one, occasionally more. supporting_evidence reinforces the primary.
- A part id must appear in exactly one block. The loader rejects duplicates across blocks.
- title is the actual section heading as it appears in the source document (no doc-code prefix like "DPA "; strip those during golden construction). text is body-only; _with_heading combines it with title at match time (see "Symmetric matching" below).
- Both blocks should carry an empty {} rather than being omitted when empty.
- GoldenRequirement.evidence_parts returns the primary + supporting union (primary first) and is what retrieval-stage scoring compares against.
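The duplicate check and the primary-first union can be illustrated with a simplified stand-in. The class and attribute names follow the text, but the field layout, the `GoldenPart` type, and the choice of `ValueError` are assumptions; the real loader may validate elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenPart:
    """Assumed shape of one evidence part (document, title, body text)."""

    document: str
    title: str
    text: str


@dataclass
class GoldenRequirement:
    """Simplified stand-in for the real class, for illustration only."""

    assessment: str
    primary_evidence: dict[str, GoldenPart] = field(default_factory=dict)
    supporting_evidence: dict[str, GoldenPart] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A part id must appear in exactly one block.
        duplicates = self.primary_evidence.keys() & self.supporting_evidence.keys()
        if duplicates:
            raise ValueError(f"part ids in both blocks: {sorted(duplicates)}")

    @property
    def evidence_parts(self) -> dict[str, GoldenPart]:
        # Dicts preserve insertion order, so primary parts come first.
        return {**self.primary_evidence, **self.supporting_evidence}
```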
The golden-assessment skill (.claude/skills/golden-assessment/)
encodes the authoring rules (whole-clause rule, primary/supporting
split, close-but-miss patterns, etc.).
Assessment-stage scoring
The assess stage reports two graded measures and their combined headline score alongside the traditional exact-match accuracy:
- Compliance score: yes/partial/no are treated as an ordinal axis with unit spacing: exact match = 1.0, off-by-one (yes↔partial or partial↔no) = 0.5, off-by-two (yes↔no) = 0.0. not_applicable is off-axis; any mismatch scores 0.
- Evidence score: golden parts carry role-based weights (primary = 2.0, supporting = 1.0). Each golden part's earned credit uses a role-transition multiplier against the LLM's own primary/supporting split: exact role = 1.0, primary demoted to supporting = 0.5, supporting promoted to primary = 0.75, missed entirely (not retrieved or not cited) = 0.0. Score = weighted credit / total weight, in [0, 1].
- Combined score (headline) = 0.5 × compliance + 0.5 × evidence.
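The three measures above can be worked through with a small sketch. The weights and multipliers are the ones stated in the list; the function signatures, the dict-of-roles input shape, and the handling of an empty golden set are assumptions and may not match `compliance_score` and `score_evidence` in the real module.

```python
# Ordinal positions for the on-axis verdicts; not_applicable is off-axis.
AXIS = {"no": 0, "partial": 1, "yes": 2}

# Golden role weights, and multipliers keyed by (golden role, predicted role).
ROLE_WEIGHT = {"primary": 2.0, "supporting": 1.0}
TRANSITION = {
    ("primary", "primary"): 1.0,
    ("supporting", "supporting"): 1.0,
    ("primary", "supporting"): 0.5,   # primary demoted to supporting
    ("supporting", "primary"): 0.75,  # supporting promoted to primary
}


def compliance_score(predicted: str, golden: str) -> float:
    if predicted == golden:
        return 1.0
    if predicted not in AXIS or golden not in AXIS:
        return 0.0  # off-axis mismatch (e.g. not_applicable vs. yes)
    return max(0.0, 1.0 - 0.5 * abs(AXIS[predicted] - AXIS[golden]))


def evidence_score(golden: dict[str, str], predicted: dict[str, str]) -> float:
    """Both arguments map part id -> role ("primary" or "supporting")."""
    total = sum(ROLE_WEIGHT[role] for role in golden.values())
    if total == 0.0:
        return 0.0  # assumption: the text does not define the empty case
    earned = 0.0
    for part_id, golden_role in golden.items():
        predicted_role = predicted.get(part_id)
        if predicted_role is None:
            continue  # missed entirely: earns no credit
        earned += ROLE_WEIGHT[golden_role] * TRANSITION[(golden_role, predicted_role)]
    return earned / total


def combined_score(compliance: float, evidence: float) -> float:
    return 0.5 * compliance + 0.5 * evidence
```

For example, with golden roles `{"a": "primary", "b": "supporting", "c": "supporting"}` and a prediction that cites `a` as supporting, `b` as primary, and misses `c`, the earned credit is 2.0 × 0.5 + 1.0 × 0.75 + 0 = 1.75 out of a total weight of 4.0, giving an evidence score of 0.4375. Paired with a yes-versus-partial verdict (compliance 0.5), the combined score is 0.46875. Predicted parts that are absent from the golden record are ignored in this sketch, since the text states no penalty for them.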
The formatters lead with mean_combined_score as the single-number
summary, followed by the two components and accuracy (fraction of
requirements where the compliance verdict matched exactly) as a
secondary binary reference. Each requirement's evidence breakdown is
persisted in the JSON report for post-hoc analysis. See
tools/vibe_dev/benchmark.py::compliance_score and score_evidence.
Retrieval-stage matching
Symmetric matching: The benchmark matcher uses combine(heading, body) from vibe/review/document_text.py on both sides before calling
classify_match. Golden parts are combined as f"{title}\n\n{text}";
retrieved parts are combined as f"{section_heading}\n\n{content}".
This keeps body-identical but heading-different sections distinguishable
and mirrors the heading-aware form embeddings and the reranker see.
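A short illustration of why combining heading and body on both sides matters. The `combine` below is a stand-in written from the f-string form quoted above; the real function in vibe/review/document_text.py may handle edge cases (such as empty headings) differently.

```python
def combine(heading: str, body: str) -> str:
    """Stand-in for the heading-aware composition: heading, blank line, body."""
    return f"{heading}\n\n{body}"


# Two sections with identical bodies but different headings.
body = "The Processor shall encrypt data at rest."
golden_text = combine("2.1 Encryption", body)
retrieved_same = combine("2.1 Encryption", body)
retrieved_other = combine("4.2 Backups", body)

# Body-only comparison cannot tell the two retrieved sections apart,
# while the combined form matches only the section with the same heading.
same_heading_matches = golden_text == retrieved_same
other_heading_matches = golden_text == retrieved_other
```

Here plain equality stands in for the fuzzy comparison performed by `classify_match`; the point is only that both sides are compared in the same heading-plus-body form.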
Match categories: each retrieved-to-golden comparison is classified as
EXACT, BROAD, NARROW, WEAK (low similarity above threshold), or MISS.
Recall counts only EXACT + BROAD as hits.
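The recall rule can be sketched as follows, assuming the best match category for each golden part has already been computed. The category names come from the text; the enum values, the `recall` helper, and the zero result for an empty input are assumptions.

```python
from enum import Enum


class MatchCategory(Enum):
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    WEAK = "weak"
    MISS = "miss"


# Only these two categories count as retrieval hits.
HIT_CATEGORIES = frozenset({MatchCategory.EXACT, MatchCategory.BROAD})


def recall(best_match: dict[str, MatchCategory]) -> float:
    """best_match maps golden part id -> best category among retrieved parts."""
    if not best_match:
        return 0.0  # assumption: the text does not define the empty case
    hits = sum(1 for category in best_match.values() if category in HIT_CATEGORIES)
    return hits / len(best_match)
```

With four golden parts classified as EXACT, BROAD, NARROW, and MISS, two of the four count as hits, so recall is 0.5.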
Multi-model evaluation
--evaluation is repeatable. Each value names an LLM endpoint (a key
under llm_endpoints in the template's config.yml). When two or more
are given, the retrieve stage runs once, its ranked parts are cached,
and the assess stage runs once per endpoint reusing the cache.
- BenchmarkReport.assessments_by_model: dict[str, AssessmentReport] holds one report per endpoint (single-entry dict for the normal single-model case).
- BenchmarkReport.assessment is a convenience property that returns the sole report in the single-model case and raises for multi-model runs (forcing callers onto the explicit mapping).
- Table output shows one assessment block per model followed by a side-by-side comparison table (aggregate scores plus per-requirement predictions).
- JSON output carries assessments: {name: report, …} and, when multiple, a top-level comparison payload.
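The convenience-property behaviour described above might look like this. The class and attribute names follow the text, but the classes are reduced to the minimum needed, and the exception type and the behaviour for an empty mapping are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AssessmentReport:
    """Reduced to one field for illustration; the real report carries more."""

    mean_combined_score: float


@dataclass
class BenchmarkReport:
    assessments_by_model: dict[str, AssessmentReport] = field(default_factory=dict)

    @property
    def assessment(self) -> AssessmentReport:
        # Valid only when exactly one endpoint was evaluated.
        if len(self.assessments_by_model) != 1:
            raise ValueError(
                "expected exactly one model; use assessments_by_model instead"
            )
        return next(iter(self.assessments_by_model.values()))
```

Raising in the multi-model case means code written for single-model runs fails loudly rather than silently reporting whichever endpoint happened to come first.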
File layout
Sections of tools/vibe_dev/benchmark.py:
| Section | Purpose |
|---|---|
| Corpus loading | Discovers benchmark cases, parses golden assessment YAML |
| Fuzzy matching | Matches retrieved parts to golden evidence parts |
| Metrics / output | Computes retrieval recall, assessment accuracy, formats reports |
| Orchestration | Runs benchmark stages, session setup |
Structured logging
Benchmark runs bind batch_run_id, session_id, template_id,
requirement_id, and assessment_mode via review_log_context.
Multi-model runs also bind evaluation_model so per-endpoint LLM calls
can be distinguished in post-hoc analysis. See vibe-dev review logs summarize --run <id>.
CLI
vibe-dev review eval benchmark <paths> with:
- Inputs: one or more *.golden-assessment.yml files; pass -r/--recursive to expand directory arguments via **/*.golden-assessment.yml.
- Stage selection: --stage retrieve|assess|full (default full).
- Retrieval tuning knobs: --rerank-top-n, --search-limit, --bm25-weight, --min-candidate-length, --no-rerank.
- Requirement filter: --requirement (repeatable) to run only specific requirement IDs.
- Provider overrides: --embedding, --reranking, --evaluation (repeatable, see above).
- Output: --json for machine-readable output, -v/--verbose for per-part match detail.
- Session reuse: --session <id> to skip ingestion when reusing an existing Review session.
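When driving the CLI from a script, a command line can be assembled from the flags listed above. The helper below is hypothetical and covers only a subset of flags; the directory `benchmarks/` and the endpoint names `model_a` and `model_b` are placeholders, not values from any real configuration.

```python
import shlex
from typing import Iterable, Sequence

VALID_STAGES = ("retrieve", "assess", "full")


def build_benchmark_argv(
    paths: Sequence[str],
    *,
    stage: str = "full",
    evaluations: Iterable[str] = (),
    requirements: Iterable[str] = (),
    recursive: bool = False,
    json_output: bool = False,
) -> list[str]:
    """Assemble an argv list for `vibe-dev review eval benchmark`."""
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    argv = ["vibe-dev", "review", "eval", "benchmark", *paths]
    if recursive:
        argv.append("--recursive")
    argv += ["--stage", stage]
    for name in evaluations:  # --evaluation is repeatable
        argv += ["--evaluation", name]
    for requirement_id in requirements:  # --requirement is repeatable
        argv += ["--requirement", requirement_id]
    if json_output:
        argv.append("--json")
    return argv


command = shlex.join(
    build_benchmark_argv(
        ["benchmarks/"],
        stage="assess",
        evaluations=["model_a", "model_b"],
        recursive=True,
        json_output=True,
    )
)
# command == "vibe-dev review eval benchmark benchmarks/ --recursive "
#            "--stage assess --evaluation model_a --evaluation model_b --json"
```

The resulting list can be passed to `subprocess.run` unchanged; building a list rather than a string avoids shell-quoting problems with paths.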
Cross-references
- Retrieval pipeline and scoring context: doc/architecture/review.md §5.1.
- Golden-assessment authoring rules: .claude/skills/golden-assessment/ (principles, audit checklist).
- Heading-aware text composition: vibe/review/document_text.py (combine, part_text, truncate_at_whitespace).