Review Benchmark Harness

Reference for the vibe-dev review eval benchmark developer tool. Covers the golden-assessment file format, scoring semantics, CLI flags, and the internal module layout.

The harness evaluates retrieval and assessment quality against curated ground-truth records; it is not part of the runtime Review extension. See doc/architecture/review.md for the extension architecture and review-cli.md for the full CLI surface.

Location

The engine lives in tools/vibe_dev/benchmark.py; the CLI entry point is review_benchmark_command in tools/vibe_dev/review.py. The harness is a developer tool used only from vibe-dev review eval benchmark and its tests, so it lives under tools/vibe_dev/ alongside the other review CLI subcommands rather than in the vibe/review/ runtime package.

Stages

Selected via the BenchmarkStage enum (RETRIEVE, ASSESS, FULL):

retrieve — Run hybrid search + rerank, compare retrieved parts to golden evidence parts via fuzzy similarity matching. Reports recall (exact + broad matches).
assess — Run the full pipeline (retrieve + LLM classification), compare predicted assessment to golden assessment. Reports the compliance + evidence + combined scores described below.
full — Both stages. Retrieval runs once and its ranked parts are reused by the assess stage (no double reranking).
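The reuse described for the full stage can be sketched as follows. This is a minimal illustration rather than the harness's orchestration code: run_stages, the retrieve/assess callables, and the lowercase enum values are assumptions made for the example. Only the BenchmarkStage member names come from this page.

```python
from enum import Enum
from typing import Callable


class BenchmarkStage(Enum):
    # Member names come from the text above; the string values are assumed.
    RETRIEVE = "retrieve"
    ASSESS = "assess"
    FULL = "full"


def run_stages(
    stage: BenchmarkStage,
    retrieve: Callable[[], list[str]],
    assess: Callable[[list[str]], str],
) -> dict[str, object]:
    """Hypothetical orchestrator: retrieve once, then share the ranked parts."""
    ranked_parts = retrieve()  # every stage needs retrieval, so run it exactly once
    report: dict[str, object] = {}
    if stage in (BenchmarkStage.RETRIEVE, BenchmarkStage.FULL):
        report["retrieval"] = ranked_parts
    if stage in (BenchmarkStage.ASSESS, BenchmarkStage.FULL):
        report["assessment"] = assess(ranked_parts)  # reuses the ranked parts
    return report


# Stand-in callables that count how often retrieval is invoked.
calls = {"retrieve": 0}


def fake_retrieve() -> list[str]:
    calls["retrieve"] += 1
    return ["part_a", "part_b"]


def fake_assess(parts: list[str]) -> str:
    return f"assessed {len(parts)} parts"


full_report = run_stages(BenchmarkStage.FULL, fake_retrieve, fake_assess)
```

After the call, calls["retrieve"] is 1 even though both the retrieval and assessment entries are present in full_report, which is the "no double reranking" property in miniature.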

Source formats

find_source_documents picks up .md, .pdf, and .docx files in the golden-record directory (excluding *.golden.* sidecars). setup_session dispatches each file to ingest_markdown, ingest_pdf_file, or ingest_docx_file based on extension, so benchmark cases exercise the same parsing pipeline the workbench uses.
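The discovery and dispatch rules above can be sketched as two small helpers. The helper names (is_source_document, ingestor_for) and the case-sensitive suffix comparison are assumptions for illustration; the real find_source_documents and setup_session may be structured differently. The ingestor names are the ones mentioned above, represented here as plain strings.

```python
from fnmatch import fnmatch
from pathlib import PurePath

# Extension -> name of the ingest function named in the text above.
INGESTORS = {
    ".md": "ingest_markdown",
    ".pdf": "ingest_pdf_file",
    ".docx": "ingest_docx_file",
}


def is_source_document(name: str) -> bool:
    """True for .md/.pdf/.docx files that are not *.golden.* sidecars."""
    if fnmatch(name, "*.golden.*"):
        return False
    return PurePath(name).suffix in INGESTORS


def ingestor_for(name: str) -> str:
    """Pick the ingest function name for a source file by its extension."""
    return INGESTORS[PurePath(name).suffix]
```

For example, is_source_document("contract.golden.md") is rejected as a sidecar, while "contract.md" is accepted and dispatched to ingest_markdown.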

Golden record format (schema 0.4)

Evidence parts are split into two per-requirement blocks that mirror the LLM's stage-2 output:

schema_version: "0.4"
bundle_id: contract_e_risk_saas_en
template_id: nis2
assumptions:
  question_answers: { ... }
requirements:
  sakerhetsatgarder:
    assessment: partial
    primary_evidence:
      dpa_security_measures:
        document: contract_e_risk_saas_dpa_en.md
        title: "2.3 Security Measures"
        text: |
          The Processor shall implement technical and organisational measures …
    supporting_evidence:
      security_encryption:
        document: contract_e_risk_saas_security_en.md
        title: "2.1 Encryption"
        text: |
          …
    rationale: |
      …

primary_evidence contains the part(s) that most clearly determine the verdict — usually one, occasionally more. supporting_evidence reinforces the primary.
A part id must appear in exactly one block. The loader rejects duplicates across blocks.
title is the actual section heading as it appears in the source document (no doc-code prefix like "DPA " — strip those during golden construction).
text is body-only; _with_heading combines it with title at match time (see "Symmetric matching" below).
When a block has no parts, write it as an empty {} rather than omitting it.
GoldenRequirement.evidence_parts returns the primary + supporting union (primary first) and is what retrieval-stage scoring compares against.
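The two loader rules above (no part id in both blocks, and a primary-first union) can be sketched with a small dataclass. GoldenRequirementSketch is a hypothetical stand-in: the field types, the use of ValueError, and validating in __post_init__ are assumptions, not the real GoldenRequirement.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenRequirementSketch:
    assessment: str
    primary_evidence: dict[str, dict] = field(default_factory=dict)
    supporting_evidence: dict[str, dict] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A part id must appear in exactly one block.
        duplicates = self.primary_evidence.keys() & self.supporting_evidence.keys()
        if duplicates:
            raise ValueError(f"part ids in both blocks: {sorted(duplicates)}")

    @property
    def evidence_parts(self) -> dict[str, dict]:
        # Dicts preserve insertion order, so primary parts come first.
        return {**self.primary_evidence, **self.supporting_evidence}


requirement = GoldenRequirementSketch(
    assessment="partial",
    primary_evidence={"dpa_security_measures": {"title": "2.3 Security Measures"}},
    supporting_evidence={"security_encryption": {"title": "2.1 Encryption"}},
)
part_ids = list(requirement.evidence_parts)
```

Here part_ids lists dpa_security_measures before security_encryption, and constructing a requirement with the same id in both blocks raises.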

The golden-assessment skill (.claude/skills/golden-assessment/) encodes the authoring rules (whole-clause rule, primary/supporting split, close-but-miss patterns, etc.).

Assessment-stage scoring

The assess stage reports two graded measures and a combined headline score alongside the traditional exact-match accuracy:

Compliance score — yes/partial/no are treated as an ordinal axis with unit spacing: exact match = 1.0, off-by-one (yes↔partial or partial↔no) = 0.5, off-by-two (yes↔no) = 0.0. not_applicable is off-axis; any mismatch scores 0.
Evidence score — golden parts carry role-based weights (primary = 2.0, supporting = 1.0). Each golden part's earned credit uses a role-transition multiplier against the LLM's own primary/supporting split: exact role = 1.0, primary demoted to supporting = 0.5, supporting promoted to primary = 0.75, missed entirely (not retrieved or not cited) = 0.0. Score = weighted credit / total weight, in [0, 1].
Combined score (headline) = 0.5 × compliance + 0.5 × evidence.
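The three scores above can be worked through in a short sketch. The function names carry a _sketch suffix because they are illustrations, not the compliance_score and score_evidence functions in benchmark.py. Two assumptions are made: cited parts are already matched to golden part ids, and an empty golden set scores 0.0 (the text does not specify that case).

```python
AXIS = {"yes": 0, "partial": 1, "no": 2}  # ordinal axis with unit spacing
ROLE_WEIGHT = {"primary": 2.0, "supporting": 1.0}
ROLE_TRANSITION = {  # (golden role, role the LLM assigned) -> multiplier
    ("primary", "primary"): 1.0,
    ("supporting", "supporting"): 1.0,
    ("primary", "supporting"): 0.5,   # primary demoted to supporting
    ("supporting", "primary"): 0.75,  # supporting promoted to primary
}


def compliance_score_sketch(golden: str, predicted: str) -> float:
    if golden == predicted:
        return 1.0
    if golden not in AXIS or predicted not in AXIS:
        return 0.0  # not_applicable is off-axis: any mismatch scores 0
    return 1.0 - 0.5 * abs(AXIS[golden] - AXIS[predicted])


def evidence_score_sketch(
    golden_roles: dict[str, str], cited_roles: dict[str, str]
) -> float:
    total_weight = sum(ROLE_WEIGHT[role] for role in golden_roles.values())
    if total_weight == 0:
        return 0.0  # assumption: behavior for an empty golden set is unspecified
    earned = 0.0
    for part_id, golden_role in golden_roles.items():
        cited_role = cited_roles.get(part_id)
        if cited_role is None:
            continue  # missed entirely (not retrieved or not cited): no credit
        earned += ROLE_WEIGHT[golden_role] * ROLE_TRANSITION[(golden_role, cited_role)]
    return earned / total_weight


def combined_score_sketch(compliance: float, evidence: float) -> float:
    return 0.5 * compliance + 0.5 * evidence


golden = {"dpa_security_measures": "primary", "security_encryption": "supporting"}
cited = {"dpa_security_measures": "supporting", "security_encryption": "supporting"}
compliance = compliance_score_sketch("partial", "yes")   # off-by-one
evidence = evidence_score_sketch(golden, cited)          # (2.0*0.5 + 1.0*1.0) / 3.0
combined = combined_score_sketch(compliance, evidence)
```

In this example the demoted primary part earns half of its weight of 2.0 and the supporting part earns full credit, so evidence is 2.0 / 3.0; with an off-by-one verdict (0.5) the combined score is roughly 0.583.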

The formatters lead with mean_combined_score as the single-number summary, followed by the two components and accuracy (fraction of requirements where the compliance verdict matched exactly) as a secondary binary reference. Each requirement's evidence breakdown is persisted in the JSON report for post-hoc analysis. See tools/vibe_dev/benchmark.py::compliance_score and score_evidence.

Retrieval-stage matching

Symmetric matching: The benchmark matcher uses combine(heading, body) from vibe/review/document_text.py on both sides before calling classify_match. Golden parts are combined as f"{title}\n\n{text}"; retrieved parts are combined as f"{section_heading}\n\n{content}". This keeps body-identical but heading-different sections distinguishable and mirrors the heading-aware form embeddings and the reranker see.
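A small sketch shows why the heading matters. combine_sketch mirrors the f"{heading}\n\n{body}" form quoted above, but it is not the real combine from vibe/review/document_text.py. The similarity function uses difflib.SequenceMatcher purely as a stand-in for whatever classify_match computes, and the headings and body text are invented sample data.

```python
from difflib import SequenceMatcher


def combine_sketch(heading: str, body: str) -> str:
    """Join heading and body the same way on the golden and retrieved side."""
    return f"{heading}\n\n{body}"


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure; the harness's own matcher may differ.
    return SequenceMatcher(None, a, b).ratio()


body = "Sample clause text shared by two different sections."  # invented
golden = combine_sketch("Section A", body)
same_section = combine_sketch("Section A", body)
other_section = combine_sketch("Appendix Q: Unrelated Heading", body)

body_only = similarity(body, body)
with_same_heading = similarity(golden, same_section)
with_other_heading = similarity(golden, other_section)
```

Comparing bodies alone cannot tell the two sections apart (body_only is 1.0), whereas the combined form scores the differently headed section strictly below the identically headed one.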

Match categories: each retrieved-to-golden comparison is classified as EXACT, BROAD, NARROW, WEAK (low similarity that still clears the threshold), or MISS. Recall counts only EXACT + BROAD as hits.
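The recall rule can be sketched as below. MatchCategorySketch and recall_sketch are hypothetical names, the enum values are assumed, and returning 0.0 for an empty golden set is an assumption. The sketch takes the best category already assigned to each golden part and does not attempt to reproduce how the categories themselves are decided.

```python
from enum import Enum


class MatchCategorySketch(Enum):
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    WEAK = "weak"
    MISS = "miss"


# Only these two categories count as hits for recall.
HIT_CATEGORIES = {MatchCategorySketch.EXACT, MatchCategorySketch.BROAD}


def recall_sketch(best_match: dict[str, MatchCategorySketch]) -> float:
    """Fraction of golden parts whose best match is EXACT or BROAD."""
    if not best_match:
        return 0.0  # assumption: the empty case is not specified in the text
    hits = sum(1 for category in best_match.values() if category in HIT_CATEGORIES)
    return hits / len(best_match)


example_recall = recall_sketch(
    {
        "part_1": MatchCategorySketch.EXACT,
        "part_2": MatchCategorySketch.BROAD,
        "part_3": MatchCategorySketch.NARROW,
        "part_4": MatchCategorySketch.MISS,
    }
)
```

With two of four golden parts matched as EXACT or BROAD, example_recall is 0.5; the NARROW match does not count.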

Multi-model evaluation

--evaluation is repeatable. Each value names an LLM endpoint (a key under llm_endpoints in the template's config.yml). When two or more are given, the retrieve stage runs once, its ranked parts are cached, and the assess stage runs once per endpoint reusing the cache.

BenchmarkReport.assessments_by_model: dict[str, AssessmentReport] holds one report per endpoint (single-entry dict for the normal single-model case).
BenchmarkReport.assessment is a convenience property that returns the sole report in the single-model case and raises for multi-model runs (forcing callers onto the explicit mapping).
Table output shows one assessment block per model followed by a side-by-side comparison table (aggregate scores plus per-requirement predictions).
JSON output carries assessments: {name: report, …} and, when more than one endpoint is evaluated, a top-level comparison payload.
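The relationship between assessments_by_model and the assessment convenience property can be sketched as follows. The Sketch classes are hypothetical stand-ins with a single illustrative field; the exception type (ValueError) and raising for an empty mapping are assumptions, since the text only says the property raises for multi-model runs.

```python
from dataclasses import dataclass, field


@dataclass
class AssessmentReportSketch:
    mean_combined_score: float


@dataclass
class BenchmarkReportSketch:
    assessments_by_model: dict[str, AssessmentReportSketch] = field(default_factory=dict)

    @property
    def assessment(self) -> AssessmentReportSketch:
        # Convenience accessor: only valid when exactly one endpoint was evaluated.
        if len(self.assessments_by_model) != 1:
            raise ValueError(
                "expected exactly one assessment report; use assessments_by_model"
            )
        return next(iter(self.assessments_by_model.values()))


single = BenchmarkReportSketch({"endpoint_a": AssessmentReportSketch(0.8)})
multi = BenchmarkReportSketch(
    {
        "endpoint_a": AssessmentReportSketch(0.8),
        "endpoint_b": AssessmentReportSketch(0.6),
    }
)
single_score = single.assessment.mean_combined_score
```

Accessing multi.assessment raises, which pushes multi-model callers onto the explicit mapping, for example multi.assessments_by_model["endpoint_b"].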

File layout

Sections of tools/vibe_dev/benchmark.py:

Section	Purpose
Corpus loading	Discovers benchmark cases, parses golden assessment YAML
Fuzzy matching	Matches retrieved parts to golden evidence parts
Metrics / output	Computes retrieval recall, assessment accuracy, formats reports
Orchestration	Runs benchmark stages, session setup

Structured logging

Benchmark runs bind batch_run_id, session_id, template_id, requirement_id, and assessment_mode via review_log_context. Multi-model runs also bind evaluation_model so per-endpoint LLM calls can be distinguished in post-hoc analysis. See vibe-dev review logs summarize --run <id>.

CLI

Invoke as vibe-dev review eval benchmark <paths>, with the following options:

Inputs: one or more *.golden-assessment.yml files; pass -r / --recursive to expand directory arguments via **/*.golden-assessment.yml.
Stage selection: --stage retrieve|assess|full (default full).
Retrieval tuning knobs: --rerank-top-n, --search-limit, --bm25-weight, --min-candidate-length, --no-rerank.
Requirement filter: --requirement (repeatable) to run only specific requirement IDs.
Provider overrides: --embedding, --reranking, --evaluation (repeatable, see above).
Output: --json for machine-readable output, -v / --verbose for per-part match detail.
Session reuse: --session <id> to skip ingestion when reusing an existing Review session.

Cross-references

Retrieval pipeline and scoring context: doc/architecture/review.md §5.1.
Golden-assessment authoring rules: .claude/skills/golden-assessment/ (principles, audit checklist).
Heading-aware text composition: vibe/review/document_text.py (combine, part_text, truncate_at_whitespace).