Review Parsing Pipeline (Developer)

Practical reference for the parsing pipeline and the vibe-dev review parse-doc command.

Quick Start (CLI)

Parse without DB storage (default output is parts):

vibe-dev review parse-doc contract.pdf

Common options:

vibe-dev review parse-doc contract.pdf --level extraction --pages 1-2
vibe-dev review parse-doc contract.pdf --level layout
vibe-dev review parse-doc contract.pdf --level structure
vibe-dev review parse-doc contract.pdf --level semantic
vibe-dev review parse-doc contract.pdf --level parts
vibe-dev review parse-doc contract.pdf --profile legal_contracts
vibe-dev review parse-doc scanned.pdf --ocr-backend tesseract_docker
vibe-dev review parse-doc contract.pdf --display --no-color
vibe-dev review parse-doc contract.pdf --transitions

Validate output against golden records:

vibe-dev review validate-parse contract.pdf
vibe-dev review validate-parse contract.pdf --mode exact
vibe-dev review validate-parse contract.pdf --golden expected.golden.json
vibe-dev review validate-parse example_contracts/ -r

Notes:

  • --level choices: extraction, layout, structure, semantic, parts (default).
  • --pages accepts a single page (1), a list (1,3,5), a range (1-5), or a mix of both (1-3,5,7-10). PDF only.
  • --transitions prints provenance transitions instead of data.
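
The --pages syntax above can be illustrated with a small standalone parser. This is a sketch of the format only, not the CLI's actual implementation; parse_page_spec is a hypothetical name, and the choice to reject empty entries and reversed ranges is an assumption.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-3,5,7-10" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in token:
            # Range entry: both ends are inclusive.
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {token!r}")
            pages.update(range(start, end + 1))
        else:
            # Single page entry.
            page = int(token)
            if page < 1:
                raise ValueError(f"invalid page number: {token!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_spec("1-3,5,7-10"))
```

Duplicates and overlapping ranges collapse because pages are collected in a set before sorting.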

Programmatic Usage

from vibe.review.parsing import ParsingPipeline

pipeline = ParsingPipeline(profile_id="base")
semantic = pipeline.parse("contract.pdf")

Access intermediate layers:

from pathlib import Path

# Reuses the pipeline instance created above.
# Each stage consumes the output of the previous one.
extraction = pipeline.extract(Path("contract.pdf"))
layout_pages = pipeline.layout(extraction)
structure = pipeline.structure(layout_pages)
semantic = pipeline.semantic(structure)

Parse string content:

semantic = pipeline.parse_content("# Title\n\nText...", source_type="markdown")
structure, semantic = pipeline.parse_content_with_structure("<h1>Title</h1>", source_type="html")

Output to DB / Parts

  • SemanticUnit.to_db_fields() returns a dict for DocumentPartModel persistence.
  • Ingestion uses these fields in vibe/review/ingestion.py.
  • parse-doc --level parts outputs part dicts intended for golden record validation and review tooling (not a persisted schema).

Rules (YAML)

Rules are loaded from vibe/review/parsing/rules_data/{layout,structure,semantic}/.

Minimal rule shape:

rule_id: structure_bold_heading
name: "Bold heading"
layer: structure
priority: 110
match:
  predicates:
    - "node.is_bold"
    - "len(node.text) < 80"
actions:
  - action: classify
    block_type: heading

Key points:

  • layer is validated against the allowed values (extraction, layout, structure, semantic).
  • When several rules match a node, the one with the highest priority wins.
  • Optional target_type or target_types restrict rules to node class names (e.g., LayoutLine, StructuredBlock).
  • Predicates are Python expressions with node, ctx, and re. ctx is whatever you pass to RuleEngine(context=...) (empty by default).
  • Predicate helpers: has_verb(text, language="sv"), next(), previous(), node_index(), nodes_count().
  • predicate_function requires RuleEngine(predicate_functions_dir=...) and a predicates.py in that directory.

Actions:

  • tag, set_attribute, classify, discard, promote, demote, merge, split.
  • merge and split are deferred; call apply_deferred_actions() after apply() when running the engine directly.
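
One plausible reason for deferring merge and split is that they change the node list while rules are still iterating over it. The sketch below shows that two-pass pattern on plain strings. The function names mirror the engine methods mentioned above, but these are standalone illustrations, not the engine's API, and the lowercase-start merge condition is invented for the example.

```python
def apply(nodes: list[str], should_merge_with_previous) -> list[int]:
    """First pass: record merge requests without touching the node list."""
    deferred = []
    for index, node in enumerate(nodes):
        if index > 0 and should_merge_with_previous(node):
            deferred.append(index)
    return deferred


def apply_deferred_actions(nodes: list[str], deferred: list[int]) -> list[str]:
    """Second pass: merge from the end so earlier indices stay valid."""
    result = list(nodes)
    for index in sorted(deferred, reverse=True):
        result[index - 1] = result[index - 1] + " " + result[index]
        del result[index]
    return result


lines = [
    "The Supplier shall",
    "deliver the goods.",
    "2. Payment",
    "Fees are due",
    "within 30 days.",
]

# Illustrative condition: a line starting in lowercase continues the previous one.
deferred = apply(lines, lambda text: text[:1].islower())
merged = apply_deferred_actions(lines, deferred)
print(merged)
```

Skipping the second pass leaves the input unchanged, which matches the note above that apply_deferred_actions() has to be called explicitly when running the engine directly.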

Profiles

Profiles bundle rules and config overrides:

profile_id: two_column_law_firm
extends: base
rules:
  - rules_data/layout/header_footer.yml
overrides:
  layout:
    column_gap_threshold: 25

Load with:

from pathlib import Path
from vibe.review.parsing.profiles import ProfileRegistry

registry = ProfileRegistry(
    builtin_dir=Path("vibe/review/parsing/profiles"),
    user_dir=Path("path/to/custom/profiles"),
)
profile = registry.resolve("two_column_law_firm")
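
How extends and overrides combine is not spelled out here. A common reading is that a child profile's overrides are merged recursively over its parent's, so unrelated keys are inherited. The sketch below assumes that behavior; deep_merge, resolve_config, the base profile's values, and min_line_height are all hypothetical and do not describe ProfileRegistry internals.

```python
from copy import deepcopy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(profile_id: str, profiles: dict) -> dict:
    """Follow the extends chain, then apply overrides from root to leaf."""
    chain, seen = [], set()
    current = profile_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic extends chain at {current!r}")
        seen.add(current)
        chain.append(profiles[current])
        current = profiles[current].get("extends")
    config: dict = {}
    for profile in reversed(chain):
        config = deep_merge(config, profile.get("overrides", {}))
    return config


# Illustrative data; the base values are made up for this example.
profiles = {
    "base": {
        "overrides": {"layout": {"column_gap_threshold": 40, "min_line_height": 6}},
    },
    "two_column_law_firm": {
        "extends": "base",
        "overrides": {"layout": {"column_gap_threshold": 25}},
    },
}

print(resolve_config("two_column_law_firm", profiles))
```

Under this assumption the child changes only column_gap_threshold and keeps every other layout setting from base.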

Node Types (Essentials)

  • Extraction: ExtractedWord(text, page, bbox, font_*)
  • Layout: LayoutLine, LayoutBlock(lines, column_index, indent_level, region_type), LayoutPage(page_number, blocks, regions, column_detection)
  • Structure: StructuredBlock(text, block_type, level, list_type, list_marker, line_count, indent_level, is_bold)
  • Semantic: SemanticUnit(number, title, content, part_type, level, parent_id, source_block_ids)

Key Modules

  • vibe/review/parsing/pipeline.py
  • vibe/review/parsing/ir.py
  • vibe/review/parsing/{extraction,layout,structure,semantic}/
  • vibe/review/parsing/rules/ and vibe/review/parsing/rules_data/