Review Parsing Pipeline (Developer)¶
Practical reference for the parsing pipeline and vibe-dev review parse-doc.
Quick Start (CLI)¶
Parse without DB storage (default output is parts):
vibe-dev review parse-doc contract.pdf
Common options:
vibe-dev review parse-doc contract.pdf --level extraction --pages 1-2
vibe-dev review parse-doc contract.pdf --level layout
vibe-dev review parse-doc contract.pdf --level structure
vibe-dev review parse-doc contract.pdf --level semantic
vibe-dev review parse-doc contract.pdf --level parts
vibe-dev review parse-doc contract.pdf --profile legal_contracts
vibe-dev review parse-doc scanned.pdf --ocr-backend tesseract_docker
vibe-dev review parse-doc contract.pdf --display --no-color
vibe-dev review parse-doc contract.pdf --transitions
Validate output against golden records:
vibe-dev review validate-parse contract.pdf
vibe-dev review validate-parse contract.pdf --mode exact
vibe-dev review validate-parse contract.pdf --golden expected.golden.json
vibe-dev review validate-parse example_contracts/ -r
Notes:
- --level choices: extraction, layout, structure, semantic, parts (default).
- --pages supports 1, 1,3,5, 1-5, 1-3,5,7-10 (PDF only).
- --transitions prints provenance transitions instead of data.
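The --pages forms listed above can be expanded as in the following minimal sketch. parse_page_spec is a hypothetical helper, not the CLI's own parser. Deduplicating, sorting, and rejecting page numbers below 1 are assumptions about reasonable behavior.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,5,7-10" into sorted, unique 1-based pages.

    Hypothetical helper: the real CLI may validate or order pages differently.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            # A range such as "7-10" is inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_spec("1-3,5,7-10"))  # [1, 2, 3, 5, 7, 8, 9, 10]
```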
Programmatic Usage¶
from vibe.review.parsing import ParsingPipeline
pipeline = ParsingPipeline(profile_id="base")
semantic = pipeline.parse("contract.pdf")
Access intermediate layers:
from pathlib import Path
extraction = pipeline.extract(Path("contract.pdf"))
layout_pages = pipeline.layout(extraction)
structure = pipeline.structure(layout_pages)
semantic = pipeline.semantic(structure)
Parse string content:
semantic = pipeline.parse_content("# Title\n\nText...", source_type="markdown")
structure, semantic = pipeline.parse_content_with_structure("<h1>Title</h1>", source_type="html")
Output to DB / Parts¶
- SemanticUnit.to_db_fields() returns a dict for DocumentPartModel persistence.
- Ingestion uses these fields in vibe/review/ingestion.py.
- parse-doc --level parts outputs part dicts intended for golden record validation and review tooling (not a persisted schema).
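To illustrate the shape of that conversion, here is a minimal sketch built on a stand-in dataclass. SemanticUnitSketch is hypothetical. Its field names follow the SemanticUnit fields listed in this reference, but the field types and the keys of the returned dict are assumptions, and the real mapping to DocumentPartModel columns may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SemanticUnitSketch:
    """Hypothetical stand-in for SemanticUnit; field types are assumed."""

    number: Optional[str]
    title: Optional[str]
    content: str
    part_type: str
    level: int
    parent_id: Optional[str] = None
    source_block_ids: list[str] = field(default_factory=list)

    def to_db_fields(self) -> dict[str, Any]:
        # Assumed key names: one flat dict entry per field, ready for persistence.
        return {
            "number": self.number,
            "title": self.title,
            "content": self.content,
            "part_type": self.part_type,
            "level": self.level,
            "parent_id": self.parent_id,
            # Copy the list so later changes to the unit do not leak into the row.
            "source_block_ids": list(self.source_block_ids),
        }


unit = SemanticUnitSketch(
    number="1.1",
    title="Definitions",
    content="In this agreement...",
    part_type="clause",
    level=2,
)
fields = unit.to_db_fields()
print(fields["number"], fields["title"])  # 1.1 Definitions
```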
Rules (YAML)¶
Rules are loaded from vibe/review/parsing/rules_data/{layout,structure,semantic}/.
Minimal rule shape:
rule_id: structure_bold_heading
name: "Bold heading"
layer: structure
priority: 110
match:
  predicates:
    - "node.is_bold"
    - "len(node.text) < 80"
actions:
  - action: classify
    block_type: heading
Key points:
- layer is validated (extraction, layout, structure, semantic).
- Highest priority match wins per node.
- Optional target_type or target_types restrict rules to node class names (e.g., LayoutLine, StructuredBlock).
- Predicates are Python expressions with node, ctx, and re. ctx is whatever you pass to RuleEngine(context=...) (empty by default).
- Predicate helpers: has_verb(text, language="sv"), next(), previous(), node_index(), nodes_count().
- predicate_function requires RuleEngine(predicate_functions_dir=...) and a predicates.py in that directory.
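The matching behavior described in these key points can be sketched as follows. This is a simplified illustration, not the RuleEngine implementation. RuleSketch, matches, select_rule, and the structure_numbered_heading rule are hypothetical, and the predicate helpers are left out. Using eval is only reasonable here because rule files are assumed to be trusted input.

```python
import re
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class RuleSketch:
    """Hypothetical, simplified rule: only the fields needed for matching."""

    rule_id: str
    priority: int
    predicates: list[str]


def matches(rule: RuleSketch, node: Any, ctx: dict[str, Any]) -> bool:
    # Each predicate is a Python expression that sees node, ctx, and re.
    # Builtins such as len() remain available because eval() adds
    # __builtins__ to a globals dict that lacks it.
    scope = {"node": node, "ctx": ctx, "re": re}
    return all(bool(eval(expr, scope)) for expr in rule.predicates)


def select_rule(
    rules: list[RuleSketch], node: Any, ctx: Optional[dict[str, Any]] = None
) -> Optional[RuleSketch]:
    """Return the highest-priority rule whose predicates all hold, or None."""
    ctx = {} if ctx is None else ctx
    candidates = [rule for rule in rules if matches(rule, node, ctx)]
    return max(candidates, key=lambda rule: rule.priority, default=None)


rules = [
    RuleSketch("structure_bold_heading", 110, ["node.is_bold", "len(node.text) < 80"]),
    # Hypothetical second rule, added only to show priority resolution.
    RuleSketch("structure_numbered_heading", 90, [r"re.match(r'\d+\.', node.text) is not None"]),
]

node = SimpleNamespace(text="1. Definitions", is_bold=True)
winner = select_rule(rules, node)
print(winner.rule_id)  # structure_bold_heading
```

Both rules match the example node, so the rule with priority 110 wins over the one with priority 90.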
Actions:
- tag, set_attribute, classify, discard, promote, demote, merge, split.
- merge and split are deferred; call apply_deferred_actions() after apply() when running the engine directly.
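One plausible reason to defer merge and split is that they change the number of nodes, which is awkward while a pass is still iterating over them. The sketch below illustrates that two-phase pattern with a hypothetical EngineSketch. The method signatures and the toy matching conditions are assumptions made for illustration, not the real RuleEngine API.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    text: str
    block_type: str = "paragraph"


@dataclass
class EngineSketch:
    """Hypothetical engine: immediate actions run in apply(), merges are queued."""

    pending_merges: list[int] = field(default_factory=list)

    def apply(self, blocks: list[Block]) -> None:
        for index, block in enumerate(blocks):
            if block.text.isupper():
                # Immediate action (classify): safe to do while iterating.
                block.block_type = "heading"
            elif index > 0 and block.text[:1].islower():
                # Deferred action (merge into previous): only recorded here.
                self.pending_merges.append(index)

    def apply_deferred_actions(self, blocks: list[Block]) -> list[Block]:
        queued = set(self.pending_merges)
        merged: list[Block] = []
        for index, block in enumerate(blocks):
            if index in queued and merged:
                merged[-1].text += " " + block.text
            else:
                merged.append(block)
        self.pending_merges.clear()
        return merged


blocks = [
    Block("TERMS"),
    Block("The supplier shall"),
    Block("deliver the goods."),
    Block("Payment is due."),
]
engine = EngineSketch()
engine.apply(blocks)
count_after_apply = len(blocks)  # still 4: the merge has not happened yet
result = engine.apply_deferred_actions(blocks)
print([block.text for block in result])
# ['TERMS', 'The supplier shall deliver the goods.', 'Payment is due.']
```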
Profiles¶
Profiles bundle rules and config overrides:
profile_id: two_column_law_firm
extends: base
rules:
  - rules_data/layout/header_footer.yml
overrides:
  layout:
    column_gap_threshold: 25
Load with:
from pathlib import Path
from vibe.review.parsing.profiles import ProfileRegistry
registry = ProfileRegistry(
    builtin_dir=Path("vibe/review/parsing/profiles"),
    user_dir=Path("path/to/custom/profiles"),
)
profile = registry.resolve("two_column_law_firm")
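How an extends chain with overrides might resolve can be sketched as below. This is an assumption about the semantics (parents first, child values win, nested dicts merged key by key), not a description of ProfileRegistry.resolve. The functions deep_merge and resolve_profile, the shape of the resolved dict, and the contents of the base profile are all hypothetical.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with overrides applied; nested dicts merge key by key."""
    result = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def resolve_profile(profile_id: str, profiles: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Walk the extends chain, then apply rules and overrides from the root parent down."""
    chain: list[dict[str, Any]] = []
    seen: set[str] = set()
    current = profile_id
    while current is not None:
        if current in seen:
            raise ValueError(f"circular extends chain at {current!r}")
        seen.add(current)
        profile = profiles[current]
        chain.append(profile)
        current = profile.get("extends")

    resolved: dict[str, Any] = {"profile_id": profile_id, "rules": [], "config": {}}
    for profile in reversed(chain):
        resolved["rules"].extend(profile.get("rules", []))
        resolved["config"] = deep_merge(resolved["config"], profile.get("overrides", {}))
    return resolved


profiles = {
    # Hypothetical base profile: the rule path and values are made up for the example.
    "base": {
        "profile_id": "base",
        "rules": ["rules_data/structure/headings.yml"],
        "overrides": {"layout": {"column_gap_threshold": 40, "line_tolerance": 2}},
    },
    "two_column_law_firm": {
        "profile_id": "two_column_law_firm",
        "extends": "base",
        "rules": ["rules_data/layout/header_footer.yml"],
        "overrides": {"layout": {"column_gap_threshold": 25}},
    },
}

resolved = resolve_profile("two_column_law_firm", profiles)
print(resolved["config"])  # {'layout': {'column_gap_threshold': 25, 'line_tolerance': 2}}
```

The child profile replaces only column_gap_threshold; the sibling key inherited from the parent survives because the merge is recursive rather than a top-level dict update.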
Node Types (Essentials)¶
- Extraction: ExtractedWord(text, page, bbox, font_*)
- Layout: LayoutLine, LayoutBlock(lines, column_index, indent_level, region_type), LayoutPage(page_number, blocks, regions, column_detection)
- Structure: StructuredBlock(text, block_type, level, list_type, list_marker, line_count, indent_level, is_bold)
- Semantic: SemanticUnit(number, title, content, part_type, level, parent_id, source_block_ids)
Key Modules¶
- vibe/review/parsing/pipeline.py
- vibe/review/parsing/ir.py
- vibe/review/parsing/{extraction,layout,structure,semantic}/
- vibe/review/parsing/rules/ and vibe/review/parsing/rules_data/