Parsing Pipeline Architecture
Architectural reference for VIBE's parsing pipeline (vibe/review/parsing).
Overview
- Four layers: extraction -> layout -> structure -> semantic.
- PDF uses all layers. Markdown/DOCX/HTML enter at structure via adapters (no dedicated TXT adapter; plain text uses MarkdownAdapter).
- Intermediate Representation (IR) nodes carry spans, provenance, metadata, and stable IDs.
- Rule engine applies YAML rules by layer; highest priority match wins per node.
- Profiles bundle rules and config overrides for document families.
Intermediate Representation (IR)
Location: vibe/review/parsing/ir.py
- IRNode: id, spans, provenance, metadata.
- Span: page, bbox, text_range, source.
- BBox: x0, y0, x1, y1 in points (72 points = 1 inch).
- ProvenanceEvent: rule_id, rule_name, layer, inputs, outputs, confidence, notes, timestamp.
- Stable IDs are deterministic hashes of content, position hints, and ancestor context (generate_stable_id).
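To make the IR shapes concrete, here is a minimal sketch of BBox, Span, and a stable-ID helper. Only the field names come from the list above; the field types, the `width_inches` helper, and the signature and hashing scheme of `generate_stable_id` are assumptions for illustration and may differ from what ir.py actually does.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BBox:
    """Bounding box in points (72 points = 1 inch)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width_inches(self) -> float:
        # Hypothetical convenience helper, not part of the documented IR.
        return (self.x1 - self.x0) / 72.0


@dataclass(frozen=True)
class Span:
    # Field types are assumed; the source only lists the field names.
    page: Optional[int] = None
    bbox: Optional[BBox] = None
    text_range: Optional[tuple[int, int]] = None
    source: Optional[str] = None


def generate_stable_id(
    content: str, position_hint: str, ancestor_ids: tuple[str, ...]
) -> str:
    """Assumed signature: hash content, a position hint, and ancestor context."""
    # A separator that cannot appear in normal text keeps the fields unambiguous.
    payload = "\x1f".join([content, position_hint, *ancestor_ids])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on its inputs, re-parsing an unchanged document yields the same IDs, while moving a node or changing its ancestors yields a different one.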
Layer 1: Extraction (PDF only)
Location: vibe/review/parsing/extraction/
- ExtractedWord captures text, page, bbox, font info, and OCR confidence.
- Extractors:
  - PdfExtractor (text-layer PDFs)
  - OcrExtractor (scanned pages)
  - HybridPdfExtractor (routes per page, then merges results)
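The per-page routing of a hybrid extractor can be sketched as follows. The routing criterion here (fall back to OCR when the text layer yields fewer than `min_words` words) is an assumption, as are the `extract_hybrid` function, the callable-based extractor interface, and the exact ExtractedWord field names; the source only states that routing happens per page and that results are merged.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedWord:
    # Field names are assumed from the description above.
    text: str
    page: int
    bbox: tuple
    font_name: Optional[str] = None
    font_size: Optional[float] = None
    ocr_confidence: Optional[float] = None  # None for text-layer words


def extract_hybrid(
    page_count: int,
    text_layer: Callable[[int], list],
    ocr: Callable[[int], list],
    min_words: int = 5,
) -> list:
    """Route each page to the text-layer or OCR extractor, then merge in page order."""
    words = []
    for page in range(page_count):
        page_words = text_layer(page)
        if len(page_words) < min_words:
            # Assumed heuristic: a near-empty text layer suggests a scanned page.
            page_words = ocr(page)
        words.extend(page_words)
    return words
```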
Layer 2: Layout
Location: vibe/review/parsing/layout/
- LayoutAnalyzer groups words -> lines -> blocks, detects regions and columns.
- Output types: LayoutLine, LayoutBlock, LayoutRegion, LayoutPage.
- Layout rules are applied via RuleEngine (typically to LayoutLine, sometimes to LayoutPage, e.g., YOLO layout detection).
- YOLO layout segmentation can be enabled by a layout rule tag.
- Table structure detection (vibe/review/parsing/layout/table_structure.py) uses the Microsoft Table Transformer model to detect rows, columns, cells, and column headers.
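The words -> lines step can be illustrated with a simple vertical-proximity grouping. This is a sketch of the general technique, not LayoutAnalyzer's actual algorithm; the `Word` class, `group_into_lines`, and the `y_tolerance` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Simplified stand-in for an extracted word; coordinates are in points.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words, y_tolerance=2.0):
    """Group words whose top edge lies within y_tolerance of the line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y0, w.x0)):
        if lines and abs(lines[-1][0].y0 - word.y0) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore left-to-right reading order inside each line.
    for line in lines:
        line.sort(key=lambda w: w.x0)
    return lines
```

A single fixed tolerance is the simplest choice; a real analyzer would likely scale it with font size and treat columns separately.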
Layer 3: Structure
Location: vibe/review/parsing/structure/
- StructureBuilder classifies blocks into StructuredBlock and builds DocumentStructure.
- Heuristics: heading/list detection, signature detection, body font size.
- Structure rules can reclassify, tag, and apply deferred merge/split actions.
- Layout tags and discard flags propagate to StructuredBlock.metadata.
- Adapters: MarkdownAdapter, DocxAdapter, HtmlAdapter.
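One plausible way the body-font-size and heading/list heuristics could fit together is sketched below. The function names, the 1.15 size ratio, the 120-character cap, and the list-marker regex are all assumptions for illustration; StructureBuilder's real thresholds and patterns are not documented here.

```python
import re
from collections import Counter

# Assumed list-marker pattern: "1.", "a)", "(iv)", "-", "*", or a bullet character.
LIST_MARKER_RE = re.compile(r"^(?:\(?[A-Za-z0-9]{1,3}[.)]|[-*\u2022])\s+")


def estimate_body_font_size(blocks):
    """Return the font size that covers the most characters across (text, size) pairs."""
    weights = Counter()
    for text, size in blocks:
        weights[size] += len(text)
    if not weights:
        raise ValueError("cannot estimate body font size from an empty document")
    return weights.most_common(1)[0][0]


def classify_block(text, font_size, body_size, heading_ratio=1.15):
    """Label a block as heading, list_item, or paragraph using size and markers."""
    stripped = text.strip()
    # Short text set noticeably larger than the body is treated as a heading.
    if font_size >= body_size * heading_ratio and len(stripped) < 120:
        return "heading"
    if LIST_MARKER_RE.match(stripped):
        return "list_item"
    return "paragraph"
```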
Layer 4: Semantic
Location: vibe/review/parsing/semantic/
- SemanticExtractor produces SemanticUnit, Definition, CrossReference, Party, and SemanticDocument.
- Handles clause numbering, definition patterns, cross-references, and parties.
- Tags propagate from structure blocks into semantic units.
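Definition patterns and cross-references are typically found with regular expressions. The sketch below shows the idea with two illustrative patterns; the regexes, function names, and tuple return shapes are assumptions and do not reflect SemanticExtractor's actual patterns or its Definition and CrossReference classes.

```python
import re

# Illustrative only: matches straight-quoted terms followed by "means" or "shall mean".
DEFINITION_RE = re.compile(
    r'"(?P<term>[^"]+)"\s+(?:means|shall mean)\s+(?P<body>[^.]+)'
)

# Illustrative only: "Section 2.1", "Clause 4(b)", "Article 7".
CROSS_REF_RE = re.compile(
    r"\b(?P<kind>Section|Clause|Article)\s+(?P<number>\d+(?:\.\d+)*(?:\([a-z]\))?)"
)


def find_definitions(text):
    """Return (term, definition body) pairs found in text."""
    return [
        (m.group("term"), m.group("body").strip())
        for m in DEFINITION_RE.finditer(text)
    ]


def find_cross_references(text):
    """Return (kind, number) pairs such as ("Section", "2.1")."""
    return [(m.group("kind"), m.group("number")) for m in CROSS_REF_RE.finditer(text)]
```

Note that the definition body stops at the first period, so a definition containing a dotted clause number would be cut short; a production pattern would need sentence-aware boundaries.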
Rules System
Location: vibe/review/parsing/rules/
- YAML rules live in vibe/review/parsing/rules_data/{layout,structure,semantic}/.
- Rules are layered and sorted by priority; the highest-priority match wins per node.
- Actions: tag, set_attribute, classify, discard, promote, demote, merge, split.
- Optional target_type/target_types restrict rules to node classes.
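The "highest-priority match wins per node" policy can be sketched as a filter followed by a max. The `Rule` dataclass, its field names, the callable `matches` predicate, and `select_rule` are hypothetical; the real rules are declared in YAML and evaluated by RuleEngine, whose schema is not shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    # Hypothetical in-memory form of a rule after loading it from YAML.
    rule_id: str
    layer: str
    priority: int
    action: str
    matches: Callable[[object], bool]
    target_types: tuple = ()  # empty means "applies to any node class"


def select_rule(node, rules, layer):
    """Return the highest-priority rule of this layer that matches node, or None."""
    candidates = [
        rule
        for rule in rules
        if rule.layer == layer
        and (not rule.target_types or type(node).__name__ in rule.target_types)
        and rule.matches(node)
    ]
    # On equal priority, max() keeps the first candidate in list order.
    return max(candidates, key=lambda rule: rule.priority, default=None)


class LayoutLine:
    """Minimal stand-in for a layout node."""

    def __init__(self, text):
        self.text = text


rules = [
    Rule("tag-footer", "layout", 10, "tag", lambda n: "Page" in n.text, ("LayoutLine",)),
    Rule("drop-page-number", "layout", 50, "discard",
         lambda n: n.text.startswith("Page "), ("LayoutLine",)),
    Rule("structure-only", "structure", 99, "classify", lambda n: True),
]
winner = select_rule(LayoutLine("Page 3"), rules, "layout")
```

Here both layout rules match the line, and `winner` is the `drop-page-number` rule because its priority is higher; the structure rule is never considered at the layout layer.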
Profiles
Location: vibe/review/parsing/profiles/
- ProfileRegistry loads profiles from builtin_dir and optional user_dir.
- Profiles can extend a parent; rules and overrides are merged (parent first).
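Parent-first merging can be sketched as walking the inheritance chain and applying it from the root ancestor down. The `Profile` fields, `resolve_profile`, and the dictionary registry are hypothetical, and the reading that a child's overrides replace the parent's on conflict is an assumption about what "parent first" means.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Profile:
    # Hypothetical shape; the real profile schema may differ.
    name: str
    extends: Optional[str] = None
    rules: list = field(default_factory=list)
    overrides: dict = field(default_factory=dict)


def resolve_profile(name, registry):
    """Flatten a profile and its ancestors, applying parents before children."""
    chain, seen = [], set()
    current = name
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic profile inheritance at {current!r}")
        seen.add(current)
        profile = registry[current]
        chain.append(profile)
        current = profile.extends

    rules, overrides = [], {}
    for profile in reversed(chain):  # root ancestor first
        rules.extend(profile.rules)
        overrides.update(profile.overrides)  # later (child) values win
    return Profile(name=name, rules=rules, overrides=overrides)
```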
Pipeline Orchestrator
Location: vibe/review/parsing/pipeline.py
- ParsingPipeline.parse(path) -> SemanticDocument.
- parse_with_structure(path) -> (DocumentStructure, SemanticDocument).
- parse_content / parse_content_with_structure for string inputs.
- extract, layout, structure, semantic expose layer outputs.
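The relationship between these entry points can be illustrated with a toy stand-in. `SketchPipeline` is hypothetical and is not the real ParsingPipeline: each layer just records that it ran, and the suffix-based dispatch is an assumed way of expressing "PDF uses all layers, other formats enter at structure".

```python
from pathlib import Path


class SketchPipeline:
    """Toy orchestrator; layer methods return placeholder strings."""

    def __init__(self):
        self.trace = []  # names of the layers that ran, in order

    def extract(self, path):
        self.trace.append("extraction")
        return f"words({path})"

    def layout(self, words):
        self.trace.append("layout")
        return f"layout({words})"

    def structure(self, source):
        self.trace.append("structure")
        return f"structure({source})"

    def semantic(self, structure):
        self.trace.append("semantic")
        return f"semantic({structure})"

    def parse_with_structure(self, path):
        if Path(path).suffix.lower() == ".pdf":
            source = self.layout(self.extract(path))
        else:
            # Non-PDF inputs skip extraction and layout (adapter step omitted here).
            source = str(path)
        structure = self.structure(source)
        return structure, self.semantic(structure)

    def parse(self, path):
        # parse is the same run with the structure result dropped.
        return self.parse_with_structure(path)[1]
```

With this toy class, parsing "contract.pdf" records all four layers in `trace`, while parsing "notes.md" records only structure and semantic.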