Parsing Pipeline Architecture

Architectural reference for VIBE's parsing pipeline (vibe/review/parsing).

Overview

  • Four layers: extraction -> layout -> structure -> semantic.
  • PDF passes through all four layers. Markdown, DOCX, and HTML enter at the structure layer via adapters. There is no dedicated TXT adapter; plain text goes through MarkdownAdapter.
  • Intermediate Representation (IR) nodes carry spans, provenance, metadata, and stable IDs.
  • Rule engine applies YAML rules by layer; highest priority match wins per node.
  • Profiles bundle rules and config overrides for document families.

Intermediate Representation (IR)

Location: vibe/review/parsing/ir.py

  • IRNode: id, spans, provenance, metadata.
  • Span: page, bbox, text_range, source.
  • BBox: x0, y0, x1, y1 in points (72 points = 1 inch).
  • ProvenanceEvent: rule_id, rule_name, layer, inputs, outputs, confidence, notes, timestamp.
  • Stable IDs are deterministic hashes of content, position hints, and ancestor context (generate_stable_id).
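
The types above can be pictured with a minimal sketch. The field names for BBox and Span come from this page; the field types, the helper name sketch_stable_id, the SHA-256 hash, and the 16-character truncation are assumptions made for illustration and are not the actual generate_stable_id implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BBox:
    """Bounding box in PDF points (72 points = 1 inch)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width_inches(self) -> float:
        return (self.x1 - self.x0) / 72.0


@dataclass(frozen=True)
class Span:
    """Where a node came from; the field types here are assumed."""
    page: Optional[int]
    bbox: Optional[BBox]
    text_range: Optional[Tuple[int, int]]
    source: str


def sketch_stable_id(content: str, position_hint: str,
                     ancestors: Tuple[str, ...]) -> str:
    """Hypothetical stand-in for generate_stable_id.

    Hashes content, a position hint, and ancestor context so that the
    same inputs always produce the same ID.
    """
    # Join with a separator that is unlikely to occur in document text.
    payload = "\x1f".join([content, position_hint, *ancestors])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


id_a = sketch_stable_id("1. Definitions", "p1:b0", ("doc",))
id_b = sketch_stable_id("1. Definitions", "p1:b0", ("doc",))
id_c = sketch_stable_id("1. Definitions", "p1:b0", ("doc", "annex"))
```

Because the ID depends only on its inputs, re-parsing an unchanged document yields the same IDs, while moving a node under a different ancestor changes its ID.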

Layer 1: Extraction (PDF only)

Location: vibe/review/parsing/extraction/

  • ExtractedWord captures text, page, bbox, font info, and OCR confidence.
  • Extractors:
    • PdfExtractor (text-layer PDFs)
    • OcrExtractor (scanned pages)
    • HybridPdfExtractor (routes per page, then merges results)
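
Per-page routing followed by a merge could look roughly like the sketch below. The routing criterion (fall back to OCR when the text layer yields fewer than min_words words) is an assumption, as are the Word class and the callable-based interface; the real HybridPdfExtractor may decide differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Word:
    """Simplified stand-in for ExtractedWord (hypothetical fields)."""
    text: str
    page: int
    ocr_confidence: Optional[float] = None


def extract_hybrid(page_count: int,
                   text_layer: Callable[[int], List[Word]],
                   ocr: Callable[[int], List[Word]],
                   min_words: int = 5) -> List[Word]:
    """Route each page to one extractor, then merge in page order."""
    merged: List[Word] = []
    for page in range(page_count):
        words = text_layer(page)
        # Assumed heuristic: a near-empty text layer means a scanned page.
        if len(words) < min_words:
            words = ocr(page)
        merged.extend(words)
    return merged


# Fake extractors: page 0 has a text layer, page 1 is a scan.
def fake_text_layer(page: int) -> List[Word]:
    return [Word("native", page) for _ in range(6)] if page == 0 else []


def fake_ocr(page: int) -> List[Word]:
    return [Word("scanned", page, ocr_confidence=0.9)]


words = extract_hybrid(2, fake_text_layer, fake_ocr)
```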

Layer 2: Layout

Location: vibe/review/parsing/layout/

  • LayoutAnalyzer groups words -> lines -> blocks, detects regions and columns.
  • Output types: LayoutLine, LayoutBlock, LayoutRegion, LayoutPage.
  • Layout rules are applied via RuleEngine (typically to LayoutLine, sometimes to LayoutPage, e.g., YOLO layout detection).
  • YOLO layout segmentation can be enabled by a layout rule tag.
  • Table structure detection (vibe/review/parsing/layout/table_structure.py) uses the Microsoft Table Transformer model to detect rows, columns, cells, and column headers.
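
The words -> lines step can be illustrated with a simple baseline-clustering sketch. This is not LayoutAnalyzer's algorithm: the WordBox class, the vertical-center comparison, and the 3-point tolerance are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    """Hypothetical word with a bounding box in points."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words: List[WordBox], y_tolerance: float = 3.0) -> List[str]:
    """Cluster words whose vertical centers are close, then order by x."""
    lines = []  # each entry: {"center": float, "words": [WordBox, ...]}
    by_position = sorted(words, key=lambda w: ((w.y0 + w.y1) / 2, w.x0))
    for word in by_position:
        center = (word.y0 + word.y1) / 2
        # Compare against the center of the first word on the current line.
        if lines and abs(center - lines[-1]["center"]) <= y_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"center": center, "words": [word]})
    return [
        " ".join(w.text for w in sorted(line["words"], key=lambda w: w.x0))
        for line in lines
    ]


sample = [
    WordBox("Next", 10, 120, 40, 132),
    WordBox("world", 50, 101, 85, 113),
    WordBox("Hello", 10, 100, 45, 112),
]
lines = group_into_lines(sample)
```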

Layer 3: Structure

Location: vibe/review/parsing/structure/

  • StructureBuilder classifies blocks into StructuredBlock and builds DocumentStructure.
  • Heuristics: heading/list detection, signature detection, body font size.
  • Structure rules can reclassify, tag, and apply deferred merge/split actions.
  • Layout tags and discard flags propagate to StructuredBlock.metadata.
  • Adapters: MarkdownAdapter, DocxAdapter, HtmlAdapter.
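
As an illustration of how the body font size can drive heading and list detection, here is a minimal sketch. The function names, the 1.15 size ratio, the 120-character cap, and the list-marker pattern are assumptions; StructureBuilder's actual heuristics are not documented on this page.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumed list markers: "-", "*", "1.", "1)", "(a)".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)]|\([a-z]\))\s+")


def estimate_body_font_size(blocks: List[Tuple[str, float]]) -> float:
    """Return the font size that covers the most characters."""
    if not blocks:
        raise ValueError("no blocks to analyze")
    counts: Counter = Counter()
    for text, size in blocks:
        counts[round(size, 1)] += len(text)
    return counts.most_common(1)[0][0]


def classify_block(text: str, size: float, body_size: float,
                   ratio: float = 1.15) -> str:
    """Classify a block as heading, list_item, or paragraph."""
    # Short text set noticeably larger than the body is treated as a heading.
    if size >= body_size * ratio and len(text) < 120:
        return "heading"
    if LIST_MARKER.match(text):
        return "list_item"
    return "paragraph"


blocks = [
    ("1. Definitions", 14.0),
    ("In this Agreement, terms have the meanings given below.", 10.0),
    ("(a) the first item", 10.0),
]
body_size = estimate_body_font_size(blocks)
labels = [classify_block(text, size, body_size) for text, size in blocks]
```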

Layer 4: Semantic

Location: vibe/review/parsing/semantic/

  • SemanticExtractor produces SemanticUnit, Definition, CrossReference, Party, and SemanticDocument.
  • Handles clause numbering, definition patterns, cross-references, and parties.
  • Tags propagate from structure blocks into semantic units.
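
Definition patterns and cross-references are typically matched with regular expressions. The two patterns below are illustrative assumptions (straight-quoted terms followed by "means" or "shall mean", and "Section/Clause/Article" followed by a dotted number); they are not the patterns SemanticExtractor ships with.

```python
import re
from typing import Dict, List

# Assumed pattern: "Term" means <body>.  The body ends at a period that is
# followed by whitespace or the end of the text, so "2.1" stays intact.
DEFINITION_RE = re.compile(
    r'"(?P<term>[^"]+)"\s+(?:means|shall mean)\s+(?P<body>.+?)\.(?:\s|$)'
)

# Assumed pattern: Section 2.1, Clause 4, Article 10.3.2, ...
CROSS_REF_RE = re.compile(
    r"\b(?:Section|Clause|Article)\s+(?P<target>\d+(?:\.\d+)*)"
)


def find_definitions(text: str) -> List[Dict[str, str]]:
    return [m.groupdict() for m in DEFINITION_RE.finditer(text)]


def find_cross_references(text: str) -> List[str]:
    return [m.group("target") for m in CROSS_REF_RE.finditer(text)]


clause = '"Services" means the work described in Section 2.1. See also Clause 4.'
definitions = find_definitions(clause)
references = find_cross_references(clause)
```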

Rules System

Location: vibe/review/parsing/rules/

  • YAML rules live in vibe/review/parsing/rules_data/{layout,structure,semantic}/.
  • Rules are layered and sorted by priority; highest priority match wins per node.
  • Actions: tag, set_attribute, classify, discard, promote, demote, merge, split.
  • Optional target_type/target_types restrict rules to node classes.
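
The "highest priority match wins per node" behavior can be sketched as follows. The Rule class, its field names, the dict-based node, and the convention that a larger number means higher priority are assumptions for illustration; the rule IDs are invented, and the real rules are loaded from YAML rather than built in Python.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical in-memory form of a parsed YAML rule."""
    id: str
    priority: int
    matches: Callable[[dict], bool]
    action: str
    target_types: Tuple[str, ...] = ()  # empty means "any node type"


def select_rule(node: dict, rules: List[Rule]) -> Optional[Rule]:
    """Return the highest-priority rule that applies to the node, if any."""
    candidates = [
        rule for rule in rules
        if (not rule.target_types or node["type"] in rule.target_types)
        and rule.matches(node)
    ]
    # Assumption: larger numbers mean higher priority.
    return max(candidates, key=lambda rule: rule.priority, default=None)


rules = [
    Rule("footer-generic", 10, lambda n: n["text"].startswith("Page"), "tag"),
    Rule("page-number", 50, lambda n: " of " in n["text"], "discard"),
    Rule("page-only", 90, lambda n: True, "tag", target_types=("LayoutPage",)),
]

line_node = {"type": "LayoutLine", "text": "Page 3 of 10"}
page_node = {"type": "LayoutPage", "text": ""}
plain_node = {"type": "LayoutLine", "text": "Hello"}

winner_for_line = select_rule(line_node, rules)
winner_for_page = select_rule(page_node, rules)
winner_for_plain = select_rule(plain_node, rules)
```

Here "page-only" has the highest priority but never applies to line_node, because its target_types restrict it to LayoutPage nodes.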

Profiles

Location: vibe/review/parsing/profiles/

  • ProfileRegistry loads profiles from builtin_dir and optional user_dir.
  • Profiles can extend a parent; rules and overrides are merged (parent first).
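
Parent-first merging might work along these lines. The dictionary layout, the "extends" key, the profile names, and the rule paths are hypothetical, and the sketch assumes that a child's override replaces the parent's value for the same key, which this page does not state.

```python
from typing import Dict


def resolve_profile(name: str, profiles: Dict[str, dict]) -> dict:
    """Merge a profile with its ancestors, applying the parent first."""
    profile = profiles[name]
    parent = profile.get("extends")
    if parent is None:
        base = {"rules": [], "overrides": {}}
    else:
        base = resolve_profile(parent, profiles)
    return {
        # Parent rules come first, child rules are appended.
        "rules": base["rules"] + profile.get("rules", []),
        # Assumption: the child's value wins when both define a key.
        "overrides": {**base["overrides"], **profile.get("overrides", {})},
    }


profiles = {
    "base": {
        "rules": ["layout/headers", "structure/headings"],
        "overrides": {"ocr": False, "heading_ratio": 1.15},
    },
    "contracts": {
        "extends": "base",
        "rules": ["semantic/definitions"],
        "overrides": {"ocr": True},
    },
}

resolved = resolve_profile("contracts", profiles)
```

A production version would also need to reject cyclic "extends" chains, which this sketch does not check.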

Pipeline Orchestrator

Location: vibe/review/parsing/pipeline.py

  • ParsingPipeline.parse(path) -> SemanticDocument.
  • parse_with_structure(path) -> (DocumentStructure, SemanticDocument).
  • parse_content and parse_content_with_structure accept string inputs instead of a path.
  • extract, layout, structure, and semantic expose the individual layer outputs.

Data Flow

PDF -> Extraction -> Layout -> Structure -> Semantic
DOCX/Markdown/HTML -> Structure (adapter) -> Semantic
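
The two entry paths can be summarized with a small dispatch sketch. The adapter names come from this page; the extension-to-adapter mapping and the plan_layers function are assumptions, and the real pipeline may detect formats differently.

```python
from pathlib import Path
from typing import List

# Assumed mapping from file extension to adapter name.
ADAPTERS = {
    ".md": "MarkdownAdapter",
    ".txt": "MarkdownAdapter",  # no dedicated TXT adapter
    ".docx": "DocxAdapter",
    ".html": "HtmlAdapter",
}


def plan_layers(path: str) -> List[str]:
    """Return the layers a document would pass through, in order."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return ["extraction", "layout", "structure", "semantic"]
    if suffix in ADAPTERS:
        return [f"structure ({ADAPTERS[suffix]})", "semantic"]
    raise ValueError(f"unsupported format: {suffix or path}")


pdf_plan = plan_layers("contract.pdf")
txt_plan = plan_layers("notes.TXT")
```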