vibe.review.parsing.pipeline

Main parsing pipeline orchestrator.

Coordinates the 4-layer pipeline:
  1. Extraction: Source → ExtractedWord[]
  2. Layout: ExtractedWord[] → LayoutPage[]
  3. Structure: LayoutPage[] → DocumentStructure
  4. Semantic: DocumentStructure → SemanticDocument

Supports different entry points for different document types:
  • PDF: Enters at Layer 1 (Extraction)
  • Markdown/DOCX/HTML: Enter at Layer 3 (Structure) via adapters
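To make the entry points concrete, the following is a minimal sketch of how a file type could map to the layers it passes through. The extension table and the `layers_for` helper are assumptions for illustration only; they are not part of this module, and the real pipeline may detect document types differently.

```python
from __future__ import annotations

from pathlib import Path

# Assumed extension-to-entry-layer mapping, inferred from the entry points
# listed above. Hypothetical: the real pipeline may use other detection logic.
ENTRY_LAYER = {
    ".pdf": 1,   # enters at Extraction
    ".md": 3,    # enters at Structure via an adapter
    ".docx": 3,
    ".html": 3,
}

LAYER_NAMES = {1: "Extraction", 2: "Layout", 3: "Structure", 4: "Semantic"}


def layers_for(path: str | Path) -> list[str]:
    """Return the names of the layers a document would pass through."""
    suffix = Path(path).suffix.lower()
    if suffix not in ENTRY_LAYER:
        raise ValueError(f"unsupported document type: {suffix!r}")
    first = ENTRY_LAYER[suffix]
    # Every document continues from its entry layer through Layer 4.
    return [LAYER_NAMES[n] for n in range(first, 5)]
```

Under these assumptions a PDF runs through all four layers, while adapter-based formats skip Extraction and Layout.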

ParsingPipeline

Main parsing pipeline.

Provides a unified interface for parsing documents of any supported type through the appropriate pipeline layers.

__init__

__init__(profile: Profile | None = None, rule_engine: RuleEngine | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> None

Initialize the pipeline.

Parameters:
  • profile (Profile | None, default: None ) –

    Document family profile to use. Use get_default_profile() for defaults.

  • rule_engine (RuleEngine | None, default: None ) –

    Rule engine with loaded rules. Auto-created if not provided.

  • ocr_backend (OcrBackend | None | object, default: _DEFAULT_OCR ) –

    OCR backend for scanned page extraction. Defaults to tesseract_docker. Options: tesseract_docker, paddleocr_docker. Pass None to disable OCR.

  • use_yolo_layout (bool, default: True ) –

    Whether to use YOLO model for layout detection. Set to False for faster processing (uses rule-based detection).
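The `object` member of the `ocr_backend` type and the `_DEFAULT_OCR` default suggest a sentinel value, which lets the constructor tell "argument omitted" apart from an explicit `None`. The sketch below illustrates that general pattern; the sentinel and the `resolve_ocr_backend` helper are stand-ins defined here, not the module's actual implementation.

```python
from __future__ import annotations

# Hypothetical stand-in for the module's private sentinel. A unique object
# can never be confused with None or with a real backend.
_DEFAULT_OCR = object()


def resolve_ocr_backend(ocr_backend: object | None = _DEFAULT_OCR) -> str | None:
    """Map the argument to a backend name, or None when OCR is disabled."""
    if ocr_backend is _DEFAULT_OCR:
        # Argument omitted: fall back to the documented default.
        return "tesseract_docker"
    if ocr_backend is None:
        # Explicit None: OCR is disabled.
        return None
    # Caller supplied a backend; this sketch just reports its name.
    return str(ocr_backend)
```

With a plain `None` default the two cases would be indistinguishable, so "disable OCR" could not be expressed without an extra flag.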

parse

parse(path: Path | str) -> SemanticDocument

Parse a document through the pipeline.

Parameters:
  • path (Path | str) –

    Path to document file.

Returns:
  • SemanticDocument

parse_content

parse_content(content: str, source_type: str = 'markdown', source_path: str | None = None) -> SemanticDocument

Parse content string through the pipeline.

Parameters:
  • content (str) –

    Document content string.

  • source_type (str, default: 'markdown' ) –

    Type of content ("markdown", "html", "text").

  • source_path (str | None, default: None ) –

    Optional path for metadata.

Returns:
  • SemanticDocument
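A caller may want to reject unknown `source_type` values before handing content to the pipeline. The wrapper below is a sketch under that assumption: `parse_snippet` is a hypothetical helper, `pipeline` is any object exposing the documented `parse_content` method, and how the pipeline itself reacts to an unsupported type is not documented here.

```python
from __future__ import annotations

from typing import Any

# The three values listed for source_type in the parameter description.
SUPPORTED_SOURCE_TYPES = ("markdown", "html", "text")


def parse_snippet(
    pipeline: Any,
    content: str,
    source_type: str = "markdown",
    source_path: str | None = None,
) -> Any:
    """Validate source_type, then delegate to pipeline.parse_content."""
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(
            f"source_type must be one of {SUPPORTED_SOURCE_TYPES}, got {source_type!r}"
        )
    return pipeline.parse_content(
        content, source_type=source_type, source_path=source_path
    )
```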

parse_content_with_structure

parse_content_with_structure(content: str, source_type: str = 'markdown', source_path: str | None = None) -> tuple[DocumentStructure, SemanticDocument]

Parse content string and return both structure and semantic document.

Parameters:
  • content (str) –

    Document content string.

  • source_type (str, default: 'markdown' ) –

    Type of content ("markdown", "html", "text").

  • source_path (str | None, default: None ) –

    Optional path for metadata.

Returns:
  • tuple[DocumentStructure, SemanticDocument]

parse_with_structure

parse_with_structure(path: Path | str) -> tuple[DocumentStructure, SemanticDocument]

Parse a document and return both structure and semantic document.

Useful when callers need the structure blocks as a fallback because semantic extraction produced no units.

Parameters:
  • path (Path | str) –

    Path to document file.

Returns:
  • tuple[DocumentStructure, SemanticDocument]
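The fallback use case described for `parse_with_structure` could look roughly like the sketch below. The attribute names `units` (on the semantic document) and `blocks` (on the structure) are assumptions made for illustration, since this page does not document the fields of either type; `units_or_blocks` is likewise a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def units_or_blocks(pipeline: Any, path: Any) -> list[Any]:
    """Return semantic units, or structure blocks when there are no units."""
    structure, document = pipeline.parse_with_structure(path)

    # Assumed attribute name: the real SemanticDocument may differ.
    units = getattr(document, "units", None)
    if units:
        return list(units)

    # Assumed attribute name: the real DocumentStructure may differ.
    return list(getattr(structure, "blocks", []))
```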

extract

extract(path: Path, pages: set[int] | None = None, language: str | None = None) -> ExtractionResult

Run only the extraction layer (PDF only).

Parameters:
  • path (Path) –

    Path to PDF file.

  • pages (set[int] | None, default: None ) –

    Optional set of 1-based page numbers to extract. If None, all pages are extracted.

  • language (str | None, default: None ) –

    Optional OCR language code (e.g., "sv", "en").

Returns:
  • ExtractionResult

layout

layout(extraction: ExtractionResult, page_progress: Callable[[int, int], None] | None = None) -> list[LayoutPage]

Run layout analysis on an extraction result.

structure

structure(layout_pages: list[LayoutPage] | None = None, content: str | None = None, source_type: str = 'pdf', source_path: str | None = None) -> DocumentStructure

Build document structure.

Builds from either layout pages (PDF path) or a content string (adapter path).

semantic

semantic(structure: DocumentStructure) -> SemanticDocument

Run semantic extraction on document structure.
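The four stage methods can be chained by hand when intermediate results are needed. The sketch below does this for the PDF path using only the signatures documented above; `run_stages` is a hypothetical helper, `pipeline` is any object exposing those methods, and the meaning of the two integers passed to `page_progress` is not documented here, so the callback only records them.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def run_stages(
    pipeline: Any,
    pdf_path: Path,
    pages: set[int] | None = None,
    language: str | None = None,
) -> tuple[Any, list[tuple[int, int]]]:
    """Run the four layers in order and return (document, progress calls)."""
    progress: list[tuple[int, int]] = []

    def on_page(first: int, second: int) -> None:
        # Record the callback arguments exactly as received.
        progress.append((first, second))

    # Layer 1: Source -> ExtractionResult (pages are 1-based, None means all).
    extraction = pipeline.extract(pdf_path, pages=pages, language=language)
    # Layer 2: ExtractionResult -> list[LayoutPage].
    layout_pages = pipeline.layout(extraction, page_progress=on_page)
    # Layer 3: list[LayoutPage] -> DocumentStructure.
    structure = pipeline.structure(
        layout_pages=layout_pages, source_type="pdf", source_path=str(pdf_path)
    )
    # Layer 4: DocumentStructure -> SemanticDocument.
    document = pipeline.semantic(structure)
    return document, progress
```

For adapter-based formats the first two calls would be skipped and `structure` would receive `content` instead of `layout_pages`.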

get_default_profile

get_default_profile(profile_id: str = 'base') -> Profile

Get a profile by ID from the built-in registry.

This is a convenience factory for loading profiles without manually creating a ProfileRegistry. Use this when you want to configure a pipeline with a standard profile.

Parameters:
  • profile_id (str, default: 'base' ) –

    Profile ID to load. Defaults to "base".

Returns:
  • Profile

    Loaded Profile instance.

Example

    pipeline = ParsingPipeline(profile=get_default_profile())
    pipeline = ParsingPipeline(profile=get_default_profile("regulatory"))

create_pipeline

create_pipeline(profile_id: str | None = None, profile_registry: ProfileRegistry | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> ParsingPipeline

Create a parsing pipeline with optional profile.

Parameters:
  • profile_id (str | None, default: None ) –

    ID of profile to use.

  • profile_registry (ProfileRegistry | None, default: None ) –

    Registry to load profile from.

  • ocr_backend (OcrBackend | None | object, default: _DEFAULT_OCR ) –

    OCR backend for scanned page extraction. Defaults to tesseract_docker. Options: tesseract_docker, paddleocr_docker. Pass None to disable OCR.

  • use_yolo_layout (bool, default: True ) –

    Whether to use YOLO model for layout detection. Set to False for faster processing (uses rule-based detection).

Returns:
  • ParsingPipeline