vibe.review.parsing.pipeline¶

Main parsing pipeline orchestrator.

Coordinates the 4-layer pipeline:

1. Extraction: Source → ExtractedWord[]
2. Layout: ExtractedWord[] → LayoutPage[]
3. Structure: LayoutPage[] → DocumentStructure
4. Semantic: DocumentStructure → SemanticDocument

Supports different entry points for different document types:

- PDF: enters at Layer 1 (Extraction)
- Markdown/DOCX/HTML: enter at Layer 3 (Structure) via adapters
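The layering and the two entry points can be illustrated with a toy model. This is only a sketch under stated assumptions: the dataclasses below are hypothetical stand-ins that reuse the layer type names, their fields (`text`, `page`, `lines`, `blocks`, `units`) are invented for illustration, and the per-layer logic is deliberately trivial. It does not reproduce the library's implementation.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the real layer types; all field names are assumptions.
@dataclass
class ExtractedWord:
    text: str
    page: int


@dataclass
class LayoutPage:
    number: int
    lines: list[str] = field(default_factory=list)


@dataclass
class DocumentStructure:
    blocks: list[str] = field(default_factory=list)


@dataclass
class SemanticDocument:
    units: list[str] = field(default_factory=list)


def extract(source: str) -> list[ExtractedWord]:
    # Layer 1: treat each form-feed-separated chunk as one page of words.
    words = []
    for page_no, page_text in enumerate(source.split("\f"), start=1):
        words.extend(ExtractedWord(w, page_no) for w in page_text.split())
    return words


def layout(words: list[ExtractedWord]) -> list[LayoutPage]:
    # Layer 2: group words by page and join them into a single line per page.
    by_page: dict[int, list[str]] = {}
    for word in words:
        by_page.setdefault(word.page, []).append(word.text)
    return [LayoutPage(n, [" ".join(ws)]) for n, ws in sorted(by_page.items())]


def structure_from_layout(pages: list[LayoutPage]) -> DocumentStructure:
    # Layer 3, PDF path: flatten page lines into blocks.
    return DocumentStructure([line for page in pages for line in page.lines])


def structure_from_content(content: str) -> DocumentStructure:
    # Layer 3, adapter path: non-empty lines become blocks directly.
    return DocumentStructure([ln.strip() for ln in content.splitlines() if ln.strip()])


def semantic(doc: DocumentStructure) -> SemanticDocument:
    # Layer 4: a placeholder "analysis" that upper-cases each block.
    return SemanticDocument([block.upper() for block in doc.blocks])


def parse_pdf_like(source: str) -> SemanticDocument:
    # Enters at Layer 1 and runs all four layers.
    return semantic(structure_from_layout(layout(extract(source))))


def parse_text_like(content: str) -> SemanticDocument:
    # Enters at Layer 3, skipping extraction and layout.
    return semantic(structure_from_content(content))
```

Both entry points converge on the same Layer 3 output type, which is why the semantic layer does not need to know where the document came from.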

ParsingPipeline ¶

Main parsing pipeline.

Provides a unified interface for parsing documents of any supported type through the appropriate pipeline layers.

__init__ ¶

__init__(profile: Profile | None = None, rule_engine: RuleEngine | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> None

Initialize the pipeline.

Parameters:

- `profile` (`Profile | None`, default: `None`) – Document family profile to use. Use `get_default_profile()` for defaults.
- `rule_engine` (`RuleEngine | None`, default: `None`) – Rule engine with loaded rules. Auto-created if not provided.
- `ocr_backend` (`OcrBackend | None | object`, default: `_DEFAULT_OCR`) – OCR backend for scanned page extraction. Defaults to `tesseract_docker`. Options: `tesseract_docker`, `paddleocr_docker`. Pass `None` to disable OCR.
- `use_yolo_layout` (`bool`, default: `True`) – Whether to use the YOLO model for layout detection. Set to `False` for faster processing (uses rule-based detection).
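The `OcrBackend | None | object` annotation with a `_DEFAULT_OCR` default suggests a sentinel default, which lets the constructor tell "argument omitted" apart from an explicit `None`. The sketch below shows that general pattern only. The `Pipeline` class is a hypothetical stand-in, and representing backends as plain strings is an assumption made for illustration.

```python
# Hypothetical stand-in: the real _DEFAULT_OCR is defined inside the library.
_DEFAULT_OCR = object()  # unique sentinel meaning "caller did not pass ocr_backend"


class Pipeline:
    def __init__(self, ocr_backend: object = _DEFAULT_OCR) -> None:
        if ocr_backend is _DEFAULT_OCR:
            # Argument omitted: fall back to the documented default backend.
            # Assumption: backends are modelled as strings in this sketch.
            self.ocr_backend = "tesseract_docker"
        else:
            # Explicit value, including None, which disables OCR.
            self.ocr_backend = ocr_backend


default_pipeline = Pipeline()        # uses the default backend
no_ocr_pipeline = Pipeline(None)     # OCR disabled
```

A plain `None` default could not express both meanings, since `None` is already taken as "disable OCR".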

parse ¶

parse(path: Path | str) -> SemanticDocument

Parse a document through the pipeline.

Parameters:	`path` (`Path | str`) – Path to document file.

Returns:	`SemanticDocument` – SemanticDocument with full semantic analysis.

parse_content ¶

parse_content(content: str, source_type: str = 'markdown', source_path: str | None = None) -> SemanticDocument

Parse content string through the pipeline.

Parameters:

- `content` (`str`) – Document content string.
- `source_type` (`str`, default: `'markdown'`) – Type of content ("markdown", "html", "text").
- `source_path` (`str | None`, default: `None`) – Optional path for metadata.

Returns:	`SemanticDocument` – SemanticDocument.

parse_content_with_structure ¶

parse_content_with_structure(content: str, source_type: str = 'markdown', source_path: str | None = None) -> tuple[DocumentStructure, SemanticDocument]

Parse content string and return both structure and semantic document.

Parameters:

- `content` (`str`) – Document content string.
- `source_type` (`str`, default: `'markdown'`) – Type of content ("markdown", "html", "text").
- `source_path` (`str | None`, default: `None`) – Optional path for metadata.

Returns:	`tuple[DocumentStructure, SemanticDocument]` – Tuple of (DocumentStructure, SemanticDocument).

parse_with_structure ¶

parse_with_structure(path: Path | str) -> tuple[DocumentStructure, SemanticDocument]

Parse a document and return both structure and semantic document.

Useful when callers need fallback access to structure blocks when semantic extraction produces no units.

Parameters:	`path` (`Path | str`) – Path to document file.

Returns:	`tuple[DocumentStructure, SemanticDocument]` – Tuple of (DocumentStructure, SemanticDocument).
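The fallback use case mentioned above might look like the following sketch. The two dataclasses are hypothetical stand-ins, and the attribute names `blocks` and `units` are assumptions. Check the real `DocumentStructure` and `SemanticDocument` classes for the actual field names.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins; `blocks` and `units` are assumed attribute names.
@dataclass
class DocumentStructure:
    blocks: list[str] = field(default_factory=list)


@dataclass
class SemanticDocument:
    units: list[str] = field(default_factory=list)


def reviewable_items(structure: DocumentStructure, semantic: SemanticDocument) -> list[str]:
    # Prefer semantic units; fall back to raw structure blocks when there are none.
    return semantic.units if semantic.units else structure.blocks
```

With the real pipeline, the `(structure, semantic)` pair would come from `parse_with_structure(path)` rather than being built by hand.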

extract ¶

extract(path: Path, pages: set[int] | None = None, language: str | None = None) -> ExtractionResult

Run only the extraction layer (PDF only).

Parameters:

- `path` (`Path`) – Path to PDF file.
- `pages` (`set[int] | None`, default: `None`) – Optional set of 1-based page numbers to extract. If None, all pages are extracted.
- `language` (`str | None`, default: `None`) – Optional OCR language code (e.g., "sv", "en").

Returns:	`ExtractionResult` – ExtractionResult with extracted words.
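Because `pages` is 1-based while most Python containers are 0-based, the convention is easy to get wrong. The helper below is a hypothetical illustration of how such a selection could be mapped to indices. It is not part of the library, and silently ignoring out-of-range page numbers is an assumption (the real `extract` may behave differently).

```python
from typing import Optional


def select_page_indices(page_count: int, pages: Optional[set[int]] = None) -> list[int]:
    """Map a 1-based page selection to sorted 0-based indices."""
    if pages is None:
        # No selection: every page is processed.
        return list(range(page_count))
    # Assumption: numbers outside 1..page_count are ignored rather than raising.
    return [n - 1 for n in sorted(pages) if 1 <= n <= page_count]
```

Under this convention, `pages={1}` selects the first page of the PDF, not the second.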

layout ¶

layout(extraction: ExtractionResult, page_progress: Callable[[int, int], None] | None = None) -> list[LayoutPage]

Run layout analysis on extraction result.
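The `page_progress` parameter accepts a `Callable[[int, int], None]`. The source does not say what the two integers mean. The sketch below assumes they are the 1-based number of the page just processed and the total page count. `run_layout` is a hypothetical stand-in for the real method and does no actual layout work.

```python
from typing import Callable, Optional


def run_layout(
    page_count: int,
    page_progress: Optional[Callable[[int, int], None]] = None,
) -> list[int]:
    """Pretend to lay out each page, reporting progress after each one."""
    processed = []
    for index in range(page_count):
        processed.append(index + 1)  # placeholder for real per-page layout work
        if page_progress is not None:
            # Assumed argument order: (pages done so far, total pages).
            page_progress(index + 1, page_count)
    return processed


def print_progress(done: int, total: int) -> None:
    # A minimal callback matching Callable[[int, int], None].
    print(f"layout: page {done}/{total}")
```

Any function with the same two-integer signature, such as a progress-bar update, could be passed in place of `print_progress`.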

structure ¶

structure(layout_pages: list[LayoutPage] | None = None, content: str | None = None, source_type: str = 'pdf', source_path: str | None = None) -> DocumentStructure

Build document structure.

Either from layout pages (PDF path) or content string (adapters).
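The either/or input rule can be sketched as a small dispatch function. This is a hypothetical helper, not the library's code: in particular, rejecting calls that supply both inputs or neither is an assumption, since the source does not describe how `structure` handles those cases.

```python
from typing import Optional


def pick_structure_source(
    layout_pages: Optional[list] = None,
    content: Optional[str] = None,
) -> str:
    """Report which input a structure-building call would use."""
    if layout_pages is not None and content is not None:
        # Assumption: ambiguous input is rejected rather than silently prioritised.
        raise ValueError("pass either layout_pages or content, not both")
    if layout_pages is not None:
        return "layout"  # PDF path: Layer 2 output feeds Layer 3
    if content is not None:
        return "content"  # adapter path: a Markdown/HTML/text string
    raise ValueError("one of layout_pages or content is required")
```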

semantic ¶

semantic(structure: DocumentStructure) -> SemanticDocument

Run semantic extraction on document structure.

get_default_profile ¶

get_default_profile(profile_id: str = 'base') -> Profile

Get a profile by ID from the built-in registry.

This is a convenience factory for loading profiles without manually creating a ProfileRegistry. Use this when you want to configure a pipeline with a standard profile.

Parameters:	`profile_id` (`str`, default: `'base'` ) – Profile ID to load. Defaults to "base".

Returns:	`Profile` – Loaded Profile instance.

Example

    pipeline = ParsingPipeline(profile=get_default_profile())
    pipeline = ParsingPipeline(profile=get_default_profile("regulatory"))

create_pipeline ¶

create_pipeline(profile_id: str | None = None, profile_registry: ProfileRegistry | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> ParsingPipeline

Create a parsing pipeline with optional profile.