vibe.review.parsing.pipeline
Main parsing pipeline orchestrator.
Coordinates the 4-layer pipeline:
1. Extraction: Source → ExtractedWord[]
2. Layout: ExtractedWord[] → LayoutPage[]
3. Structure: LayoutPage[] → DocumentStructure
4. Semantic: DocumentStructure → SemanticDocument
Supports different entry points for different document types:
- PDF: Enters at Layer 1 (Extraction)
- Markdown/DOCX/HTML: Enter at Layer 3 (Structure) via adapters
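The layering and entry points described above can be pictured with a small, self-contained sketch. It does not import the real package: `LAYER_NAMES`, `ENTRY_LAYER`, and `layers_for` are hypothetical names, and the mapping from file suffix to entry layer is an assumption made for illustration only.

```python
from pathlib import Path

# The four layers, in the order the module docstring lists them.
LAYER_NAMES = ["extraction", "layout", "structure", "semantic"]

# Assumed mapping from file suffix to the 1-based layer where parsing starts.
# The real pipeline may detect document types differently.
ENTRY_LAYER = {".pdf": 1, ".md": 3, ".docx": 3, ".html": 3}


def layers_for(path):
    """Return the names of the layers a document of this type would pass through."""
    suffix = Path(path).suffix.lower()
    if suffix not in ENTRY_LAYER:
        raise ValueError(f"unsupported document type: {suffix!r}")
    start = ENTRY_LAYER[suffix]
    # Layers are 1-based in the docs, so shift by one for list slicing.
    return LAYER_NAMES[start - 1:]
```

Under these assumptions a PDF runs through all four layers, while adapter-based formats skip extraction and layout and start at the structure layer.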
ParsingPipeline
Main parsing pipeline.
Provides a unified interface for parsing documents of any supported type through the appropriate pipeline layers.
__init__
__init__(profile: Profile | None = None, rule_engine: RuleEngine | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> None
Initialize the pipeline.
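The `ocr_backend` parameter defaults to a sentinel (`_DEFAULT_OCR`) rather than `None`, a common Python idiom for telling "argument omitted" apart from "`None` passed explicitly". The sketch below shows that idiom in isolation. The `OcrBackend` class is a stand-in, `resolve_ocr_backend` is a hypothetical helper, and the meaning attached to each case (omitted selects a built-in backend, `None` disables OCR) is an assumption, not documented behavior.

```python
class OcrBackend:
    """Hypothetical stand-in for the real OCR backend type."""

    def __init__(self, name):
        self.name = name


# A unique object that no caller can pass by accident.
_DEFAULT_OCR = object()


def resolve_ocr_backend(ocr_backend=_DEFAULT_OCR):
    """Distinguish an omitted argument from an explicit None."""
    if ocr_backend is _DEFAULT_OCR:
        # Assumption: omitting the argument selects some built-in backend.
        return OcrBackend("default")
    # Either None (assumed here to mean "no OCR") or a caller-supplied backend.
    return ocr_backend
```

The identity check (`is`) matters here: comparing with `==` could be fooled by objects that define their own equality.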
parse
parse(path: Path | str) -> SemanticDocument
Parse a document through the pipeline.
parse_content
parse_content(content: str, source_type: str = 'markdown', source_path: str | None = None) -> SemanticDocument
Parse a content string through the pipeline.
parse_content_with_structure
parse_content_with_structure(content: str, source_type: str = 'markdown', source_path: str | None = None) -> tuple[DocumentStructure, SemanticDocument]
Parse a content string and return both the structure and the semantic document.
parse_with_structure
parse_with_structure(path: Path | str) -> tuple[DocumentStructure, SemanticDocument]
Parse a document and return both the structure and the semantic document.
Useful when callers need to fall back to structure blocks because semantic extraction produced no units.
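A minimal sketch of that fallback pattern follows. The two dataclasses are stand-ins for the real `DocumentStructure` and `SemanticDocument`, and the attribute names `blocks` and `units` are assumptions borrowed from the wording above. `reviewable_items` is a hypothetical helper, not part of the module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentStructure:
    """Stand-in: assumes structure exposes a list of blocks."""
    blocks: List[str] = field(default_factory=list)


@dataclass
class SemanticDocument:
    """Stand-in: assumes the semantic document exposes a list of units."""
    units: List[str] = field(default_factory=list)


def reviewable_items(structure, semantic):
    """Prefer semantic units; fall back to raw structure blocks if there are none."""
    if semantic.units:
        return semantic.units
    return structure.blocks
```

A caller would unpack the tuple returned by `parse_with_structure` and pass both parts to a helper like this one.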
extract
extract(path: Path, pages: set[int] | None = None, language: str | None = None) -> ExtractionResult
Run only the extraction layer (PDF only).
layout
layout(extraction: ExtractionResult, page_progress: Callable[[int, int], None] | None = None) -> list[LayoutPage]
Run layout analysis on an extraction result.
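The `page_progress` parameter takes a callable with two `int` arguments and no return value. The source does not say what the two integers mean, so the sketch below assumes they are "pages completed" and "total pages". `run_pages` is a hypothetical stand-in for a per-page loop, not the real `layout` implementation.

```python
from typing import Callable, List, Optional


def run_pages(
    page_count: int,
    page_progress: Optional[Callable[[int, int], None]] = None,
) -> List[str]:
    """Process pages one at a time, reporting progress after each page."""
    results = []
    for index in range(page_count):
        # Placeholder for real per-page work.
        results.append(f"page-{index + 1}")
        if page_progress is not None:
            # Assumed argument order: (pages completed, total pages).
            page_progress(index + 1, page_count)
    return results


def print_progress(done: int, total: int) -> None:
    """Example callback that prints a simple counter."""
    print(f"layout: {done}/{total} pages")
```

Passing `print_progress` as the callback would print one line per page. Leaving the argument at `None` skips reporting entirely.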
structure
structure(layout_pages: list[LayoutPage] | None = None, content: str | None = None, source_type: str = 'pdf', source_path: str | None = None) -> DocumentStructure
Build document structure.
Builds from either layout pages (PDF path) or a content string (adapters).
semantic
semantic(structure: DocumentStructure) -> SemanticDocument
Run semantic extraction on a document structure.
get_default_profile
get_default_profile(profile_id: str = 'base') -> Profile
Get a profile by ID from the built-in registry.
This is a convenience factory for loading profiles without manually creating a ProfileRegistry. Use this when you want to configure a pipeline with a standard profile.
Example
pipeline = ParsingPipeline(profile=get_default_profile())
pipeline = ParsingPipeline(profile=get_default_profile("regulatory"))
create_pipeline
create_pipeline(profile_id: str | None = None, profile_registry: ProfileRegistry | None = None, ocr_backend: OcrBackend | None | object = _DEFAULT_OCR, use_yolo_layout: bool = True) -> ParsingPipeline
Create a parsing pipeline with an optional profile.