vibe.review.parsing.extraction.hybrid
Hybrid PDF extractor that intelligently routes pages to text or OCR extraction.
For PDFs with mixed content (text and scanned pages), this extractor:

1. Analyzes each page to determine whether it has a text layer or is scanned
2. Routes text-layer pages to PdfExtractor (fast, accurate)
3. Routes scanned pages to OcrExtractor (slower but necessary)
4. Merges the results, maintaining proper page order
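The routing and merging steps can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `route_and_merge`, `is_text_layer`, `extract_text_pages`, and `extract_ocr_pages` are hypothetical stand-ins, and pages are assumed to be numbered from 0.

```python
from typing import Callable


def route_and_merge(
    page_count: int,
    is_text_layer: Callable[[int], bool],
    extract_text_pages: Callable[[list[int]], dict[int, str]],
    extract_ocr_pages: Callable[[list[int]], dict[int, str]],
) -> list[tuple[int, str]]:
    """Split pages between two extractors, then merge results by page index.

    All names here are hypothetical; the callables stand in for the
    text-layer and OCR extractors described above.
    """
    # Step 1: analyze each page once and partition the page indices.
    text_pages: list[int] = []
    ocr_pages: list[int] = []
    for page in range(page_count):
        (text_pages if is_text_layer(page) else ocr_pages).append(page)

    # Steps 2 and 3: run each extractor over its own batch of pages.
    merged: dict[int, str] = {}
    merged.update(extract_text_pages(text_pages))
    merged.update(extract_ocr_pages(ocr_pages))

    # Step 4: restore document order regardless of which batch ran first.
    return sorted(merged.items())


# Stand-in extractors for demonstration: page 1 is treated as scanned.
demo = route_and_merge(
    page_count=3,
    is_text_layer=lambda page: page != 1,
    extract_text_pages=lambda pages: {p: f"text:{p}" for p in pages},
    extract_ocr_pages=lambda pages: {p: f"ocr:{p}" for p in pages},
)
```

Batching pages per extractor and sorting at the end keeps the merge step independent of the order in which the two extractors run.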
HybridPdfExtractor
Smart PDF extractor that routes pages to text or OCR based on analysis.
For each page, determines whether to use:

- Text extraction (PyMuPDF): fast, high quality for text-layer PDFs
- OCR extraction (Tesseract): necessary for scanned/image pages

Detection heuristic:

- If the page has >= text_layer_min_chars characters of text AND no large image → text layer
- If the page has a single large image (>80% of the page area) AND few characters → scanned
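The two rules above can be expressed as a small classification function. This is a sketch under stated assumptions, not the extractor's actual code: `PageStats` and `classify_page` are hypothetical names, the large-image test is simplified to the area ratio of the largest image, and the documented rules do not say what happens when neither applies, so the sketch reports those pages as `"undecided"` rather than guessing.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements (not part of the documented API)."""

    char_count: int  # characters found in the page's text layer
    largest_image_ratio: float  # largest image area / page area, 0.0 if no image


def classify_page(
    stats: PageStats,
    text_layer_min_chars: int = 10,
    image_ratio_threshold: float = 0.8,
) -> str:
    """Return "text", "ocr", or "undecided" for one page."""
    enough_text = stats.char_count >= text_layer_min_chars
    has_large_image = stats.largest_image_ratio > image_ratio_threshold

    # Rule 1: enough text and no large image means a usable text layer.
    if enough_text and not has_large_image:
        return "text"
    # Rule 2: a large image with little text means a scanned page.
    if has_large_image and not enough_text:
        return "ocr"
    # Neither documented rule applies; the real behavior is not specified here.
    return "undecided"
```

The default arguments mirror the defaults in the `__init__` signature; a page with plenty of text behind a full-page image, or a nearly empty page with no image, falls into the unspecified case.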
| Attributes: |
|
|---|
supported_extensions
supported_extensions: list[str]
Return list of supported file extensions (PDF only).
__init__
__init__(pdf_extractor: PdfExtractor | None = None, ocr_extractor: OcrExtractor | None = None, text_layer_min_chars: int = 10, image_ratio_threshold: float = 0.8) -> None
Initialize the hybrid extractor.
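| Parameters: | Type | Default |
|---|---|---|
| pdf_extractor | PdfExtractor \| None | None |
| ocr_extractor | OcrExtractor \| None | None |
| text_layer_min_chars | int | 10 |
| image_ratio_threshold | float | 0.8 |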
extract
extract(path: Path, language: str | None = None, pages: set[int] | None = None) -> ExtractionResult
Extract text from a PDF, using text or OCR as appropriate per page.
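| Parameters: | Type | Default |
|---|---|---|
| path | Path | (required) |
| language | str \| None | None |
| pages | set[int] \| None | None |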
| Returns: |
|---|
| ExtractionResult |
| Raises: |
|
|---|
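The `pages` argument suggests that extraction can be limited to a subset of pages. A minimal sketch of such a filter follows; `select_pages` is a hypothetical helper, and both the 0-based numbering and the reading of `None` as "all pages" are assumptions that this page does not confirm.

```python
from __future__ import annotations


def select_pages(page_count: int, pages: set[int] | None = None) -> list[int]:
    """Return the page indices to process, in document order.

    Assumptions (not confirmed by the documentation): pages are 0-based,
    None selects every page, and out-of-range requests are ignored.
    """
    if pages is None:
        return list(range(page_count))
    # Iterating over the document's own range keeps the output ordered
    # and silently drops indices that do not exist in the document.
    return [page for page in range(page_count) if page in pages]
```

Filtering before analysis would avoid running page classification or OCR on pages the caller did not ask for.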
iter_extract
iter_extract(path: Path, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]
Extract text with per-page progress.
| Yields: |
|---|
| tuple[int, int, list[ExtractedWord]] |
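A generator with this shape can drive a progress display while results accumulate. The sketch below uses a stand-in generator instead of the real extractor. It assumes the two integers are the current page number and the total page count, which the signature alone does not confirm, and plain strings take the place of `ExtractedWord`. `fake_iter_extract` and `collect_with_progress` are hypothetical names.

```python
from typing import Iterable, Iterator


def fake_iter_extract(pages: list[list[str]]) -> Iterator[tuple[int, int, list[str]]]:
    """Stand-in for iter_extract: yields (page_number, total_pages, words).

    The meaning of the two integers is an assumption made for this sketch.
    """
    total = len(pages)
    for number, words in enumerate(pages, start=1):
        yield number, total, words


def collect_with_progress(
    stream: Iterable[tuple[int, int, list[str]]],
) -> tuple[list[str], list[str]]:
    """Consume the stream, recording a progress label per page."""
    progress: list[str] = []
    all_words: list[str] = []
    for page_number, total_pages, words in stream:
        # A real caller might update a progress bar here instead.
        progress.append(f"{page_number}/{total_pages}")
        all_words.extend(words)
    return progress, all_words


progress, words = collect_with_progress(fake_iter_extract([["a", "b"], [], ["c"]]))
```

Because each page is yielded as soon as it is ready, the caller can report progress or cancel early without waiting for the whole document.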