vibe.review.parsing.extraction.hybrid

Hybrid PDF extractor that intelligently routes pages to text or OCR extraction.

For PDFs with mixed content (text + scanned pages), this extractor:

1. Analyzes each page to determine whether it is text-layer or scanned
2. Routes text-layer pages to PdfExtractor (fast, accurate)
3. Routes scanned pages to OcrExtractor (slower but necessary)
4. Merges the results, maintaining proper page order
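The route-and-merge flow can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `route_and_merge`, the `PageExtractor` alias, and the `"text"`/`"scanned"` labels are hypothetical names, and the extractors are modeled as plain callables that return strings instead of real word objects.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical stand-in: an extractor maps a 1-based page number to its words.
PageExtractor = Callable[[int], list[str]]


def route_and_merge(
    page_kinds: dict[int, str],
    text_extractor: PageExtractor,
    ocr_extractor: PageExtractor | None,
) -> list[tuple[int, list[str]]]:
    """Send each page to the matching extractor and return results in page order."""
    results: dict[int, list[str]] = {}
    for page_number, kind in page_kinds.items():
        if kind == "text":
            results[page_number] = text_extractor(page_number)
        elif ocr_extractor is not None:
            results[page_number] = ocr_extractor(page_number)
        else:
            # Assumption: with no OCR extractor, this sketch skips the page.
            continue
    # Sorting by page number restores document order regardless of the
    # order in which pages were processed.
    return sorted(results.items())


kinds = {3: "text", 1: "scanned", 2: "text"}
merged = route_and_merge(kinds, lambda n: [f"text{n}"], lambda n: [f"ocr{n}"])
print(merged)
```

Keying the intermediate results by page number keeps the merge step independent of which extractor handled which page.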

HybridPdfExtractor

Smart PDF extractor that routes pages to text or OCR based on analysis.

For each page, determines whether to use:

- Text extraction (PyMuPDF): Fast, high quality for text-layer PDFs
- OCR extraction (Tesseract): Necessary for scanned/image pages

Detection heuristic:

- If the page has at least `text_layer_min_chars` characters of text AND no large image → text layer
- If the page has a single large image (>80% of the page area) AND few characters → scanned
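A minimal sketch of the two rules above, for illustration only. `PageStats` and `classify_page` are hypothetical names, not part of the documented API; the sketch approximates "single large image" by the area ratio of the largest image on the page and assumes the 80% figure corresponds to `image_ratio_threshold`. The documentation does not say how pages matching neither rule are handled, so the sketch labels them `"ambiguous"`.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not part of the documented API."""

    char_count: int             # characters found in the page's text layer
    largest_image_ratio: float  # largest image area / page area, 0.0 to 1.0


def classify_page(
    stats: PageStats,
    text_layer_min_chars: int = 10,
    image_ratio_threshold: float = 0.8,
) -> str:
    """Return "text", "scanned", or "ambiguous" for one page."""
    has_enough_text = stats.char_count >= text_layer_min_chars
    has_large_image = stats.largest_image_ratio > image_ratio_threshold

    if has_enough_text and not has_large_image:
        return "text"
    if has_large_image and not has_enough_text:
        return "scanned"
    # Assumption: the remaining combinations are not covered by the
    # documented heuristic, so they are reported separately here.
    return "ambiguous"


print(classify_page(PageStats(char_count=500, largest_image_ratio=0.0)))
print(classify_page(PageStats(char_count=3, largest_image_ratio=0.95)))
```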

Attributes:

- `pdf_extractor` – Extractor for text-layer pages.
- `ocr_extractor` – Extractor for scanned pages (optional).
- `text_layer_min_chars` – Minimum chars to consider page as text-layer.
- `image_ratio_threshold` – Image/page ratio threshold for OCR detection.

name

name: str

Return 'hybrid' as the extractor identifier.

supported_extensions

supported_extensions: list[str]

Return list of supported file extensions (PDF only).

__init__

__init__(pdf_extractor: PdfExtractor | None = None, ocr_extractor: OcrExtractor | None = None, text_layer_min_chars: int = 10, image_ratio_threshold: float = 0.8) -> None

Initialize the hybrid extractor.

Parameters:

- `pdf_extractor` (`PdfExtractor | None`, default: `None`) – Extractor for text-layer pages. Created if not provided.
- `ocr_extractor` (`OcrExtractor | None`, default: `None`) – Extractor for scanned pages. If None, scanned pages will raise an error or be skipped.
- `text_layer_min_chars` (`int`, default: `10`) – Minimum number of text characters for a page to be treated as text-layer.
- `image_ratio_threshold` (`float`, default: `0.8`) – Image-to-page area ratio threshold used to detect pages that need OCR.

extract

extract(path: Path, language: str | None = None, pages: set[int] | None = None) -> ExtractionResult

Extract text from a PDF, using text or OCR as appropriate per page.

Parameters:

- `path` (`Path`) – Path to the PDF file.
- `language` (`str | None`, default: `None`) – Language code for OCR (if needed).
- `pages` (`set[int] | None`, default: `None`) – Optional set of 1-based page numbers to extract. If None, all pages are extracted.

Returns:	`ExtractionResult` – Extracted words from the requested pages.

Raises:	`ExtractionError` – If extraction fails.
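To illustrate the 1-based `pages` convention, here is a small sketch of how a page selection might be resolved before extraction. `resolve_pages` is a hypothetical helper, not part of the documented API, and silently ignoring out-of-range page numbers is an assumption; the actual behavior is not documented here.

```python
from __future__ import annotations


def resolve_pages(total_pages: int, pages: set[int] | None) -> list[int]:
    """Return the 1-based page numbers to process, in ascending order."""
    if pages is None:
        # None means "all pages", as described for the `pages` parameter.
        return list(range(1, total_pages + 1))
    # Assumption: page numbers outside 1..total_pages are dropped.
    return sorted(p for p in pages if 1 <= p <= total_pages)


selected = resolve_pages(4, {3, 1, 9})
print(selected)

# PyMuPDF indexes pages from 0, so a 1-based number maps to index n - 1.
zero_based = [n - 1 for n in selected]
print(zero_based)
```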

iter_extract

iter_extract(path: Path, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:	`tuple[int, int, list[ExtractedWord]]` – Tuples of (page_number, total_pages, words_for_page).
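A sketch of how the yielded tuples could be consumed to report progress while accumulating words. `fake_iter_extract` is a hypothetical stand-in that only mimics the documented tuple shape; it yields plain strings instead of `ExtractedWord` objects and assumes `page_number` is 1-based, which the documentation states for `extract` but not explicitly for `iter_extract`.

```python
from collections.abc import Iterable, Iterator


def fake_iter_extract() -> Iterator[tuple[int, int, list[str]]]:
    """Hypothetical stand-in yielding (page_number, total_pages, words_for_page)."""
    pages = [["Invoice", "2024"], [], ["Total", "due", "soon"]]
    for page_number, words in enumerate(pages, start=1):
        yield page_number, len(pages), words


def collect_with_progress(
    stream: Iterable[tuple[int, int, list[str]]],
) -> tuple[list[str], list[str]]:
    """Gather all words and build one progress message per page."""
    all_words: list[str] = []
    progress: list[str] = []
    for page_number, total_pages, words in stream:
        all_words.extend(words)
        progress.append(f"{page_number}/{total_pages} ({len(words)} words)")
    return all_words, progress


words, progress = collect_with_progress(fake_iter_extract())
for line in progress:
    print(line)
```

Because results arrive page by page, a caller can update a progress bar or stop early without waiting for the whole document.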