vibe.review.parsing.extraction.hybrid

Hybrid PDF extractor that intelligently routes pages to text or OCR extraction.

For PDFs with mixed content (text + scanned pages), this extractor:
  1. Analyzes each page to determine if it's text-layer or scanned
  2. Routes text-layer pages to PdfExtractor (fast, accurate)
  3. Routes scanned pages to OcrExtractor (slower but necessary)
  4. Merges results, maintaining proper page order
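The four steps above can be sketched roughly as follows. This is not the library's implementation: `route_and_merge`, the `page_kinds` mapping, and the two callables are hypothetical stand-ins for the page analysis, PdfExtractor, and OcrExtractor, and plain strings stand in for extracted words.

```python
from typing import Callable

# Hypothetical stand-in type: takes a 1-based page number, returns that page's words.
PageExtractor = Callable[[int], list[str]]


def route_and_merge(
    page_kinds: dict[int, str],
    extract_text: PageExtractor,
    extract_ocr: PageExtractor,
) -> list[tuple[int, list[str]]]:
    """Route each page by its detected kind, then merge in ascending page order."""
    results: dict[int, list[str]] = {}
    for page_number, kind in page_kinds.items():
        # Assumption: any page not marked "text" is sent to OCR.
        extractor = extract_text if kind == "text" else extract_ocr
        results[page_number] = extractor(page_number)
    # Sorting by page number restores document order regardless of input order.
    return [(n, results[n]) for n in sorted(results)]


merged = route_and_merge(
    {3: "text", 1: "text", 2: "scanned"},
    extract_text=lambda n: [f"text-{n}"],
    extract_ocr=lambda n: [f"ocr-{n}"],
)
print(merged)
```

Keeping results keyed by page number means the merge step does not depend on the order in which pages were analyzed or extracted.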

HybridPdfExtractor

Smart PDF extractor that routes pages to text or OCR based on analysis.

For each page, determines whether to use:
  • Text extraction (PyMuPDF): Fast, high quality for text-layer PDFs
  • OCR extraction (Tesseract): Necessary for scanned/image pages

Detection heuristic:
  • If page has >= text_layer_min_chars of text AND no large image → text layer
  • If page has single large image (>80% page area) AND few chars → scanned
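A minimal sketch of this heuristic, under stated assumptions. The function `classify_page` and its inputs are hypothetical: it takes a pre-computed character count and the area ratio of the page's largest image (a simplification of "single large image"), and it returns "ambiguous" for pages that match neither rule, since the documentation does not say how such pages are handled.

```python
from typing import Literal

PageKind = Literal["text", "scanned", "ambiguous"]


def classify_page(
    char_count: int,
    largest_image_ratio: float,
    text_layer_min_chars: int = 10,
    image_ratio_threshold: float = 0.8,
) -> PageKind:
    """Classify a page from its text length and largest image-to-page area ratio."""
    has_enough_text = char_count >= text_layer_min_chars
    has_large_image = largest_image_ratio > image_ratio_threshold

    if has_enough_text and not has_large_image:
        return "text"  # enough text and no dominating image
    if has_large_image and not has_enough_text:
        return "scanned"  # one image covers most of the page, little text
    # Assumption: the real extractor's fallback for these cases is not documented.
    return "ambiguous"


print(classify_page(char_count=500, largest_image_ratio=0.0))
print(classify_page(char_count=0, largest_image_ratio=0.95))
```

The defaults mirror the constructor defaults documented below (10 characters and a ratio of 0.8).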

Attributes:
  • pdf_extractor

    Extractor for text-layer pages.

  • ocr_extractor

    Extractor for scanned pages (optional).

  • text_layer_min_chars

    Minimum number of characters for a page to be considered text-layer.

  • image_ratio_threshold

    Ratio of image area to page area above which a page is treated as scanned and routed to OCR.

name

name: str

Return 'hybrid' as the extractor identifier.

supported_extensions

supported_extensions: list[str]

Return list of supported file extensions (PDF only).

__init__

__init__(pdf_extractor: PdfExtractor | None = None, ocr_extractor: OcrExtractor | None = None, text_layer_min_chars: int = 10, image_ratio_threshold: float = 0.8) -> None

Initialize the hybrid extractor.

Parameters:
  • pdf_extractor (PdfExtractor | None, default: None ) –

    Extractor for text-layer pages. Created if not provided.

  • ocr_extractor (OcrExtractor | None, default: None ) –

    Extractor for scanned pages. If None, scanned pages will raise an error or be skipped.

  • text_layer_min_chars (int, default: 10 ) –

    Minimum number of text characters for a page to be treated as text-layer.

  • image_ratio_threshold (float, default: 0.8 ) –

    Image-to-page area ratio threshold for treating a page as scanned and routing it to OCR.

extract

extract(path: Path, language: str | None = None, pages: set[int] | None = None) -> ExtractionResult

Extract text from a PDF, using text or OCR as appropriate per page.

Parameters:
  • path (Path) –

    Path to the PDF file.

  • language (str | None, default: None ) –

    Language code for OCR (if needed).

  • pages (set[int] | None, default: None ) –

    Optional set of 1-based page numbers to extract. If None, all pages are extracted.

Returns:
  • ExtractionResult

    ExtractionResult with extracted words from requested pages.
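To illustrate the 1-based `pages` filter, here is a small sketch. The helper `select_pages` is hypothetical and not part of the documented API, and silently ignoring out-of-range page numbers is an assumption; the actual extractor may behave differently.

```python
from typing import Optional


def select_pages(total_pages: int, pages: Optional[set[int]] = None) -> list[int]:
    """Return the 1-based page numbers to process, in ascending order."""
    all_pages = range(1, total_pages + 1)
    if pages is None:
        # No filter given: every page is extracted.
        return list(all_pages)
    # Assumption: requested pages outside the document are silently ignored.
    return [n for n in all_pages if n in pages]


print(select_pages(4))
print(select_pages(4, {3, 1, 9}))
```

Because `pages` is a set, the order in which page numbers are requested carries no meaning; the sketch always returns them in document order.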

Raises:

iter_extract

iter_extract(path: Path, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:
  • tuple[int, int, list[ExtractedWord]]

    Tuples of (page_number, total_pages, words_for_page).
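A sketch of how the yielded tuples might be consumed to report progress. `fake_iter_extract` is a hypothetical stand-in that yields tuples of the same shape, with plain strings in place of ExtractedWord objects and 1-based page numbers assumed; `collect_with_progress` is likewise illustrative only.

```python
from typing import Iterable, Iterator

PageTuple = tuple[int, int, list[str]]


def fake_iter_extract() -> Iterator[PageTuple]:
    """Hypothetical stand-in yielding (page_number, total_pages, words_for_page)."""
    pages = [["Invoice", "2024"], [], ["Total", "due", "now"]]
    for page_number, words in enumerate(pages, start=1):
        yield page_number, len(pages), words


def collect_with_progress(
    stream: Iterable[PageTuple],
) -> tuple[dict[int, list[str]], list[str]]:
    """Gather words per page and build one progress message per yielded page."""
    words_by_page: dict[int, list[str]] = {}
    progress: list[str] = []
    for page_number, total_pages, words in stream:
        words_by_page[page_number] = words
        progress.append(f"page {page_number}/{total_pages}: {len(words)} words")
    return words_by_page, progress


words_by_page, progress = collect_with_progress(fake_iter_extract())
for line in progress:
    print(line)
```

With the real extractor, the stream would come from `iter_extract(path)` instead of the stand-in generator.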