vibe.review.parsing.extraction.ocr.analysis

PDF page analysis for OCR routing decisions.

Analyzes PDF pages to determine which require OCR (scanned images) and which have usable text layers.

PdfPageAnalysis

Analysis results for a single PDF page.

analyze_pdf_pages

analyze_pdf_pages(pdf_path: Path, pages: set[int] | None = None) -> list[PdfPageAnalysis]

Analyze pages in a PDF to determine text/image characteristics.

For each page, determines:
  • How many text characters are present (excluding whitespace)
  • The ratio of the largest image to the page area

Parameters:
  • pdf_path (Path) –

    Path to the PDF file.

  • pages (set[int] | None, default: None) –

    Optional set of 1-based page numbers to analyze. If None, all pages are analyzed.

Returns:
  • list[PdfPageAnalysis]

    Analysis results for the requested pages.

Raises:
  • FileNotFoundError

    If the PDF doesn't exist.
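
The two per-page measurements described above (non-whitespace character count and largest-image-to-page-area ratio) can be sketched in plain Python. This is a minimal illustration only, not the module's implementation: the helper names count_text_chars and largest_image_ratio, and the representation of page and image sizes as (width, height) tuples, are assumptions made for this example.

```python
from __future__ import annotations


def count_text_chars(page_text: str) -> int:
    """Count the characters in a page's text, ignoring all whitespace."""
    return sum(1 for ch in page_text if not ch.isspace())


def largest_image_ratio(
    page_size: tuple[float, float],
    image_sizes: list[tuple[float, float]],
) -> float:
    """Return the area of the largest image divided by the page area.

    Hypothetical helper: all sizes are (width, height) in the same units.
    Returns 0.0 when the page has no images or a degenerate size.
    """
    page_width, page_height = page_size
    page_area = page_width * page_height
    if page_area <= 0 or not image_sizes:
        return 0.0
    largest_area = max(width * height for width, height in image_sizes)
    return largest_area / page_area
```

A scanned page typically yields a character count near zero and a ratio close to 1, while a born-digital page yields a large character count and a small ratio.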

select_ocr_pages

select_ocr_pages(analysis: list[PdfPageAnalysis], *, text_layer_min_chars: int, image_ratio_threshold: float = OCR_PAGE_IMAGE_RATIO_THRESHOLD) -> list[int]

Select pages that need OCR based on analysis results.

A page needs OCR if:
  • It has fewer than text_layer_min_chars characters of text, AND
  • It has a large image (ratio > image_ratio_threshold)

Parameters:
  • analysis (list[PdfPageAnalysis]) –

    Page analysis results from analyze_pdf_pages.

  • text_layer_min_chars (int) –

    Minimum number of text characters a page must contain to be treated as having a usable text layer.

  • image_ratio_threshold (float, default: OCR_PAGE_IMAGE_RATIO_THRESHOLD) –

    Ratio of the largest image to the page area above which a page is treated as image-dominated for OCR detection.

Returns:
  • list[int]

    List of 1-based page numbers that need OCR.
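
The selection rule documented for select_ocr_pages can be sketched as follows. This is an illustration under stated assumptions, not the library's code: PageStats is a hypothetical stand-in for PdfPageAnalysis (whose fields are not documented here), its field names are invented for the example, and the threshold values in the usage lines are arbitrary rather than the real OCR_PAGE_IMAGE_RATIO_THRESHOLD.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical stand-in for PdfPageAnalysis; field names are assumed."""

    page_number: int            # 1-based page number
    text_chars: int             # non-whitespace text characters on the page
    largest_image_ratio: float  # largest image area / page area


def select_pages_sketch(
    analysis: list[PageStats],
    *,
    text_layer_min_chars: int,
    image_ratio_threshold: float,
) -> list[int]:
    """Return pages with too little text AND a large image."""
    return [
        page.page_number
        for page in analysis
        if page.text_chars < text_layer_min_chars
        and page.largest_image_ratio > image_ratio_threshold
    ]


pages = [
    PageStats(page_number=1, text_chars=1200, largest_image_ratio=0.0),  # text page
    PageStats(page_number=2, text_chars=3, largest_image_ratio=0.95),    # scanned page
    PageStats(page_number=3, text_chars=0, largest_image_ratio=0.1),     # nearly blank page
]
ocr_pages = select_pages_sketch(
    pages, text_layer_min_chars=50, image_ratio_threshold=0.5
)
print(ocr_pages)  # [2]
```

Because both conditions must hold, page 3 is not selected: it has no text, but it also has no large image that OCR could read.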