vibe.review.parsing.extraction.ocr.analysis¶

PDF page analysis for OCR routing decisions.

Analyzes PDF pages to determine which ones require OCR (scanned images) and which already have a usable text layer.

PdfPageAnalysis ¶

Analysis results for a single PDF page.

analyze_pdf_pages ¶

analyze_pdf_pages(pdf_path: Path, pages: set[int] | None = None) -> list[PdfPageAnalysis]

Analyze pages in a PDF to determine text/image characteristics.

For each page, determines:

- How many text characters are present (excluding whitespace)
- The ratio of the largest image to the page area

Parameters:

- `pdf_path` (`Path`) – Path to the PDF file.
- `pages` (`set[int] | None`, default: `None`) – Optional set of 1-based page numbers to analyze. If None, all pages are analyzed.

Returns:	`list[PdfPageAnalysis]` – List of PdfPageAnalysis objects, one per analyzed page.

Raises:	`FileNotFoundError` – If the PDF doesn't exist.
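The two per-page measurements described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper names `count_text_chars` and `largest_image_ratio`, and the representation of images as `(width, height)` tuples in page units, are assumptions made for this example.

```python
from __future__ import annotations


def count_text_chars(text: str) -> int:
    """Count the characters in extracted page text, ignoring all whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def largest_image_ratio(
    page_size: tuple[float, float],
    image_sizes: list[tuple[float, float]],
) -> float:
    """Return the area of the largest image divided by the page area.

    Both arguments are hypothetical inputs: (width, height) pairs in the
    same units. A page with no images (or a degenerate page) yields 0.0.
    """
    page_w, page_h = page_size
    page_area = page_w * page_h
    if page_area <= 0 or not image_sizes:
        return 0.0
    largest = max(w * h for w, h in image_sizes)
    return largest / page_area
```

A full-page scan would typically give a ratio close to 1.0 and a character count near zero, whereas a text page with a small logo would give a high character count and a low ratio.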

select_ocr_pages ¶

select_ocr_pages(analysis: list[PdfPageAnalysis], *, text_layer_min_chars: int, image_ratio_threshold: float = OCR_PAGE_IMAGE_RATIO_THRESHOLD) -> list[int]

Select pages that need OCR based on analysis results.

A page needs OCR if both conditions hold:

- It has fewer than `text_layer_min_chars` text characters, AND
- It has a large image (ratio > `image_ratio_threshold`)

Parameters:

- `analysis` (`list[PdfPageAnalysis]`) – Page analysis results from analyze_pdf_pages.
- `text_layer_min_chars` (`int`) – Minimum number of text characters for a page to count as having a usable text layer.
- `image_ratio_threshold` (`float`, default: `OCR_PAGE_IMAGE_RATIO_THRESHOLD`) – Ratio of the largest image to the page area above which a page with too little text is selected for OCR.

Returns:	`list[int]` – List of 1-based page numbers that need OCR.
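The selection rule can be illustrated with a small self-contained sketch. `PageStats` is a hypothetical stand-in for `PdfPageAnalysis`, whose field names are not documented here, and `pick_ocr_pages` is an illustrative re-statement of the rule rather than the library function. The threshold values are arbitrary examples and do not reflect the actual value of `OCR_PAGE_IMAGE_RATIO_THRESHOLD`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical stand-in for PdfPageAnalysis; field names are assumed."""

    page_number: int            # 1-based page number
    text_chars: int             # non-whitespace characters in the text layer
    largest_image_ratio: float  # largest image area / page area


def pick_ocr_pages(
    analysis: list[PageStats],
    *,
    text_layer_min_chars: int,
    image_ratio_threshold: float,
) -> list[int]:
    """Return page numbers with too little text AND a large image."""
    return [
        page.page_number
        for page in analysis
        if page.text_chars < text_layer_min_chars
        and page.largest_image_ratio > image_ratio_threshold
    ]


pages = [
    PageStats(1, 1200, 0.10),  # ordinary text page
    PageStats(2, 0, 0.95),     # full-page scan with no text layer
    PageStats(3, 5, 0.20),     # nearly empty page, only a small image
    PageStats(4, 900, 0.90),   # large image, but the text layer is usable
]

selected = pick_ocr_pages(pages, text_layer_min_chars=50, image_ratio_threshold=0.5)
print(selected)  # [2]
```

Only page 2 satisfies both conditions. Page 3 has little text but no large image, and page 4 has a large image but enough text, so neither is selected. This sketch preserves the input order; whether the real function sorts its result is not stated in this reference.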