vibe.review.parsing.extraction.ocr.analysis
PDF page analysis for OCR routing decisions.
Analyzes PDF pages to determine which require OCR (scanned images) and which already have a usable text layer.
PdfPageAnalysis
Analysis results for a single PDF page.
analyze_pdf_pages
analyze_pdf_pages(pdf_path: Path, pages: set[int] | None = None) -> list[PdfPageAnalysis]
Analyze pages in a PDF to determine text/image characteristics.
For each page, determines:

- How many text characters are present (excluding whitespace)
- The ratio of the largest image to the page area
| Parameters: | Type | Default |
|---|---|---|
| pdf_path | Path | required |
| pages | set[int] \| None | None |
| Returns: |
|---|
| list[PdfPageAnalysis] |
| Raises: |
|
|---|
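
The two per-page measurements described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper names `count_text_chars` and `largest_image_ratio`, and the `(x0, y0, x1, y1)` box format, are assumptions made for the example. How the text and the image boxes are obtained from the PDF is left out, since the source does not say which PDF library is used.

```python
from typing import Iterable, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def count_text_chars(text: str) -> int:
    """Count characters in a page's extracted text, ignoring all whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def largest_image_ratio(
    page_width: float, page_height: float, image_boxes: Iterable[Box]
) -> float:
    """Return the largest image area divided by the page area (0.0 if no images)."""
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    largest = 0.0
    for x0, y0, x1, y1 in image_boxes:
        # Clamp negative extents so malformed boxes contribute zero area.
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        largest = max(largest, area)
    return largest / page_area
```

A scanned page would typically show a very small character count together with a ratio close to 1.0, because a single image covers almost the whole page.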
select_ocr_pages
select_ocr_pages(analysis: list[PdfPageAnalysis], *, text_layer_min_chars: int, image_ratio_threshold: float = OCR_PAGE_IMAGE_RATIO_THRESHOLD) -> list[int]
Select pages that need OCR based on analysis results.
A page needs OCR if both of the following hold:

- It has fewer than text_layer_min_chars characters of text
- It has a large image (ratio > image_ratio_threshold)
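| Parameters: | Type | Default |
|---|---|---|
| analysis | list[PdfPageAnalysis] | required |
| text_layer_min_chars | int | required (keyword-only) |
| image_ratio_threshold | float | OCR_PAGE_IMAGE_RATIO_THRESHOLD (keyword-only) |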
| Returns: |
|---|
| list[int] |
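
The selection rule for select_ocr_pages can be illustrated with a small, self-contained sketch. The `PageStats` class and its field names are hypothetical stand-ins for PdfPageAnalysis, whose actual fields are not listed here, and `pick_ocr_pages` is an illustrative reimplementation of the rule rather than the library function. The thresholds 50 and 0.5 are example values only; the real value of OCR_PAGE_IMAGE_RATIO_THRESHOLD is not stated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageStats:
    """Hypothetical stand-in for PdfPageAnalysis; real field names may differ."""
    page_number: int
    text_chars: int             # non-whitespace characters in the text layer
    largest_image_ratio: float  # largest image area / page area


def pick_ocr_pages(
    stats: List[PageStats],
    *,
    text_layer_min_chars: int,
    image_ratio_threshold: float,
) -> List[int]:
    """Return page numbers with too little text AND a large image."""
    return [
        s.page_number
        for s in stats
        if s.text_chars < text_layer_min_chars
        and s.largest_image_ratio > image_ratio_threshold
    ]


pages = [
    PageStats(1, 1200, 0.10),  # plenty of text: no OCR
    PageStats(2, 3, 0.95),     # little text, page-sized image: OCR
    PageStats(3, 0, 0.0),      # blank page, no image: no OCR
    PageStats(4, 5, 0.5),      # ratio equals the threshold, not above it: no OCR
]
selected = pick_ocr_pages(pages, text_layer_min_chars=50, image_ratio_threshold=0.5)
```

Because both conditions must hold, a blank page with no image is not sent to OCR, and a page with a large image but a healthy text layer keeps its existing text.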