vibe.review.parsing.extraction.ocr.extractor¶

OCR extraction layer - converts OCR results to ExtractedWord[].

This module integrates the OCR backend with the parsing pipeline by converting Tesseract TSV output into ExtractedWord objects with proper coordinates, confidence scores, and provenance tracking.
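As a rough illustration of the conversion described above, the sketch below parses Tesseract TSV text into word records. The column names are Tesseract's standard TSV columns; the `Word` dataclass, the `parse_tesseract_tsv` function, and the 0.0-1.0 confidence scaling are assumptions made for this example and are not the module's actual `ExtractedWord` or its implementation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for ExtractedWord; the real class may differ."""
    text: str
    page: int
    left: int
    top: int
    width: int
    height: int
    confidence: float  # assumption: Tesseract's 0-100 score scaled to 0.0-1.0


def parse_tesseract_tsv(tsv: str) -> list[Word]:
    """Turn Tesseract TSV text into word records, skipping layout-only rows."""
    reader = csv.DictReader(
        io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words: list[Word] = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Level 5 rows are words; page, block, paragraph, and line rows
        # carry a confidence of -1 and no text.
        if row["level"] != "5" or conf < 0 or not text:
            continue
        words.append(
            Word(
                text=text,
                page=int(row["page_num"]),
                left=int(row["left"]),
                top=int(row["top"]),
                width=int(row["width"]),
                height=int(row["height"]),
                confidence=conf / 100.0,
            )
        )
    return words


# Made-up sample input: one page-level row followed by two word rows.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t2480\t3508\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t300\t150\t210\t48\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t530\t150\t120\t48\t91\t2024\n"
)
words = parse_tesseract_tsv(SAMPLE_TSV)
```

Here the page-level row is dropped and only the two word rows survive. Provenance tracking is omitted because the source does not describe its shape.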

OcrExtractor ¶

Extract text from scanned documents using OCR.

Converts OCR backend output (Tesseract TSV) into ExtractedWord objects compatible with the rest of the parsing pipeline.

Attributes:
	`backend` – The OCR backend to use (e.g., TesseractDockerBackend).
	`render_dpi` – DPI used for rendering pages (for coordinate conversion).
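The role of `render_dpi` in coordinate conversion can be sketched as follows, assuming the target coordinate space is PDF points (72 per inch) with the same origin as the rendered image. That target space is an assumption; the source only says the DPI is used for coordinate conversion. The helper names are hypothetical.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def pixels_to_points(value_px: float, render_dpi: int) -> float:
    """Convert a pixel measurement at render_dpi into PDF points."""
    return value_px * POINTS_PER_INCH / render_dpi


def bbox_to_points(
    left: int, top: int, width: int, height: int, render_dpi: int
) -> tuple[float, float, float, float]:
    """Convert a pixel box (left, top, width, height) to (x0, y0, x1, y1) in points."""
    return (
        pixels_to_points(left, render_dpi),
        pixels_to_points(top, render_dpi),
        pixels_to_points(left + width, render_dpi),
        pixels_to_points(top + height, render_dpi),
    )


# A word box reported by OCR on a page rendered at 300 DPI.
bbox = bbox_to_points(300, 150, 210, 48, render_dpi=300)
```

At 300 DPI each pixel is 0.24 points, so a 300-pixel offset maps to exactly one inch (72 points).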

name ¶

name: str

Return 'ocr' as the extractor identifier.

supported_extensions ¶

supported_extensions: list[str]

Return the list of supported file extensions (currently PDF only).

__init__ ¶

__init__(backend: OcrBackend, render_dpi: int | None = None) -> None

Initialize the OCR extractor.

Parameters:
	`backend` (`OcrBackend`) – OCR backend instance.
	`render_dpi` (`int | None`, default: `None`) – DPI at which pages are rendered for OCR. If not provided, the extractor attempts to read it from the backend.

extract ¶

extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> ExtractionResult

Extract text from a document using OCR.

Parameters:
	`path` (`Path`) – Path to the PDF file.
	`page_numbers` (`list[int] | None`, default: `None`) – Specific pages to OCR (1-based). If None, OCR all pages.
	`language` (`str | None`, default: `None`) – Language code for OCR (e.g., "en", "sv").

Returns:	`ExtractionResult` – The extracted words and page metadata.

Raises:	`ExtractionError` – If OCR fails.

iter_extract ¶

iter_extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:	`tuple[int, int, list[ExtractedWord]]` – Tuples of (page_number, total_pages, words_for_page).
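A caller might consume the yielded tuples along these lines. This is a minimal sketch: `fake_iter_extract` is a hypothetical stand-in that yields tuples of the documented shape so the example runs without an OCR backend, and plain strings take the place of `ExtractedWord` objects. It also assumes `total_pages` counts the pages being processed, which the source does not state.

```python
from typing import Iterable, Iterator

PageTuple = tuple[int, int, list[str]]


def fake_iter_extract(pages: dict[int, list[str]]) -> Iterator[PageTuple]:
    """Stand-in generator yielding (page_number, total_pages, words_for_page)."""
    total_pages = len(pages)
    for page_number in sorted(pages):
        yield page_number, total_pages, pages[page_number]


def collect_with_progress(
    stream: Iterable[PageTuple],
) -> tuple[dict[int, list[str]], list[str]]:
    """Gather words per page and record one progress message per yielded page."""
    words_by_page: dict[int, list[str]] = {}
    progress: list[str] = []
    for done, (page_number, total_pages, words) in enumerate(stream, start=1):
        words_by_page[page_number] = words
        progress.append(f"page {page_number}: {done}/{total_pages}")
    return words_by_page, progress


words_by_page, progress = collect_with_progress(
    fake_iter_extract({1: ["Invoice"], 2: ["Total", "42"]})
)
```

Because pages arrive one at a time, a caller can update a progress bar or stop early without waiting for the whole document.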