vibe.review.parsing.extraction.ocr.extractor

OCR extraction layer: converts OCR results into a list of ExtractedWord objects.

This module integrates the OCR backend with the parsing pipeline by converting Tesseract TSV output into ExtractedWord objects with proper coordinates, confidence scores, and provenance tracking.
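The conversion described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it relies only on the standard Tesseract TSV column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). The `Word` dataclass is a hypothetical stand-in for ExtractedWord, whose real fields are not shown in this reference, and the 0 to 1 confidence scale is an assumption.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for ExtractedWord; the real fields may differ."""
    text: str
    left: int
    top: int
    width: int
    height: int
    confidence: float  # assumed scale: 0.0 to 1.0
    source: str = "ocr"  # simple provenance marker (assumption)


def parse_tesseract_tsv(tsv: str) -> list[Word]:
    """Keep only word-level rows of Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words: list[Word] = []
    for row in reader:
        # Level 5 rows are words; levels 1-4 are page, block, paragraph, line.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        # Tesseract reports conf = -1 for rows that carry no recognized word.
        if not text or conf < 0:
            continue
        words.append(
            Word(
                text=text,
                left=int(row["left"]),
                top=int(row["top"]),
                width=int(row["width"]),
                height=int(row["height"]),
                confidence=conf / 100.0,
            )
        )
    return words


SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t2550\t3300\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t300\t150\t240\t60\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t560\t150\t120\t60\t91\t2024\n"
)

words = parse_tesseract_tsv(SAMPLE_TSV)
```

Coordinates in this sketch stay in rendered-image pixels; converting them to page units is a separate step that depends on the render DPI.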

OcrExtractor

Extract text from scanned documents using OCR.

Converts OCR backend output (Tesseract TSV) into ExtractedWord objects compatible with the rest of the parsing pipeline.

Attributes:
  • backend

    The OCR backend to use (e.g., TesseractDockerBackend).

  • render_dpi

    DPI used for rendering pages (for coordinate conversion).
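To illustrate why render_dpi matters for coordinate conversion: OCR bounding boxes are measured in pixels of the rendered page image, while PDF page geometry is conventionally expressed in points (72 per inch). The sketch below assumes the pipeline wants points; that target unit, and the helper names `pixels_to_points` and `bbox_to_points`, are assumptions for illustration and are not part of this module's documented API.

```python
POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def pixels_to_points(value_px: float, render_dpi: int) -> float:
    """Convert a pixel measurement at render_dpi into PDF points."""
    if render_dpi <= 0:
        raise ValueError("render_dpi must be positive")
    return value_px * POINTS_PER_INCH / render_dpi


def bbox_to_points(
    left: float, top: float, width: float, height: float, render_dpi: int
) -> tuple[float, float, float, float]:
    """Convert a pixel box (left, top, width, height) to (x0, y0, x1, y1) in points."""
    return (
        pixels_to_points(left, render_dpi),
        pixels_to_points(top, render_dpi),
        pixels_to_points(left + width, render_dpi),
        pixels_to_points(top + height, render_dpi),
    )


# A word box found on a page rendered at 300 DPI.
box_pt = bbox_to_points(300, 150, 240, 60, render_dpi=300)
```

If the DPI used here differs from the DPI the backend actually rendered at, every box is scaled by the wrong factor, which is why the extractor keeps the value as an attribute.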

name

name: str

Return 'ocr' as the extractor identifier.

supported_extensions

supported_extensions: list[str]

Return the list of supported file extensions (currently PDF only).

__init__

__init__(backend: OcrBackend, render_dpi: int | None = None) -> None

Initialize the OCR extractor.

Parameters:
  • backend (OcrBackend) –

    OCR backend instance.

  • render_dpi (int | None, default: None ) –

    DPI at which pages are rendered for OCR. If not provided, attempts to get it from the backend.

extract

extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> ExtractionResult

Extract text from a document using OCR.

Parameters:
  • path (Path) –

    Path to the PDF file.

  • page_numbers (list[int] | None, default: None ) –

    Specific pages to OCR (1-based). If None, OCR all pages.

  • language (str | None, default: None ) –

    Language code for OCR (e.g., "en", "sv").

Returns:
  • ExtractionResult –

    The OCR extraction result.

Raises:

iter_extract

iter_extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:
  • tuple[int, int, list[ExtractedWord]]

    Tuples of (page_number, total_pages, words_for_page).
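A caller might consume the yielded tuples as in the sketch below. Because running the real extractor needs an OCR backend, the example uses `fake_iter_extract`, a hypothetical generator that only mimics the documented tuple shape (with plain strings in place of ExtractedWord objects); `collect_with_progress` is likewise an illustrative helper, not part of this module.

```python
from typing import Callable, Iterable, Iterator

PageResult = tuple[int, int, list[str]]


def fake_iter_extract(pages: dict[int, list[str]]) -> Iterator[PageResult]:
    """Hypothetical stand-in that yields (page_number, total_pages, words)."""
    total_pages = len(pages)
    for page_number in sorted(pages):
        yield page_number, total_pages, pages[page_number]


def collect_with_progress(
    results: Iterable[PageResult], report: Callable[[str], None]
) -> dict[int, list[str]]:
    """Gather words per page, reporting progress as each page arrives."""
    words_by_page: dict[int, list[str]] = {}
    for page_number, total_pages, words in results:
        words_by_page[page_number] = words
        report(f"page {page_number} of {total_pages}: {len(words)} words")
    return words_by_page


messages: list[str] = []
collected = collect_with_progress(
    fake_iter_extract({1: ["Invoice"], 2: ["Total", "due"]}),
    messages.append,
)
```

Passing the reporter as a callback keeps the loop independent of how progress is displayed, whether that is a log line, a progress bar, or a UI event.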