vibe.review.parsing.extraction.ocr.backend

OCR backends for VIBE Review.

OCR is intentionally pluggable: scanned PDFs may be processed by a local binary, by running Tesseract in a Docker container, or by a hosted OCR API/vision model.

OcrError

Error raised when OCR processing fails.

OcrPageResult

OCR output for a single page.

OcrResult

OCR output for an entire document.

full_text

full_text: str

Concatenate all page texts with form-feed delimiters.
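
As a rough illustration of the form-feed joining described above, the sketch below uses hypothetical stand-in classes (PageSketch, ResultSketch) whose fields are assumptions, not the real OcrPageResult or OcrResult definitions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PageSketch:
    """Hypothetical stand-in for a per-page OCR result."""

    page_number: int
    text: str


@dataclass
class ResultSketch:
    """Hypothetical stand-in for a whole-document OCR result."""

    pages: list[PageSketch] = field(default_factory=list)

    @property
    def full_text(self) -> str:
        # "\f" is the ASCII form-feed character, a common page separator.
        return "\f".join(page.text for page in self.pages)


result = ResultSketch([PageSketch(1, "First page"), PageSketch(2, "Second page")])
page_texts = result.full_text.split("\f")
```

Splitting on "\f" recovers the per-page texts, provided no page text contains a form feed itself.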

OcrProgress

Progress update during OCR processing.

Uses BaseProgress fields:

- current: Current page number (1-based)
- total: Total page count
- phase: "render" | "ocr" | "cache"
- message: Human-readable status

OcrBackend

Protocol for OCR backends that emit page results and progress.

iter_pdf_pages

iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]

Iterate over PDF pages, yielding results and progress updates.

Yields OcrProgress for status updates and OcrPageResult for page content. Callers use isinstance() to distinguish between them.

If page_numbers is provided, only those 1-based page numbers are OCR'd.
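
The isinstance() dispatch described above might look like the following sketch. The two dataclasses and FakeBackend are simplified stand-ins written for this example (their fields are assumptions); only the iter_pdf_pages signature is taken from this page.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OcrProgress:  # stand-in, not the real class
    current: int
    total: int
    phase: str
    message: str


@dataclass
class OcrPageResult:  # stand-in, not the real class
    page_number: int
    text: str


class FakeBackend:
    """Hypothetical backend that pretends every PDF has three pages."""

    def iter_pdf_pages(
        self,
        pdf_path: Path,
        *,
        language: str | None,
        page_numbers: Iterable[int] | None = None,
    ) -> Iterator[OcrPageResult | OcrProgress]:
        wanted = sorted(set(page_numbers)) if page_numbers is not None else [1, 2, 3]
        for index, number in enumerate(wanted, start=1):
            yield OcrProgress(index, len(wanted), "ocr", f"OCR page {number}")
            yield OcrPageResult(number, f"text of page {number}")


def collect_pages(backend, pdf_path, *, language=None, page_numbers=None):
    """Separate page results from progress updates using isinstance()."""
    pages, messages = [], []
    stream = backend.iter_pdf_pages(pdf_path, language=language, page_numbers=page_numbers)
    for item in stream:
        if isinstance(item, OcrProgress):
            messages.append(f"[{item.phase}] {item.current}/{item.total}: {item.message}")
        elif isinstance(item, OcrPageResult):
            pages.append(item)
    return pages, messages


pages, messages = collect_pages(FakeBackend(), Path("scan.pdf"), page_numbers=[3, 1])
```

Because the generator interleaves both item types, a caller can update a progress bar while pages are still being produced.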

TesseractDockerBackend

OCR backend that runs Tesseract via docker exec on a running container.

This backend uses the vibe-review-tesseract container from docker-compose.yml. Images are copied to a shared work directory, and Tesseract is invoked via docker exec for efficient batch processing.

Notes:

- The container must be running: docker compose up tesseract
- Swedish and English languages are pre-installed in the container image
- For layout overlays, TSV output provides word positions
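
To show how word positions can be read from Tesseract's TSV output, here is a minimal parsing sketch. The column names follow Tesseract's standard TSV header; WordBox, parse_tesseract_tsv, and the sample rows are made up for illustration and are not part of this module.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """Hypothetical container for one recognized word and its bounding box."""

    text: str
    left: int
    top: int
    width: int
    height: int
    confidence: float


def parse_tesseract_tsv(tsv_text: str) -> list[WordBox]:
    # QUOTE_NONE keeps quote characters in recognized text from confusing the reader.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows describe words; lower levels are pages, blocks, paragraphs, lines.
        if row.get("level") != "5" or not text:
            continue
        words.append(
            WordBox(
                text=text,
                left=int(row["left"]),
                top=int(row["top"]),
                width=int(row["width"]),
                height=int(row["height"]),
                confidence=float(row["conf"]),
            )
        )
    return words


# Invented sample data in Tesseract's TSV layout.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t1000\t1400\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t100\t50\t80\t20\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t190\t50\t90\t20\t91.2\tworld\n"
)
words = parse_tesseract_tsv(SAMPLE)
```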

iter_pdf_pages

iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]

Run Tesseract OCR in Docker container, yielding pages and progress.
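
A docker exec invocation of Tesseract could be assembled roughly as below. This is a sketch, not the backend's actual command: build_tesseract_command and the /work paths are hypothetical, while -l, --oem, --psm, and the tsv config name are standard Tesseract command-line options.

```python
from __future__ import annotations


def build_tesseract_command(
    container: str,
    image_path: str,
    output_base: str,
    *,
    language: str = "eng",
    oem: int | str | None = None,
    psm: int | str | None = None,
    tsv: bool = False,
) -> list[str]:
    """Build an argv list; the caller decides whether and how to run it."""
    cmd = ["docker", "exec", container, "tesseract", image_path, output_base, "-l", language]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if tsv:
        # The "tsv" config makes Tesseract write <output_base>.tsv with word boxes.
        cmd.append("tsv")
    return cmd


# Paths inside the container are assumptions for this example.
command = build_tesseract_command(
    "vibe-review-tesseract",
    "/work/page-0001.png",
    "/work/page-0001",
    language="swe",
    psm=6,
    tsv=True,
)
```

Passing the list to subprocess.run(command, check=True) would execute it; building an argv list rather than a shell string avoids quoting problems with unusual file names.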

PaddleOcrDockerBackend

OCR backend that runs PaddleOCR via docker exec on a running container.

This backend uses the vibe-review-paddleocr container from docker-compose.yml. Images are copied to a shared work directory, and PaddleOCR is invoked via docker exec for efficient batch processing.

Notes:

- The container must be running: docker compose up paddleocr
- English and Chinese models are pre-installed in the container image
- PaddleOCR uses deep learning models (PP-OCRv5) for better accuracy on complex documents, but is slower than Tesseract on CPU

iter_pdf_pages

iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]

Run PaddleOCR in Docker container, yielding pages and progress.

CachingOcrBackend

Cache wrapper around an OCR backend.

iter_pdf_pages

iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]

Return cached OCR results or delegate to inner backend on cache miss.

clear_pdf_cache

clear_pdf_cache(pdf_path: Path) -> int

Remove all cache entries for a PDF file, returning count removed.
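
The cache-or-delegate behaviour and per-PDF clearing can be pictured with the sketch below. The on-disk layout (one folder per PDF content hash, one JSON file per backend and language) and every function name are assumptions made for this example; the real cache format is not documented on this page.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def pdf_digest(pdf_path: Path) -> str:
    # Hash the file content so renamed copies share cache entries (an assumption).
    return hashlib.sha256(pdf_path.read_bytes()).hexdigest()


def load_or_compute(cache_dir: Path, pdf_path: Path, backend_id: str, language: str, compute):
    """Return (pages, was_cached); call compute() only on a cache miss."""
    entry = cache_dir / pdf_digest(pdf_path) / f"{backend_id}-{language}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8")), True
    pages = compute()
    entry.parent.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(pages), encoding="utf-8")
    return pages, False


def clear_entries_for_pdf(cache_dir: Path, pdf_path: Path) -> int:
    """Delete every cached entry for one PDF and return how many were removed."""
    folder = cache_dir / pdf_digest(pdf_path)
    if not folder.is_dir():
        return 0
    removed = 0
    for entry in folder.iterdir():
        entry.unlink()
        removed += 1
    folder.rmdir()
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pdf = root / "scan.pdf"
    pdf.write_bytes(b"%PDF-1.7 demo")
    cache = root / "cache"
    first = load_or_compute(cache, pdf, "tesseract_docker", "swe", lambda: ["page one"])
    second = load_or_compute(cache, pdf, "tesseract_docker", "swe", lambda: ["not used"])
    removed = clear_entries_for_pdf(cache, pdf)
    removed_again = clear_entries_for_pdf(cache, pdf)
```

Including the backend identifier and language in the entry name keeps results from different engines or languages from overwriting each other.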

normalize_tesseract_language

normalize_tesseract_language(language: str | None, pdf_path: Path | None = None) -> str

Convert ISO 639-1 language codes to Tesseract's ISO 639-2/T codes.

If language is None and pdf_path is provided, attempts to auto-detect the language from the filename (e.g., "document.sv.pdf" → Swedish).

Supported languages: English, Swedish, German, French, Spanish
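
The conversion and filename detection described above could work along these lines. The function name, the mapping table, and the fallback to English for unknown input are assumptions of this sketch; the real function may behave differently for unsupported codes.

```python
from __future__ import annotations

from pathlib import Path

# ISO 639-1 code -> Tesseract traineddata name (ISO 639-2/T).
ISO1_TO_TESSERACT = {"en": "eng", "sv": "swe", "de": "deu", "fr": "fra", "es": "spa"}


def normalize_language_sketch(
    language: str | None,
    pdf_path: Path | None = None,
    default: str = "eng",  # assumed fallback
) -> str:
    if language is None and pdf_path is not None:
        # "document.sv.pdf" has suffixes [".sv", ".pdf"]; the hint is second to last.
        suffixes = pdf_path.suffixes
        if len(suffixes) >= 2:
            language = suffixes[-2].lstrip(".")
    if language is None:
        return default
    code = language.strip().lower()
    if code in ISO1_TO_TESSERACT.values():
        return code  # already a Tesseract code
    return ISO1_TO_TESSERACT.get(code, default)


detected = normalize_language_sketch(None, Path("document.sv.pdf"))
```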

normalize_paddleocr_language

normalize_paddleocr_language(language: str | None, pdf_path: Path | None = None) -> str

Convert ISO 639-1 language codes to PaddleOCR language codes.

If language is None and pdf_path is provided, attempts to auto-detect the language from the filename (e.g., "document.sv.pdf" → Swedish).

PaddleOCR supports 80+ languages. Common ones:

- en: English/Latin (also covers Swedish, German, French, Spanish, etc.)
- ch: Chinese (simplified)
- chinese_cht: Chinese (traditional)
- japan: Japanese
- korean: Korean
- german: German (dedicated model, but 'en' works for Latin scripts)
- french: French (dedicated model, but 'en' works for Latin scripts)

For most European languages using Latin script, 'en' works well.
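
The Latin-script fallback can be expressed as a small lookup. The table below is an illustrative guess at how ISO 639-1 codes might map to the PaddleOCR codes listed above; it is not the module's actual mapping, and paddle_language_sketch is a hypothetical name.

```python
from __future__ import annotations

# Assumed mapping; codes absent here fall back to the Latin-script "en" model.
ISO1_TO_PADDLE = {"zh": "ch", "ja": "japan", "ko": "korean", "de": "german", "fr": "french"}
PADDLE_CODES = {"en", "ch", "chinese_cht", "japan", "korean", "german", "french"}


def paddle_language_sketch(language: str | None) -> str:
    if language is None:
        return "en"
    code = language.strip().lower()
    if code in PADDLE_CODES:
        return code  # already a PaddleOCR code
    # Falling back to "en" is only sensible for Latin-script languages such as
    # Swedish or Spanish; a real implementation would handle other scripts explicitly.
    return ISO1_TO_PADDLE.get(code, "en")


swedish = paddle_language_sketch("sv")
```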

ocr_backend_id

ocr_backend_id(backend: OcrBackend) -> str

Return a stable identifier for an OCR backend.

create_ocr_backend

create_ocr_backend(key: str | None = 'tesseract_docker', *, container_name: str | None = None, work_dir: Path | None = None, cache_dir: Path | None | object = _DEFAULT_CACHE, cache_reporter: Callable[[str, Path, str], None] | None = None, render_dpi: int | None = None, tesseract_oem: int | str | None = None, tesseract_psm: int | str | None = None, tesseract_user_words: dict[str, str | Path] | str | Path | None = None) -> OcrBackend | None

Create an OCR backend by key.

Parameters:
  • key (str | None, default: 'tesseract_docker' ) –

    Backend identifier. Defaults to "tesseract_docker". Options: "tesseract_docker", "paddleocr_docker". Pass None or "none" to explicitly disable OCR.

  • container_name (str | None, default: None ) –

    Docker container name. Defaults to "vibe-review-tesseract" for tesseract_docker or "vibe-review-paddleocr" for paddleocr_docker.

  • work_dir (Path | None, default: None ) –

    Shared work directory for images. Defaults to .vibe_data/work/ocr which matches the standard Docker container mount.

  • cache_dir (Path | None | object, default: _DEFAULT_CACHE ) –

    Directory for caching OCR results. Defaults to .vibe_data/cache/ocr. Pass None to disable caching.

  • cache_reporter (Callable[[str, Path, str], None] | None, default: None ) –

    Optional callback for cache status reporting.

  • render_dpi (int | None, default: None ) –

    Override render DPI for page-to-image conversion.

  • tesseract_oem (int | str | None, default: None ) –

    Tesseract OCR engine mode override (0-3).

  • tesseract_psm (int | str | None, default: None ) –

    Tesseract page segmentation mode override (0-13).

  • tesseract_user_words (dict[str, str | Path] | str | Path | None, default: None ) –

    Optional per-language wordlist mapping (or single path).

Returns:
  • OcrBackend | None

    OCR backend instance (wrapped with CachingOcrBackend if cache_dir is set), or None if key is None/"none".
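
The cache_dir parameter uses a sentinel default so that "argument omitted" (use the default directory) can be told apart from an explicit None (disable caching). The sketch below imitates that pattern along with key handling. describe_backend is hypothetical and returns a plain dict rather than a backend; raising ValueError for unknown keys is an assumption.

```python
from __future__ import annotations

from pathlib import Path

_DEFAULT_CACHE = object()  # sentinel: distinguishes "not passed" from None

# Default container names as documented in the parameter list above.
DEFAULT_CONTAINERS = {
    "tesseract_docker": "vibe-review-tesseract",
    "paddleocr_docker": "vibe-review-paddleocr",
}


def resolve_cache_dir(cache_dir: Path | None | object = _DEFAULT_CACHE) -> Path | None:
    if cache_dir is _DEFAULT_CACHE:
        return Path(".vibe_data/cache/ocr")  # documented default
    if cache_dir is None:
        return None  # caching explicitly disabled
    return Path(cache_dir)


def describe_backend(
    key: str | None = "tesseract_docker",
    *,
    container_name: str | None = None,
    cache_dir: Path | None | object = _DEFAULT_CACHE,
) -> dict | None:
    """Hypothetical stand-in that reports what a factory would configure."""
    if key is None or key == "none":
        return None  # OCR explicitly disabled
    if key not in DEFAULT_CONTAINERS:
        raise ValueError(f"unknown OCR backend key: {key!r}")  # assumed behaviour
    return {
        "key": key,
        "container": container_name or DEFAULT_CONTAINERS[key],
        "cache_dir": resolve_cache_dir(cache_dir),
    }


default_config = describe_backend()
uncached_config = describe_backend("paddleocr_docker", cache_dir=None)
disabled = describe_backend("none")
```

A plain None default could not express both meanings at once, which is why the signature types cache_dir as Path | None | object.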