vibe.review.parsing.extraction.ocr.backend¶
OCR backends for VIBE Review.
OCR is intentionally pluggable: scanned PDFs may be processed by a local binary, by running Tesseract in a Docker container, or by a hosted OCR API/vision model.
OcrError ¶
Error raised when OCR processing fails.
OcrPageResult ¶
OCR output for a single page.
OcrResult ¶
OCR output for an entire document.
OcrProgress ¶
Progress update during OCR processing.
Uses BaseProgress fields:
- current: Current page number (1-based)
- total: Total page count
- phase: "render" | "ocr" | "cache"
- message: Human-readable status
OcrBackend ¶
Protocol for OCR backends that emit page results and progress.
iter_pdf_pages ¶
iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]
Iterate over PDF pages, yielding results and progress updates.
Yields OcrProgress for status updates and OcrPageResult for page content. Callers use isinstance() to distinguish between them.
If page_numbers is provided, only those 1-based page numbers are OCR'd.
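A caller of this protocol might separate the two yielded types as in the sketch below. `Progress`, `PageResult`, and `fake_backend` are hypothetical stand-ins for OcrProgress, OcrPageResult, and a real backend; the `page_number` and `text` fields are assumptions, since this page does not list the fields of OcrPageResult.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator


@dataclass
class Progress:  # hypothetical stand-in for OcrProgress
    current: int
    total: int
    phase: str
    message: str


@dataclass
class PageResult:  # hypothetical stand-in for OcrPageResult; fields are assumed
    page_number: int
    text: str


def fake_backend(
    pdf_path: Path,
    *,
    language: str | None,
    page_numbers: Iterable[int] | None = None,
) -> Iterator[PageResult | Progress]:
    """Toy generator with the same call shape as iter_pdf_pages."""
    total = 3  # pretend the PDF has three pages
    if page_numbers is not None:
        wanted = sorted(set(page_numbers))  # 1-based page numbers
    else:
        wanted = list(range(1, total + 1))
    for page in wanted:
        yield Progress(current=page, total=total, phase="ocr", message=f"page {page}")
        yield PageResult(page_number=page, text=f"text of page {page}")


def collect_text(stream: Iterable[PageResult | Progress]) -> dict[int, str]:
    """Separate page content from progress updates with isinstance()."""
    pages: dict[int, str] = {}
    for item in stream:
        if isinstance(item, Progress):
            continue  # a real caller might update a progress bar here
        pages[item.page_number] = item.text
    return pages


pages = collect_text(fake_backend(Path("scan.pdf"), language="sv", page_numbers=[2, 3]))
```

Interleaving both types in one iterator lets a single loop drive a progress display and collect page content, at the cost of an isinstance() check per item.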
TesseractDockerBackend ¶
OCR backend that runs Tesseract via docker exec on a running container.
This backend uses the vibe-review-tesseract container from docker-compose.yml. Images are copied to a shared work directory, and Tesseract is invoked via docker exec for efficient batch processing.
Notes:
- The container must be running: docker compose up tesseract
- Swedish and English languages are pre-installed in the container image
- For layout overlays, TSV output provides word positions
iter_pdf_pages ¶
iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]
Run Tesseract OCR in Docker container, yielding pages and progress.
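As a rough illustration of the docker exec approach, the sketch below only builds an argument list for one page image; it is not the backend's actual invocation. The helper name `build_tesseract_exec` and the `/work/...` paths are hypothetical. The Tesseract options used (`-l`, `--oem`, `--psm`, and the `tsv` config name) are standard Tesseract CLI arguments.

```python
from __future__ import annotations


def build_tesseract_exec(
    container: str,
    image_path: str,
    output_base: str,
    *,
    language: str = "eng",
    psm: int | None = None,
    oem: int | None = None,
) -> list[str]:
    """Build an argv list for running Tesseract inside a running container."""
    argv = ["docker", "exec", container, "tesseract", image_path, output_base, "-l", language]
    if oem is not None:
        argv += ["--oem", str(oem)]
    if psm is not None:
        argv += ["--psm", str(psm)]
    argv.append("tsv")  # the 'tsv' config makes Tesseract write <output_base>.tsv
    return argv


# Paths are as seen from inside the container (assumed shared work directory).
cmd = build_tesseract_exec(
    "vibe-review-tesseract",
    "/work/page-0001.png",
    "/work/page-0001",
    language="swe",
    psm=6,
)
```

A list like `cmd` could then be passed to `subprocess.run(cmd, check=True)` once the container is up.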
PaddleOcrDockerBackend ¶
OCR backend that runs PaddleOCR via docker exec on a running container.
This backend uses the vibe-review-paddleocr container from docker-compose.yml. Images are copied to a shared work directory, and PaddleOCR is invoked via docker exec for efficient batch processing.
Notes:
- The container must be running: docker compose up paddleocr
- English and Chinese models are pre-installed in the container image
- PaddleOCR uses deep learning models (PP-OCRv5) for better accuracy on complex documents, but is slower than Tesseract on CPU
iter_pdf_pages ¶
iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]
Run PaddleOCR in Docker container, yielding pages and progress.
CachingOcrBackend ¶
Cache wrapper around an OCR backend.
iter_pdf_pages ¶
iter_pdf_pages(pdf_path: Path, *, language: str | None, page_numbers: Iterable[int] | None = None) -> Iterator[OcrPageResult | OcrProgress]
Return cached OCR results or delegate to inner backend on cache miss.
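The hit-or-delegate behaviour can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the content-hash key, the JSON file layout, and the `run_ocr` callable are not taken from CachingOcrBackend, whose actual keying scheme is not described on this page.

```python
from __future__ import annotations

import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(pdf_path: Path, language: str | None) -> str:
    """Derive a key from file contents and language (an assumed keying scheme)."""
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    return f"{digest}-{language or 'auto'}"


def cached_ocr(
    pdf_path: Path,
    language: str | None,
    cache_dir: Path,
    run_ocr: Callable[[Path, str | None], list[str]],
) -> list[str]:
    """Return cached page texts, or call run_ocr and store its result on a miss."""
    entry = cache_dir / f"{cache_key(pdf_path, language)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    pages = run_ocr(pdf_path, language)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(pages), encoding="utf-8")
    return pages


# Demonstration with a fake OCR function that records each call.
calls: list[Path] = []


def fake_ocr(path: Path, language: str | None) -> list[str]:
    calls.append(path)
    return ["page one", "page two"]


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "scan.pdf"
    pdf.write_bytes(b"%PDF-1.4 fake")
    cache = Path(tmp) / "cache"
    first = cached_ocr(pdf, "sv", cache, fake_ocr)   # miss: runs fake_ocr
    second = cached_ocr(pdf, "sv", cache, fake_ocr)  # hit: reads the JSON entry
```

Keying on file contents rather than the path means a renamed PDF still hits the cache, while an edited one misses; a real implementation may choose differently.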
clear_pdf_cache ¶
clear_pdf_cache(pdf_path: Path) -> int
Remove all cache entries for a PDF file, returning count removed.
normalize_tesseract_language ¶
normalize_tesseract_language(language: str | None, pdf_path: Path | None = None) -> str
Convert ISO 639-1 language codes to Tesseract's ISO 639-2/T codes.
If language is None and pdf_path is provided, attempts to auto-detect the language from the filename (e.g., "document.sv.pdf" → Swedish).
Supported languages: English, Swedish, German, French, Spanish
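The conversion and filename detection described above could look roughly like the following sketch. The function name `normalize_language_sketch` and the fallback to `"eng"` for missing or unknown codes are assumptions; the real function's fallback behaviour is not documented here. The mapping covers only the five listed languages.

```python
from __future__ import annotations

from pathlib import Path

# ISO 639-1 -> Tesseract language names for the five listed languages.
_ISO1_TO_TESSERACT = {"en": "eng", "sv": "swe", "de": "deu", "fr": "fra", "es": "spa"}


def normalize_language_sketch(
    language: str | None,
    pdf_path: Path | None = None,
    default: str = "eng",  # assumed fallback
) -> str:
    if language is None and pdf_path is not None:
        suffixes = pdf_path.suffixes  # "document.sv.pdf" -> [".sv", ".pdf"]
        if len(suffixes) >= 2:
            language = suffixes[-2].lstrip(".")
    if language is None:
        return default
    code = language.strip().lower()
    if code in _ISO1_TO_TESSERACT.values():
        return code  # already a three-letter Tesseract code
    return _ISO1_TO_TESSERACT.get(code, default)
```

For example, `normalize_language_sketch(None, Path("document.sv.pdf"))` resolves the `.sv` infix to `"swe"`.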
normalize_paddleocr_language ¶
normalize_paddleocr_language(language: str | None, pdf_path: Path | None = None) -> str
Convert ISO 639-1 language codes to PaddleOCR language codes.
If language is None and pdf_path is provided, attempts to auto-detect the language from the filename (e.g., "document.sv.pdf" → Swedish).
PaddleOCR supports 80+ languages. Common ones:
- en: English/Latin (also covers Swedish, German, French, Spanish, etc.)
- ch: Chinese (simplified)
- chinese_cht: Chinese (traditional)
- japan: Japanese
- korean: Korean
- german: German (dedicated model, but 'en' works for Latin scripts)
- french: French (dedicated model, but 'en' works for Latin scripts)
For most European languages using Latin script, 'en' works well.
ocr_backend_id ¶
ocr_backend_id(backend: OcrBackend) -> str
Return a stable identifier for an OCR backend.
create_ocr_backend ¶
create_ocr_backend(key: str | None = 'tesseract_docker', *, container_name: str | None = None, work_dir: Path | None = None, cache_dir: Path | None | object = _DEFAULT_CACHE, cache_reporter: Callable[[str, Path, str], None] | None = None, render_dpi: int | None = None, tesseract_oem: int | str | None = None, tesseract_psm: int | str | None = None, tesseract_user_words: dict[str, str | Path] | str | Path | None = None) -> OcrBackend | None
Create an OCR backend by key.
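The `cache_dir: Path | None | object = _DEFAULT_CACHE` parameter in the signature suggests a sentinel default, which lets a function tell "argument omitted" apart from an explicit None. The sketch below shows that general pattern only; the meaning assigned to each case (a default directory when omitted, caching disabled for None) and the `.ocr-cache` path are assumptions, not documented behaviour of create_ocr_backend.

```python
from __future__ import annotations

from pathlib import Path

_DEFAULT = object()  # unique sentinel; cannot collide with any caller-supplied value


def resolve_cache_dir(cache_dir: Path | None | object = _DEFAULT) -> Path | None:
    if cache_dir is _DEFAULT:
        return Path(".ocr-cache")  # hypothetical default location
    if cache_dir is None:
        return None  # assumed meaning: caching explicitly disabled
    return Path(cache_dir)
```

Comparing with `is` rather than `==` is what makes the sentinel reliable, since no other object shares its identity.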