vibe.review.parsing.extraction.ocr.extractor
OCR extraction layer - converts OCR results to ExtractedWord[].
This module integrates the OCR backend with the parsing pipeline by converting Tesseract TSV output into ExtractedWord objects with proper coordinates, confidence scores, and provenance tracking.
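The conversion described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the `Word` dataclass is a hypothetical stand-in for `ExtractedWord` (whose real fields are not shown here), and `words_from_tsv` is an assumed helper name. The column names are those of standard Tesseract TSV output, where level 5 rows are words and `conf` is `-1` for non-word rows.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for ExtractedWord; the real fields may differ."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    confidence: float      # normalised from Tesseract's 0-100 scale to 0.0-1.0
    source: str = "ocr"    # assumed provenance tag


def words_from_tsv(tsv: str) -> list[Word]:
    """Convert Tesseract TSV text into word records (illustrative only)."""
    reader = csv.DictReader(
        io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words: list[Word] = []
    for row in reader:
        # Levels 1-4 describe page, block, paragraph and line; 5 is a word.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        # Skip empty tokens and rows Tesseract marks with conf == -1.
        if not text or conf < 0:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append(
            Word(
                text=text,
                page=int(row["page_num"]),
                x0=left,
                y0=top,
                x1=left + width,
                y1=top + height,
                confidence=conf / 100.0,
            )
        )
    return words


SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t1000\t800\t-1\t",
    "5\t1\t1\t1\t1\t1\t100\t50\t80\t20\t96\tInvoice",
    "5\t1\t1\t1\t1\t2\t190\t50\t40\t20\t88.5\t#42",
])

words = words_from_tsv(SAMPLE_TSV)
```

In this sketch the coordinates stay in pixels of the rendered page image. Whether the real extractor rescales them (for example, using `render_dpi`) is not stated here.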
OcrExtractor
Extract text from scanned documents using OCR.
Converts OCR backend output (Tesseract TSV) into ExtractedWord objects compatible with the rest of the parsing pipeline.
| Attributes: |
|
|---|
supported_extensions
supported_extensions: list[str]
Return list of supported file extensions (PDF only for now).
__init__
__init__(backend: OcrBackend, render_dpi: int | None = None) -> None
Initialize the OCR extractor.
| Parameters: | Type | Default |
|---|---|---|
| backend | OcrBackend | required |
| render_dpi | int \| None | None |
extract
extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> ExtractionResult
Extract text from a document using OCR.
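| Parameters: | Type | Default |
|---|---|---|
| path | Path | required |
| page_numbers | list[int] \| None | None |
| language | str \| None | None |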
| Returns: |
|---|
| ExtractionResult |
| Raises: |
|
|---|
iter_extract
iter_extract(path: Path, page_numbers: list[int] | None = None, language: str | None = None) -> Iterator[tuple[int, int, list[ExtractedWord]]]
Extract text with per-page progress.
| Yields: |
|---|
| tuple[int, int, list[ExtractedWord]] |
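A caller might consume a per-page iterator of this shape roughly as sketched below. The meaning of the two integers is not documented here, so the sketch assumes they are the page number and the total page count. `fake_iter_extract` is a hypothetical stub that only mimics the yielded tuple shape, with plain strings standing in for `ExtractedWord` objects.

```python
from collections.abc import Iterable, Iterator


def fake_iter_extract(
    pages: dict[int, list[str]],
) -> Iterator[tuple[int, int, list[str]]]:
    """Hypothetical stub: yields (page_number, total_pages, words).

    The interpretation of the two integers is an assumption.
    """
    total = len(pages)
    for page_number in sorted(pages):
        yield page_number, total, pages[page_number]


def collect_with_progress(
    stream: Iterable[tuple[int, int, list[str]]],
) -> tuple[list[str], list[str]]:
    """Accumulate words page by page and record one progress line per page."""
    all_words: list[str] = []
    progress: list[str] = []
    for page_number, total_pages, words in stream:
        all_words.extend(words)
        progress.append(
            f"page {page_number}/{total_pages}: {len(words)} words"
        )
    return all_words, progress


all_words, progress = collect_with_progress(
    fake_iter_extract({1: ["Invoice", "#42"], 2: ["Total", "due", "now"]})
)
```

Because the stream is consumed lazily, a caller can update a progress bar or stop early between pages instead of waiting for the whole document.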