vibe.review.parsing.extraction.base¶
Base types for the extraction layer.
ExtractedWord is the fundamental unit: a single word/token with full positional and typographic information. This granularity enables:
- Precise layout analysis (column detection, table detection)
- Font-based heading detection
- Reading order reconstruction
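As a rough illustration of font-based heading detection on word-level data, the sketch below flags words set noticeably larger than the most common font size. The `Word` dataclass, its field names, and the `ratio` threshold are all assumptions made for this example; they are not the documented attributes of ExtractedWord.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for ExtractedWord; every field name is an assumption."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def find_heading_words(words: list[Word], ratio: float = 1.2) -> list[Word]:
    """Return words whose font size is at least `ratio` times the dominant size."""
    if not words:
        return []
    # Treat the most common font size as the body-text size.
    body_size, _ = Counter(w.font_size for w in words).most_common(1)[0]
    return [w for w in words if w.font_size >= body_size * ratio]


words = [
    Word("Introduction", 1, 72.0, 90.0, 170.0, 108.0, 18.0),
    Word("This", 1, 72.0, 120.0, 95.0, 131.0, 11.0),
    Word("is", 1, 98.0, 120.0, 106.0, 131.0, 11.0),
    Word("body", 1, 109.0, 120.0, 133.0, 131.0, 11.0),
]
headings = find_heading_words(words)
print([w.text for w in headings])
```

A real heading detector would likely also consider font weight and vertical spacing; this only shows why per-word typographic data makes such a heuristic possible.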
ExtractedWord ¶
A single word/token extracted from a document.
This is the atomic unit of extraction - every piece of text in the document is represented as one or more ExtractedWords with full positional and typographic information.
| Attributes: |
|
|---|
PageInfo ¶
ExtractionResult ¶
Result of extracting text from a document.
Contains all extracted words plus metadata about the extraction process and the source document.
| Attributes: |
|
|---|
Extractor ¶
Abstract base class for document extractors.
Each extractor handles a specific document format and produces an ExtractionResult with words and page metadata.
supported_extensions ¶
supported_extensions: list[str]
File extensions this extractor handles (e.g., ['.pdf']).
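One plausible use of `supported_extensions` is choosing an extractor for a given file. The sketch below assumes extensions are stored lowercase with a leading dot, as in the `['.pdf']` example; `StubExtractor` and `pick_extractor` are hypothetical names that are not part of this module.

```python
from pathlib import Path
from typing import Optional


class StubExtractor:
    """Hypothetical stand-in that exposes only `supported_extensions`."""

    def __init__(self, name: str, supported_extensions: list[str]) -> None:
        self.name = name
        self.supported_extensions = supported_extensions


def pick_extractor(path: Path, extractors: list[StubExtractor]) -> Optional[StubExtractor]:
    """Return the first extractor whose extensions include the file's suffix."""
    suffix = path.suffix.lower()  # normalise so '.PDF' matches '.pdf'
    for extractor in extractors:
        if suffix in extractor.supported_extensions:
            return extractor
    return None


extractors = [
    StubExtractor("pdf", [".pdf"]),
    StubExtractor("word", [".docx", ".doc"]),
]
chosen = pick_extractor(Path("Report.PDF"), extractors)
print(chosen.name if chosen else "no extractor available")
```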
extract ¶
extract(path: Path) -> ExtractionResult
Extract text from a document.
| Parameters: | `path` (`Path`) |
|---|---|
| Returns: | `ExtractionResult` |
|---|---|
| Raises: | `ExtractionError`: when document extraction fails |
|---|---|
iter_extract ¶
iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]
Extract text with per-page progress.
| Yields: | `tuple[int, int, list[ExtractedWord]]` |
|---|---|
The default implementation extracts the whole document first, then yields its words one page at a time. Subclasses can override it for true streaming.
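The "extract all, then yield per page" behaviour could look roughly like the sketch below. It assumes the yielded tuple means (page number, total pages, words on that page), that each word records a 1-based page number, and that the result knows its page count; none of this is confirmed by the documentation above. `SketchWord`, `SketchResult`, `SketchExtractor`, and `InMemoryExtractor` are hypothetical stand-ins, not the real classes.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class SketchWord:
    """Hypothetical stand-in for ExtractedWord."""
    text: str
    page: int  # assumed to be a 1-based page number


@dataclass
class SketchResult:
    """Hypothetical stand-in for ExtractionResult."""
    words: list[SketchWord]
    page_count: int


class SketchExtractor(ABC):
    @abstractmethod
    def extract(self, path: Path) -> SketchResult:
        """Extract the whole document in one call."""

    def iter_extract(self, path: Path) -> Iterator[tuple[int, int, list[SketchWord]]]:
        # Non-streaming default: extract everything, then group words by page.
        result = self.extract(path)
        by_page: dict[int, list[SketchWord]] = defaultdict(list)
        for word in result.words:
            by_page[word.page].append(word)
        for page in range(1, result.page_count + 1):
            # Pages without words still yield, so progress reaches the total.
            yield page, result.page_count, by_page.get(page, [])


class InMemoryExtractor(SketchExtractor):
    """Toy subclass that returns a prepared result instead of reading a file."""

    def __init__(self, result: SketchResult) -> None:
        self._result = result

    def extract(self, path: Path) -> SketchResult:
        return self._result


demo = InMemoryExtractor(
    SketchResult(
        words=[SketchWord("Hello", 1), SketchWord("world", 1), SketchWord("End", 3)],
        page_count=3,
    )
)
for page, total, page_words in demo.iter_extract(Path("example.pdf")):
    print(f"page {page}/{total}: {[w.text for w in page_words]}")
```

A subclass aiming for true streaming would override `iter_extract` to parse and yield one page at a time rather than calling `extract` first.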
ExtractionError ¶
Raised when document extraction fails.