vibe.review.parsing.extraction.base¶

Base types for the extraction layer.

ExtractedWord is the fundamental unit: a single word/token with full positional and typographic information. This granularity enables:

- Precise layout analysis (column detection, table detection)
- Font-based heading detection
- Reading order reconstruction
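As an illustration of font-based heading detection, the sketch below flags words whose font size is noticeably larger than the median size. The `Word` class, the `likely_heading_words` function, and the `ratio` threshold are hypothetical stand-ins for this example, not part of this module.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    """Hypothetical stand-in for ExtractedWord with only the fields used here."""

    text: str
    page: int
    font_size: float | None = None


def likely_heading_words(words: list[Word], ratio: float = 1.2) -> list[Word]:
    """Return words whose font size exceeds the median size times `ratio`."""
    sizes = [w.font_size for w in words if w.font_size is not None]
    if not sizes:
        return []
    threshold = median(sizes) * ratio
    return [w for w in words if w.font_size is not None and w.font_size > threshold]


words = [
    Word("Introduction", 1, 18.0),
    Word("This", 1, 10.0),
    Word("is", 1, 10.0),
    Word("body", 1, 10.0),
    Word("text", 1, 10.0),
]
headings = likely_heading_words(words)
```

Words with no font size are skipped rather than guessed at, which matches the `float | None` typing of `font_size`.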

ExtractedWord ¶

A single word/token extracted from a document.

This is the atomic unit of extraction: every piece of text in the document is represented as one or more ExtractedWords, each carrying full positional and typographic information.

Attributes:

text (str) –

The word/token text content.
page (int) –

1-based page number.
bbox (BBox) –

Bounding box in points (72 points = 1 inch).
font_name (str | None) –

Font family name (if available).
font_size (float | None) –

Font size in points (if available).
font_flags (int) –

Font style flags (bold, italic, etc.).
color (tuple[float, float, float] | None) –

Text color as RGB tuple (if available).
confidence (float | None) –

OCR confidence score in the range 0.0-1.0; None for text-based PDFs, where no OCR is involved.
block_id (int | None) –

Source block ID from extractor (for grouping).
line_id (int | None) –

Source line ID from extractor (for grouping).

is_bold ¶

is_bold: bool

Whether the font is bold.

is_italic ¶

is_italic: bool

Whether the font is italic.

is_monospace ¶

is_monospace: bool

Whether the font is monospace.
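Style properties like these are often derived from the integer `font_flags` with bit tests. The sketch below assumes a PyMuPDF-style bit layout (italic = bit 1, monospace = bit 3, bold = bit 4); this page does not state which layout the module actually uses, so the constants here are an assumption.

```python
# Assumed bit layout (PyMuPDF-style span flags); not confirmed by this module.
FLAG_ITALIC = 1 << 1
FLAG_MONOSPACE = 1 << 3
FLAG_BOLD = 1 << 4


def is_bold(font_flags: int) -> bool:
    """True if the bold bit is set."""
    return bool(font_flags & FLAG_BOLD)


def is_italic(font_flags: int) -> bool:
    """True if the italic bit is set."""
    return bool(font_flags & FLAG_ITALIC)


def is_monospace(font_flags: int) -> bool:
    """True if the monospace bit is set."""
    return bool(font_flags & FLAG_MONOSPACE)


bold_italic = FLAG_BOLD | FLAG_ITALIC
```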

to_dict ¶

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict ¶

from_dict(d: dict[str, Any]) -> ExtractedWord

Create from dictionary.
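A round trip through `to_dict` and `from_dict` might look like the sketch below. `WordSketch` is a hypothetical, reduced stand-in for ExtractedWord, and representing `bbox` as a 4-tuple of floats is an assumption about `BBox`.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class WordSketch:
    """Hypothetical, reduced stand-in for ExtractedWord."""

    text: str
    page: int
    bbox: tuple[float, float, float, float]  # assumed shape of BBox
    font_size: float | None = None

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> WordSketch:
        # JSON has no tuple type, so bbox comes back as a list; restore it.
        return cls(
            text=d["text"],
            page=d["page"],
            bbox=tuple(d["bbox"]),
            font_size=d.get("font_size"),
        )


word = WordSketch("Hello", 1, (72.0, 72.0, 100.0, 84.0), 12.0)
restored = WordSketch.from_dict(json.loads(json.dumps(word.to_dict())))
```

Keeping the dictionary limited to JSON-friendly types is what lets the result pass through `json.dumps` unchanged.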

PageInfo ¶

Metadata about a single page.

to_dict ¶

to_dict() -> dict[str, Any]

Convert to dictionary.

from_dict ¶

from_dict(d: dict[str, Any]) -> PageInfo

Create from dictionary.

ExtractionResult ¶

Result of extracting text from a document.

Contains all extracted words plus metadata about the extraction process and the source document.

Attributes:

words (list[ExtractedWord]) –

All extracted words, ordered by page then reading position.
pages (list[PageInfo]) –

Metadata about each page (dimensions, rotation).
source_path (str | None) –

Path to the source document.
source_type (str) –

Type of source ("pdf", "ocr", "image").
extractor_name (str) –

Name of the extractor used.
metadata (dict[str, Any]) –

Additional extraction metadata.

words_on_page ¶

words_on_page(page: int) -> list[ExtractedWord]

Get all words on a specific page.

page_count ¶

page_count() -> int

Get total number of pages.

word_count ¶

word_count() -> int

Get total number of words.

get_page_info ¶

get_page_info(page: int) -> PageInfo | None

Get info for a specific page (1-based).
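The page-oriented helpers can be pictured with the minimal sketch below. The class names and the `PageSketch` fields (`number`, `width`, `height`) are hypothetical; the actual fields of PageInfo are not listed on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PageSketch:
    """Hypothetical stand-in for PageInfo; field names are assumptions."""

    number: int  # 1-based page number
    width: float
    height: float


@dataclass
class WordSketch:
    text: str
    page: int


@dataclass
class ResultSketch:
    """Hypothetical stand-in for ExtractionResult."""

    words: list[WordSketch] = field(default_factory=list)
    pages: list[PageSketch] = field(default_factory=list)

    def words_on_page(self, page: int) -> list[WordSketch]:
        return [w for w in self.words if w.page == page]

    def get_page_info(self, page: int) -> PageSketch | None:
        # Linear scan; returns None when the page number is unknown.
        for info in self.pages:
            if info.number == page:
                return info
        return None


result = ResultSketch(
    words=[WordSketch("a", 1), WordSketch("b", 2), WordSketch("c", 2)],
    pages=[PageSketch(1, 612.0, 792.0), PageSketch(2, 612.0, 792.0)],
)
```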

to_dict ¶

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict ¶

from_dict(d: dict[str, Any]) -> ExtractionResult

Create from dictionary.

Extractor ¶

Abstract base class for document extractors.

Each extractor handles a specific document format and produces an ExtractionResult with words and page metadata.

name ¶

name: str

Unique name identifying this extractor.

supported_extensions ¶

supported_extensions: list[str]

File extensions this extractor handles (e.g., ['.pdf']).

extract ¶

extract(path: Path) -> ExtractionResult

Extract text from a document.

Parameters:	`path` (`Path`) – Path to the document file.

Returns:	`ExtractionResult` – All extracted words plus page and extraction metadata.

Raises:	`ExtractionError` – If extraction fails.
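One way a subclass might honor this contract is sketched below. Everything here is a hypothetical stand-in: `ExtractorSketch`, `PlainTextExtractor`, and `safe_extract` are invented for illustration, and `extract` returns a plain list of strings instead of an ExtractionResult to keep the example short.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the module's ExtractionError."""


class ExtractorSketch(ABC):
    """Hypothetical, reduced version of the Extractor base class."""

    name: str
    supported_extensions: list[str]

    @abstractmethod
    def extract(self, path: Path) -> list[str]:
        """Return the words found in the document at `path`."""


class PlainTextExtractor(ExtractorSketch):
    name = "plain-text"
    supported_extensions = [".txt"]

    def extract(self, path: Path) -> list[str]:
        try:
            text = path.read_text(encoding="utf-8")
        except OSError as exc:
            # Wrap low-level failures so callers only handle one exception type.
            raise ExtractionError(f"could not read {path}") from exc
        return text.split()


def safe_extract(extractor: ExtractorSketch, path: Path) -> list[str] | None:
    """Return the extracted words, or None if extraction fails."""
    try:
        return extractor.extract(path)
    except ExtractionError:
        return None


missing = safe_extract(PlainTextExtractor(), Path("/nonexistent/definitely-missing.txt"))
```

Wrapping I/O errors in a single exception type means callers do not need to know which library or file API a given extractor uses internally.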

iter_extract ¶

iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:	`tuple[int, int, list[ExtractedWord]]` – Tuples of (page_number, total_pages, words_for_page).

The default implementation extracts the whole document first, then yields the words one page at a time. Subclasses can override it for true streaming.
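The extract-then-yield behavior can be sketched as a grouping step over an already extracted word list. `WordSketch` and `iter_pages` are hypothetical names, and yielding an empty list for pages without words is an assumption about the default behavior.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator


@dataclass
class WordSketch:
    """Hypothetical stand-in for ExtractedWord."""

    text: str
    page: int  # 1-based


def iter_pages(
    words: list[WordSketch], total_pages: int
) -> Iterator[tuple[int, int, list[WordSketch]]]:
    """Yield (page_number, total_pages, words_for_page) for each page in order."""
    by_page: dict[int, list[WordSketch]] = defaultdict(list)
    for word in words:
        by_page[word.page].append(word)
    for page_number in range(1, total_pages + 1):
        # Assumption: pages without words still yield, with an empty list.
        yield page_number, total_pages, by_page.get(page_number, [])


words = [WordSketch("a", 1), WordSketch("b", 1), WordSketch("c", 3)]
chunks = list(iter_pages(words, total_pages=3))
```

A caller can use the first two tuple elements to drive a progress indicator, for example `page_number / total_pages`.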

can_handle ¶

can_handle(path: Path) -> bool

Check if this extractor can handle a file.
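A plausible implementation compares the file suffix against `supported_extensions`. This is a sketch under that assumption: the page does not say how the check is performed, and the case-insensitive comparison is a design choice of this example. `ExtensionMatcher` is a hypothetical class name.

```python
from pathlib import Path


class ExtensionMatcher:
    """Hypothetical sketch of an extension-based can_handle check."""

    supported_extensions = [".pdf"]

    def can_handle(self, path: Path) -> bool:
        # Lowercase the suffix so "report.PDF" matches ".pdf".
        return path.suffix.lower() in self.supported_extensions


matcher = ExtensionMatcher()
handles_pdf = matcher.can_handle(Path("report.PDF"))
handles_txt = matcher.can_handle(Path("notes.txt"))
```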

ExtractionError ¶

Raised when document extraction fails.