vibe.review.parsing.extraction.base

Base types for the extraction layer.

ExtractedWord is the fundamental unit - a single word/token with full positional and typographic information. This granularity enables:
  • Precise layout analysis (column detection, table detection)
  • Font-based heading detection
  • Reading order reconstruction

ExtractedWord

A single word/token extracted from a document.

This is the atomic unit of extraction - every piece of text in the document is represented as one or more ExtractedWords with full positional and typographic information.

Attributes:
  • text (str) –

    The word/token text content.

  • page (int) –

    1-based page number.

  • bbox (BBox) –

    Bounding box in points (72 points = 1 inch).

  • font_name (str | None) –

    Font family name (if available).

  • font_size (float | None) –

    Font size in points (if available).

  • font_flags (int) –

    Font style flags (bold, italic, etc.).

  • color (tuple[float, float, float] | None) –

    Text color as RGB tuple (if available).

  • confidence (float | None) –

    OCR confidence score (0.0-1.0); None for text PDFs, which involve no OCR.

  • block_id (int | None) –

    Source block ID from extractor (for grouping).

  • line_id (int | None) –

    Source line ID from extractor (for grouping).
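
As an illustration of how block_id and line_id can be used for grouping, the sketch below rebuilds text lines from individual words. The Word class is a hypothetical stand-in for ExtractedWord with only a subset of its attributes, and the bbox layout (x0, y0, x1, y1) is an assumption; the real BBox type may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical stand-in for ExtractedWord (subset of its attributes)."""
    text: str
    page: int
    bbox: tuple  # assumed layout: (x0, y0, x1, y1) in points
    block_id: Optional[int] = None
    line_id: Optional[int] = None


def group_into_lines(words):
    """Join words sharing (page, block_id, line_id), ordered left to right."""
    groups = defaultdict(list)
    for w in words:
        groups[(w.page, w.block_id, w.line_id)].append(w)
    lines = []
    for key in groups:  # dicts keep first-seen order
        ordered = sorted(groups[key], key=lambda w: w.bbox[0])
        lines.append(" ".join(w.text for w in ordered))
    return lines


words = [
    Word("world", 1, (110.0, 72.0, 150.0, 84.0), block_id=0, line_id=0),
    Word("Hello", 1, (72.0, 72.0, 105.0, 84.0), block_id=0, line_id=0),
    Word("Next", 1, (72.0, 90.0, 100.0, 102.0), block_id=0, line_id=1),
]
lines = group_into_lines(words)
print(lines)  # ['Hello world', 'Next']
```

Words whose block_id or line_id is None all fall into one group per page in this sketch, so an extractor that does not supply IDs would need a geometric fallback based on bbox.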

is_bold

is_bold: bool

Check if font is bold.

is_italic

is_italic: bool

Check if font is italic.

is_monospace

is_monospace: bool

Check if font is monospace.
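
The bit layout of font_flags is not documented here. As a rough sketch, the three properties above could be derived from a bitmask like this, assuming the PyMuPDF span-flag convention (italic = 2, monospaced = 8, bold = 16). The constant names and the describe helper are hypothetical.

```python
# Assumed bit layout, borrowed from PyMuPDF span flags.
# The actual extractor may use different values.
FLAG_ITALIC = 1 << 1     # 2
FLAG_MONOSPACE = 1 << 3  # 8
FLAG_BOLD = 1 << 4       # 16


def is_bold(font_flags: int) -> bool:
    return bool(font_flags & FLAG_BOLD)


def is_italic(font_flags: int) -> bool:
    return bool(font_flags & FLAG_ITALIC)


def is_monospace(font_flags: int) -> bool:
    return bool(font_flags & FLAG_MONOSPACE)


def describe(font_flags: int) -> list:
    """Return the style names set in font_flags (hypothetical helper)."""
    checks = [("bold", is_bold), ("italic", is_italic), ("monospace", is_monospace)]
    return [name for name, check in checks if check(font_flags)]


styles = describe(FLAG_BOLD | FLAG_ITALIC)
print(styles)  # ['bold', 'italic']
```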

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, Any]) -> ExtractedWord

Create from dictionary.
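
A minimal sketch of the to_dict / from_dict round trip, using a hypothetical Word dataclass in place of ExtractedWord. The dictionary keys and the tuple-based bbox are assumptions; the real serialized layout may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class Word:
    """Hypothetical stand-in for ExtractedWord."""
    text: str
    page: int
    bbox: tuple  # assumed layout: (x0, y0, x1, y1) in points
    font_size: Optional[float] = None
    confidence: Optional[float] = None

    def to_dict(self) -> dict:
        d = asdict(self)
        d["bbox"] = list(self.bbox)  # JSON has no tuple type
        return d

    @classmethod
    def from_dict(cls, d: Any) -> "Word":
        return cls(
            text=d["text"],
            page=d["page"],
            bbox=tuple(d["bbox"]),
            font_size=d.get("font_size"),
            confidence=d.get("confidence"),
        )


original = Word("Hello", 1, (72.0, 72.0, 105.0, 84.0), font_size=11.0)
payload = json.dumps(original.to_dict())
restored = Word.from_dict(json.loads(payload))
print(restored == original)  # True
```

Using d.get for the optional fields lets from_dict accept dictionaries that omit them, which is one way to stay compatible with older serialized data.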

PageInfo

Metadata about a single page.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary.

from_dict

from_dict(d: dict[str, Any]) -> PageInfo

Create from dictionary.

ExtractionResult

Result of extracting text from a document.

Contains all extracted words plus metadata about the extraction process and the source document.

Attributes:
  • words (list[ExtractedWord]) –

    All extracted words, ordered by page then reading position.

  • pages (list[PageInfo]) –

    Metadata about each page (dimensions, rotation).

  • source_path (str | None) –

    Path to the source document.

  • source_type (str) –

    Type of source ("pdf", "ocr", "image").

  • extractor_name (str) –

    Name of the extractor used.

  • metadata (dict[str, Any]) –

    Additional extraction metadata.

words_on_page

words_on_page(page: int) -> list[ExtractedWord]

Get all words on a specific page.

page_count

page_count() -> int

Get total number of pages.

word_count

word_count() -> int

Get total number of words.

get_page_info

get_page_info(page: int) -> PageInfo | None

Get info for a specific page (1-based).
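
The lookup helpers above can be sketched as simple scans over the words and pages lists. Word, Page, and Result are hypothetical stand-ins for ExtractedWord, PageInfo, and ExtractionResult; the Page fields (number, width, height, rotation) are assumptions, since the PageInfo attributes are not listed here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Word:
    text: str
    page: int  # 1-based


@dataclass
class Page:
    number: int  # 1-based; field names are assumed
    width: float
    height: float
    rotation: int = 0


@dataclass
class Result:
    words: list = field(default_factory=list)
    pages: list = field(default_factory=list)

    def words_on_page(self, page: int) -> list:
        return [w for w in self.words if w.page == page]

    def page_count(self) -> int:
        return len(self.pages)

    def word_count(self) -> int:
        return len(self.words)

    def get_page_info(self, page: int) -> Optional[Page]:
        for p in self.pages:
            if p.number == page:
                return p
        return None  # unknown page number


result = Result(
    words=[Word("Title", 1), Word("Body", 2), Word("text", 2)],
    pages=[Page(1, 612.0, 792.0), Page(2, 612.0, 792.0)],
)
page_two = [w.text for w in result.words_on_page(2)]
print(page_two)                 # ['Body', 'text']
print(result.get_page_info(3))  # None
```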

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

from_dict

from_dict(d: dict[str, Any]) -> ExtractionResult

Create from dictionary.

Extractor

Abstract base class for document extractors.

Each extractor handles a specific document format and produces an ExtractionResult with words and page metadata.

name

name: str

Unique name identifying this extractor.

supported_extensions

supported_extensions: list[str]

File extensions this extractor handles (e.g., ['.pdf']).

extract

extract(path: Path) -> ExtractionResult

Extract text from a document.

Parameters:
  • path (Path) –

    Path to the document file.

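Returns:
  • ExtractionResult

    Extraction result containing the words and page metadata.

Raises:
  • ExtractionError

    If document extraction fails.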

iter_extract

iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract text with per-page progress.

Yields:
  • tuple[int, int, list[ExtractedWord]]

    Tuples of (page_number, total_pages, words_for_page).

Default implementation extracts all then yields per page. Subclasses can override for true streaming.
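
The "extract all, then yield per page" behaviour might look roughly like the sketch below. Word is a hypothetical stand-in for ExtractedWord, and iter_pages takes the already-extracted words and the page count directly rather than calling a real extractor. Yielding an empty list for pages without words is an assumption.

```python
from collections import defaultdict
from typing import Iterator, NamedTuple


class Word(NamedTuple):
    """Hypothetical stand-in for ExtractedWord."""
    text: str
    page: int  # 1-based


def iter_pages(words, total_pages: int) -> Iterator[tuple]:
    """Yield (page_number, total_pages, words_for_page) for every page."""
    by_page = defaultdict(list)
    for w in words:
        by_page[w.page].append(w)
    for page in range(1, total_pages + 1):
        # Pages without words still yield, so progress stays monotonic.
        yield page, total_pages, by_page.get(page, [])


words = [Word("a", 1), Word("b", 1), Word("c", 3)]
progress = [
    (page, total, [w.text for w in page_words])
    for page, total, page_words in iter_pages(words, total_pages=3)
]
print(progress)  # [(1, 3, ['a', 'b']), (2, 3, []), (3, 3, ['c'])]
```

Because the whole document is extracted before the first tuple is yielded, this form reports progress but does not reduce peak memory; a subclass that reads one page at a time would be needed for that.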

can_handle

can_handle(path: Path) -> bool

Check if this extractor can handle a file.
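
One plausible way for can_handle to work is a case-insensitive suffix check against supported_extensions. That is an assumption rather than the documented behaviour; SketchExtractor and pick_extractor are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Optional


class SketchExtractor:
    """Hypothetical extractor exposing only the attributes used here."""
    name = "sketch-pdf"
    supported_extensions = [".pdf"]

    def can_handle(self, path: Path) -> bool:
        # Compare the lowercased suffix so "REPORT.PDF" also matches.
        return path.suffix.lower() in self.supported_extensions


def pick_extractor(path: Path, extractors: list) -> Optional[SketchExtractor]:
    """Return the first extractor that accepts the file, or None."""
    for extractor in extractors:
        if extractor.can_handle(path):
            return extractor
    return None


registry = [SketchExtractor()]
chosen = pick_extractor(Path("report.PDF"), registry)
print(chosen.name if chosen else None)               # sketch-pdf
print(pick_extractor(Path("notes.txt"), registry))   # None
```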

ExtractionError

Raised when document extraction fails.
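
A common pattern is to wrap low-level failures in ExtractionError so callers only need to catch one exception type. The sketch below assumes ExtractionError derives from Exception (the base class is not stated here) and uses a hypothetical extract_text function that reads a plain text file.

```python
from pathlib import Path


class ExtractionError(Exception):
    """Stand-in for the real ExtractionError; base class is assumed."""


def extract_text(path: Path) -> str:
    """Hypothetical extractor body that reads a UTF-8 text file."""
    try:
        return path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Chain the original exception so the root cause stays visible.
        raise ExtractionError(f"failed to extract {path}: {exc}") from exc


try:
    extract_text(Path("this-file-should-not-exist.txt"))
    failed = False
except ExtractionError as err:
    failed = True
    cause = err.__cause__

print(failed)  # True
```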