vibe.review.parsing.extraction.pdf

PDF text extraction using PyMuPDF.

Extracts words with full positional and typographic information from text-layer PDFs. For scanned/image PDFs, use OCR extraction instead.

PdfExtractor

Extract text from PDF files using PyMuPDF (fitz).

This extractor produces word-level output with full positional and typographic information. It's optimized for text-layer PDFs; for scanned documents, use OCR extraction.

Attributes:
  • extract_fonts

    Whether to extract font information.

  • extract_colors

    Whether to extract text colors.

name

name: str

Return 'pymupdf' as the extractor identifier.

supported_extensions

supported_extensions: list[str]

Return list of supported file extensions (PDF only).

__init__

__init__(extract_fonts: bool = True, extract_colors: bool = False) -> None

Initialize the PDF extractor.

Parameters:
  • extract_fonts (bool, default: True) –

    If True, include font name, size, and flags.

  • extract_colors (bool, default: False) –

    If True, include text color information.
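To make the font flags and color values more concrete, here is a minimal sketch of how such values can be decoded. It assumes the extractor passes through PyMuPDF's own span conventions (an integer bit field for flags and a packed sRGB integer for color), which this reference does not confirm. The helper names `decode_flags` and `decode_color` are hypothetical and not part of the documented API.

```python
# Bit values PyMuPDF documents for a text span's "flags" integer.
FLAG_BITS = {
    "superscript": 1 << 0,
    "italic": 1 << 1,
    "serifed": 1 << 2,
    "monospaced": 1 << 3,
    "bold": 1 << 4,
}


def decode_flags(flags: int) -> frozenset[str]:
    """Return the style names whose bits are set in a span flags value."""
    return frozenset(name for name, bit in FLAG_BITS.items() if flags & bit)


def decode_color(srgb: int) -> tuple[int, int, int]:
    """Split a packed 0xRRGGBB integer into (red, green, blue) components."""
    return (srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF


# Assumption: these inputs mimic what a PyMuPDF span might carry.
styles = decode_flags(18)       # bits for italic (2) and bold (16)
rgb = decode_color(0xFF8000)    # an orange tone
```

If the extractor exposes these values in another form (for example, already decoded), the helpers above would not be needed.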

extract

extract(path: Path, pages: set[int] | None = None) -> ExtractionResult

Extract words from a PDF.

Parameters:
  • path (Path) –

    Path to the PDF file.

  • pages (set[int] | None, default: None) –

    Optional set of 1-based page numbers to extract. If None, all pages are extracted.
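As an illustration of building the `pages` argument, the sketch below turns a human-friendly page specification into a set of 1-based page numbers. The helper `parse_page_spec` and the file name `report.pdf` are hypothetical; only the `extract(path, pages)` signature and the module path come from this reference.

```python
def parse_page_spec(spec: str) -> set[int]:
    """Turn a string such as "1-3,5" into a set of 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return pages


pages = parse_page_spec("1-3, 5")

# With the package installed, the set could then be passed along as:
#   from pathlib import Path
#   from vibe.review.parsing.extraction.pdf import PdfExtractor
#   result = PdfExtractor().extract(Path("report.pdf"), pages=pages)
```

Validating the numbers before the call keeps 0-based or reversed ranges from reaching the extractor, whose behavior for such input is not described here.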

Returns:
  • ExtractionResult

    The words extracted from the requested pages.

Raises:

iter_extract

iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract words one page at a time, yielding after each page so callers can report progress.

Yields:
  • tuple[int, int, list[ExtractedWord]]

    Tuples of (page_number, total_pages, words_for_page).
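The yielded tuples can drive a progress indicator. The sketch below is one possible consumer, written against the documented tuple shape only. `with_progress` and `fake_iter_extract` are hypothetical names, and the stand-in generator uses plain strings in place of `ExtractedWord` objects so the example runs without the package or a PDF file. Progress is computed by counting yielded pages rather than from `page_number`, since this reference does not say whether that value is 0-based or 1-based.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def with_progress(
    pages: Iterable[tuple[int, int, list[Any]]],
) -> Iterator[tuple[float, list[Any]]]:
    """Yield (fraction_done, words) for each (page_number, total_pages, words)."""
    done = 0
    for _page_number, total_pages, words in pages:
        done += 1
        # Guard against a zero total so the division cannot fail.
        fraction = done / total_pages if total_pages else 1.0
        yield fraction, words


def fake_iter_extract() -> Iterator[tuple[int, int, list[str]]]:
    """Stand-in for iter_extract(path); strings replace ExtractedWord objects."""
    yield 1, 2, ["Hello", "world"]
    yield 2, 2, ["Second", "page"]


progress = [(round(f, 2), len(w)) for f, w in with_progress(fake_iter_extract())]
```

With the real extractor, `fake_iter_extract()` would presumably be replaced by `PdfExtractor().iter_extract(path)`.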