vibe.review.parsing.extraction.pdf¶

PDF text extraction using PyMuPDF.

Extracts words with full positional and typographic information from text-layer PDFs. For scanned/image PDFs, use OCR extraction instead.

PdfExtractor ¶

Extract text from PDF files using PyMuPDF (fitz).

This extractor produces word-level output with full positional and typographic information. It's optimized for text-layer PDFs; for scanned documents, use OCR extraction.

Attributes:

- `extract_fonts` – Whether to extract font information.
- `extract_colors` – Whether to extract text colors.

name ¶

name: str

Return 'pymupdf' as the extractor identifier.

supported_extensions ¶

supported_extensions: list[str]

Return the list of supported file extensions (PDF only).

__init__ ¶

__init__(extract_fonts: bool = True, extract_colors: bool = False) -> None

Initialize the PDF extractor.

Parameters:

- `extract_fonts` (`bool`, default: `True`) – If True, include font name, size, and flags.
- `extract_colors` (`bool`, default: `False`) – If True, include text color information.
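The font flags and text colors enabled by these options are most useful once decoded. The sketch below assumes the extractor passes PyMuPDF's raw span values through unchanged, which this page does not confirm: `flags` as PyMuPDF's bit field and the color as a packed sRGB integer. The helper names `decode_font_flags` and `decode_srgb` are hypothetical and not part of this module.

```python
# Assumption: word-level font flags and colors carry PyMuPDF's raw span values.
# The helper names below are hypothetical, not part of PdfExtractor.

# Bit values PyMuPDF documents for a text span's "flags" integer.
FONT_FLAG_BITS = {
    "superscript": 1 << 0,
    "italic": 1 << 1,
    "serifed": 1 << 2,
    "monospaced": 1 << 3,
    "bold": 1 << 4,
}


def decode_font_flags(flags: int) -> list[str]:
    """Return the names of the style bits set in a flags integer."""
    return [name for name, bit in FONT_FLAG_BITS.items() if flags & bit]


def decode_srgb(color: int) -> tuple[int, int, int]:
    """Split a packed 0xRRGGBB integer into (red, green, blue)."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF
```

If the extractor normalizes these values before returning them, the decoding step would not apply as written.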

extract ¶

extract(path: Path, pages: set[int] | None = None) -> ExtractionResult

Extract words from a PDF.

Parameters:

- `path` (`Path`) – Path to the PDF file.
- `pages` (`set[int] | None`, default: `None`) – Optional set of 1-based page numbers to extract. If None, all pages are extracted.

Returns:	`ExtractionResult` – The extracted words and page metadata.

Raises:	`ExtractionError` – If extraction fails.
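Because `pages` is 1-based while PyMuPDF indexes pages from 0, a conversion step is needed somewhere between the caller and the library. The following is a minimal sketch of that conversion, not the extractor's actual code: `to_zero_based` is a hypothetical helper, and raising `ValueError` for out-of-range pages is an assumption, since this page does not say how such pages are handled.

```python
from __future__ import annotations


def to_zero_based(pages: set[int] | None, page_count: int) -> list[int]:
    """Convert 1-based page numbers to sorted 0-based page indices.

    None selects every page, mirroring the documented default.
    """
    if pages is None:
        return list(range(page_count))
    # Assumption: reject pages outside 1..page_count instead of skipping them.
    out_of_range = sorted(p for p in pages if not 1 <= p <= page_count)
    if out_of_range:
        raise ValueError(f"pages out of range 1..{page_count}: {out_of_range}")
    return sorted(p - 1 for p in pages)
```

For a five-page document, `to_zero_based({3, 1}, 5)` selects indices 0 and 2, which correspond to the first and third pages.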

iter_extract ¶

iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]

Extract words with per-page progress.

Yields:	`tuple[int, int, list[ExtractedWord]]` – Tuples of (page_number, total_pages, words_for_page).
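The yielded tuples can drive a simple progress report while collecting words. The sketch below consumes any iterable with that tuple shape. `collect_with_progress` and `fake_stream` are hypothetical names, and plain strings stand in for `ExtractedWord` objects, whose fields are not described here.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


def collect_with_progress(page_stream: Iterable[tuple[int, int, list]]) -> list:
    """Gather words from (page_number, total_pages, words) tuples, printing progress."""
    all_words: list = []
    for page_number, total_pages, words in page_stream:
        all_words.extend(words)
        print(f"page {page_number}/{total_pages}: {len(words)} words")
    return all_words


def fake_stream() -> Iterator[tuple[int, int, list[str]]]:
    """Hypothetical stand-in for PdfExtractor.iter_extract(path)."""
    yield 1, 2, ["Hello", "world"]
    yield 2, 2, ["Goodbye"]


words = collect_with_progress(fake_stream())
```

With the real extractor, the same helper would presumably be called as `collect_with_progress(PdfExtractor().iter_extract(path))`.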