vibe.review.parsing.extraction.pdf
PDF text extraction using PyMuPDF.
Extracts words with full positional and typographic information from text-layer PDFs. For scanned/image PDFs, use OCR extraction instead.
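As a rough illustration of word-level extraction with PyMuPDF, the sketch below converts the tuples returned by `page.get_text("words")` into small records. The `Word` dataclass and both function names are hypothetical and are not this module's `ExtractedWord` or `PdfExtractor`. The sketch covers positions only; typographic details such as font and size would have to come from a richer PyMuPDF output such as `page.get_text("dict")`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record; the real ExtractedWord may carry more fields."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def words_from_tuples(page_index, raw_words):
    """Convert PyMuPDF word tuples into Word records.

    page.get_text("words") returns tuples shaped like
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    return [
        Word(text=w[4], page=page_index, x0=w[0], y0=w[1], x1=w[2], y1=w[3])
        for w in raw_words
    ]


def extract_words(path):
    """Read every page of a text-layer PDF and return its words."""
    import fitz  # PyMuPDF, imported lazily so words_from_tuples has no dependency

    words = []
    with fitz.open(str(path)) as doc:
        for page_index, page in enumerate(doc):
            words.extend(words_from_tuples(page_index, page.get_text("words")))
    return words
```

A scanned PDF without a text layer would typically produce an empty list here, which is why such files are routed to OCR extraction instead.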
PdfExtractor
Extract text from PDF files using PyMuPDF (fitz).
This extractor produces word-level output with full positional and typographic information. It's optimized for text-layer PDFs; for scanned documents, use OCR extraction.
| Attributes: | Type |
|---|---|
| `supported_extensions` | `list[str]` |
supported_extensions
supported_extensions: list[str]
Return the list of supported file extensions (PDF only).
__init__
extract
extract(path: Path, pages: set[int] | None = None) -> ExtractionResult
Extract words from a PDF.
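| Parameters: | Type | Default |
|---|---|---|
| `path` | `Path` | required |
| `pages` | `set[int] \| None` | `None` |

| Returns: |
|---|
| `ExtractionResult` |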
| Raises: |
|
|---|
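The `pages` argument is a set of page numbers, so a contiguous span has to be expanded before the call. The sketch below shows one way to do that. `RecordingExtractor` and `extract_page_span` are hypothetical names introduced so the example runs without the library. Whether page numbers are 0-based or 1-based is not stated here, so the helper passes the numbers through unchanged.

```python
from pathlib import Path
from typing import Optional


class RecordingExtractor:
    """Stand-in for PdfExtractor that only records how extract() was called."""

    def __init__(self):
        self.calls = []

    def extract(self, path: Path, pages: Optional[set[int]] = None):
        self.calls.append((path, pages))
        # Placeholder return value, not a real ExtractionResult.
        return {"path": path, "pages": pages}


def extract_page_span(extractor, path, first, last):
    """Call extract() for the inclusive page span first..last."""
    if first > last:
        raise ValueError("first must not be greater than last")
    return extractor.extract(path, pages=set(range(first, last + 1)))


extractor = RecordingExtractor()
extract_page_span(extractor, Path("example.pdf"), 2, 4)  # passes pages={2, 3, 4}
extractor.extract(Path("example.pdf"))                   # passes pages=None
```

With the real class, the same helper should work unchanged as long as `extract` keeps the signature documented above.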
iter_extract
iter_extract(path: Path) -> Iterator[tuple[int, int, list[ExtractedWord]]]
Extract words with per-page progress.
| Yields: |
|---|
| `tuple[int, int, list[ExtractedWord]]` |
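A consumer of `iter_extract` might report progress as each item arrives and accumulate the words at the same time. The sketch below assumes the two integers in each yielded tuple are the current page number and the total page count. That reading is a guess based on "per-page progress" and is not confirmed by the signature. `StubExtractor` and `collect_with_progress` are hypothetical names, and the stub yields plain strings in place of `ExtractedWord` objects.

```python
from pathlib import Path


class StubExtractor:
    """Stand-in for PdfExtractor, used only to make this sketch runnable."""

    def __init__(self, pages):
        self._pages = pages

    def iter_extract(self, path):
        total = len(self._pages)
        # Assumed tuple layout: (page_number, total_pages, words).
        for number, words in enumerate(self._pages, start=1):
            yield number, total, words


def collect_with_progress(extractor, path, report):
    """Drain iter_extract(), calling report() once per yielded item."""
    all_words = []
    for page_number, total_pages, words in extractor.iter_extract(path):
        report(page_number, total_pages)
        all_words.extend(words)
    return all_words


seen = []
stub = StubExtractor([["alpha", "beta"], ["gamma"]])
words = collect_with_progress(
    stub, Path("example.pdf"), lambda page, total: seen.append((page, total))
)
```

Because the iterator is lazy, the callback runs as each item is produced, so a progress bar can update before the whole document has been read.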