vibe.review.document_sources ¶

Document sources for unified ingestion streaming.

Each DocumentSource encapsulates all document-type-specific complexity, yielding progress updates followed by a final IngestionResult. The service layer has a single unified streaming method that iterates over the source, converts progress updates to server-sent events (SSE), and handles embeddings after receiving the result.

Flow for each source:

- MarkdownSource: parse → store → yield IngestionResult
- PdfSource: analyze → extract + OCR → parse → store → yield IngestionResult
- DocxSource: render → parse → store → update bboxes → yield IngestionResult

DocumentSource ¶

Protocol for document sources that yield progress and results.

Each source encapsulates document-type-specific ingestion logic. The service layer iterates through the source, converting progress to SSE events and handling embeddings after the final result.
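The iteration contract described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `Progress` class, the fields on `IngestionResult`, the `FakeMarkdownSource` class, the `stream_ingestion` function, and the use of a synchronous iterator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Union


@dataclass
class Progress:
    """Hypothetical progress update (name and fields are assumptions)."""
    stage: str
    percent: int


@dataclass
class IngestionResult:
    """Stand-in for the real IngestionResult; fields are assumptions."""
    document_id: int
    part_count: int


class DocumentSource(Protocol):
    """Anything that yields progress items and then one final result."""

    def __iter__(self) -> Iterator[Union[Progress, IngestionResult]]: ...


class FakeMarkdownSource:
    """Toy source following parse -> store -> yield IngestionResult."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __iter__(self) -> Iterator[Union[Progress, IngestionResult]]:
        yield Progress(stage="parse", percent=50)
        parts = [p for p in self.text.split("\n\n") if p.strip()]
        yield Progress(stage="store", percent=100)
        yield IngestionResult(document_id=1, part_count=len(parts))


def stream_ingestion(source: DocumentSource) -> Iterator[str]:
    """Convert progress items to SSE frames and handle the result last."""
    for item in source:
        if isinstance(item, Progress):
            yield f"event: progress\ndata: {item.stage} {item.percent}\n\n"
        else:
            # A real service layer would compute embeddings at this point.
            yield f"event: done\ndata: {item.part_count} parts\n\n"


events = list(stream_ingestion(FakeMarkdownSource("# Title\n\nBody text")))
```

Because the service loop only distinguishes progress items from the final result, a new document type can be added without touching the streaming code.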

MarkdownSource ¶

Source for markdown/text documents.

Flow: parse → store → yield IngestionResult

PdfSource ¶

Source for PDF documents.

Flow: analyze → extract pages + OCR pages → parse → store → yield IngestionResult

Handles per-page OCR detection internally.
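The reference does not say how pages are selected for OCR. One common heuristic, shown here purely as an assumption, is to OCR any page whose embedded text layer is nearly empty. The `PageAnalysis` class, the `pages_needing_ocr` function, and the `min_chars` threshold are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class PageAnalysis:
    """Hypothetical per-page result of the analyze step."""
    page_number: int
    text_chars: int  # characters found in the embedded text layer


def pages_needing_ocr(pages: list[PageAnalysis], min_chars: int = 20) -> list[int]:
    """Return page numbers whose text layer is too sparse to rely on."""
    return [p.page_number for p in pages if p.text_chars < min_chars]


pages = [PageAnalysis(1, 1800), PageAnalysis(2, 0), PageAnalysis(3, 5)]
ocr_pages = pages_needing_ocr(pages)
```

Pages not in the returned list would go through plain text extraction, so a mixed document pays the OCR cost only where it is needed.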

DocxSource ¶

Source for DOCX documents.

Flow: render → parse → store → update bboxes → yield IngestionResult

If a DOCX converter is available, renders the document first for high-fidelity display, then parses and stores.

match_segments_to_bboxes ¶

match_segments_to_bboxes(segments: Iterable[tuple[str, str]], *, words: list[ExtractedWord], dpi: int, anchor_words: int = 6) -> dict[str, list[dict[str, object]]]

Best-effort mapping: assign each segment (part_id, content) to PDF bboxes.

This function finds where each document part appears in the rendered PDF by matching word sequences. Used to enable highlighting in the document viewer.

Parameters:

- `segments` (`Iterable[tuple[str, str]]`) – Iterable of (part_id, content) tuples.
- `words` (`list[ExtractedWord]`) – Extracted words from PdfExtractor.
- `dpi` (`int`) – DPI of the rendered images (for coordinate conversion).
- `anchor_words` (`int`, default: `6`) – Number of words to use as anchor for matching.

Returns: `dict[str, list[dict[str, object]]]` – `{part_id: [{"page_number": int, "bbox": {"x0":..,"y0":..,"x1":..,"y1":..}}, ...]}`
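The anchor-matching idea can be illustrated with a simplified sketch. It is not the function's actual implementation: the `Word` class stands in for `ExtractedWord`, and its fields, the exact-match comparison, and the assumption that word coordinates are in PDF points (72 per inch) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Stand-in for ExtractedWord; fields and units are assumptions."""
    text: str
    page_number: int
    x0: float
    y0: float
    x1: float
    y1: float


def find_anchor(words: list[Word], content: str, anchor_words: int = 6) -> Optional[int]:
    """Return the index in `words` where the segment's first words occur."""
    anchor = [w.lower() for w in content.split()[:anchor_words]]
    if not anchor:
        return None
    tokens = [w.text.lower() for w in words]
    for i in range(len(tokens) - len(anchor) + 1):
        if tokens[i:i + len(anchor)] == anchor:
            return i
    return None


def union_bbox(span: list[Word], dpi: int) -> dict[str, float]:
    """Merge word boxes and scale from points (72 per inch) to pixels."""
    scale = dpi / 72
    return {
        "x0": min(w.x0 for w in span) * scale,
        "y0": min(w.y0 for w in span) * scale,
        "x1": max(w.x1 for w in span) * scale,
        "y1": max(w.y1 for w in span) * scale,
    }


words = [
    Word("Intro", 1, 72, 72, 100, 84),
    Word("The", 1, 72, 100, 90, 112),
    Word("quick", 1, 94, 100, 124, 112),
    Word("fox", 1, 128, 100, 146, 112),
]
start = find_anchor(words, "the quick fox", anchor_words=3)
bbox = union_bbox(words[start:start + 3], dpi=144)
```

A best-effort matcher would likely also normalize punctuation and tolerate partial matches. This sketch returns `None` when the anchor is not found, leaving the caller to skip that segment.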

create_document_source ¶

create_document_source(*, ingester: DocumentIngester, session_id: int, document: DocumentModel, docx_converter: DocxConverterBackend | None = None, db_session: Session | None = None, doc_render_dpi: int = 200) -> DocumentSource

Create the appropriate document source.

Parameters:

- `ingester` (`DocumentIngester`) – The document ingester.
- `session_id` (`int`) – Review session ID.
- `document` (`DocumentModel`) – The document model.
- `docx_converter` (`DocxConverterBackend | None`, default: `None`) – Optional DOCX converter backend.
- `db_session` (`Session | None`, default: `None`) – Database session (required for DOCX).
- `doc_render_dpi` (`int`, default: `200`) – DPI for DOCX-to-PDF rendering layout extraction.

Returns: `DocumentSource` – Appropriate DocumentSource for the document type.
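How the factory chooses a source is not documented here. The sketch below assumes dispatch by file extension and returns the source class name as a string to stay self-contained. The `Doc` class, its `filename` field, the `pick_source_kind` function, and the choice to raise `ValueError` are hypothetical; only the rule that `db_session` is required for DOCX comes from the parameter list above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Doc:
    """Stand-in for DocumentModel; the `filename` field is an assumption."""
    filename: str


def pick_source_kind(document: Doc, db_session: Optional[object] = None) -> str:
    """Choose a source by file extension (hypothetical dispatch rule)."""
    suffix = Path(document.filename).suffix.lower()
    if suffix == ".pdf":
        return "PdfSource"
    if suffix == ".docx":
        # The parameter list states that db_session is required for DOCX.
        if db_session is None:
            raise ValueError("db_session is required for DOCX documents")
        return "DocxSource"
    # Everything else is treated as markdown/text in this sketch.
    return "MarkdownSource"


kinds = [
    pick_source_kind(Doc("report.PDF")),
    pick_source_kind(Doc("notes.md")),
    pick_source_kind(Doc("contract.docx"), db_session=object()),
]
```

Validating the DOCX requirement inside the factory surfaces a missing session before any ingestion work starts, rather than partway through the stream.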