vibe.review.document_sources

Document sources for unified ingestion streaming.

Each DocumentSource encapsulates all document-type-specific complexity, yielding progress updates followed by a final IngestionResult. The service layer has one unified streaming method that iterates over the source, converts progress updates to SSE events, and handles embeddings after receiving the result.

Flow for each source:
- MarkdownSource: parse → store → yield IngestionResult
- PdfSource: analyze → extract + OCR → parse → store → yield IngestionResult
- DocxSource: render → parse → store → update bboxes → yield IngestionResult

DocumentSource

Protocol for document sources that yield progress and results.

Each source encapsulates document-type-specific ingestion logic. The service layer iterates through the source, converting progress to SSE events and handling embeddings after the final result.
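The iteration contract described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `stream` method name, the `Progress` dataclass, the fields on `IngestionResult`, and the `FakeMarkdownSource` class are all hypothetical, and the real protocol may well be asynchronous.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator, Protocol, Union


@dataclass
class Progress:
    """Hypothetical progress update emitted while a source is working."""
    stage: str
    percent: int


@dataclass
class IngestionResult:
    """Hypothetical stand-in for the real IngestionResult."""
    part_count: int


class DocumentSource(Protocol):
    """Assumed shape: progress updates first, exactly one final result last."""

    def stream(self) -> Iterator[Union[Progress, IngestionResult]]: ...


class FakeMarkdownSource:
    """Toy source following the parse -> store -> yield result flow."""

    def __init__(self, text: str) -> None:
        self.text = text

    def stream(self) -> Iterator[Union[Progress, IngestionResult]]:
        yield Progress("parse", 0)
        parts = [p for p in self.text.split("\n\n") if p.strip()]
        yield Progress("store", 50)  # a real source would persist parts here
        yield IngestionResult(part_count=len(parts))


def stream_ingestion(source: DocumentSource) -> Iterator[str]:
    """Single service-layer loop: format every yielded item as an SSE frame."""
    for item in source.stream():
        if isinstance(item, IngestionResult):
            # Embeddings would be handled at this point, after the result.
            payload = json.dumps({"parts": item.part_count})
            yield f"event: result\ndata: {payload}\n\n"
        else:
            yield f"event: progress\ndata: {json.dumps(asdict(item))}\n\n"
```

Because the service loop only depends on the protocol, adding a new document type under this sketch means writing one more source class and leaving the loop untouched.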

MarkdownSource

Source for markdown/text documents.

Flow: parse → store → yield IngestionResult

PdfSource

Source for PDF documents.

Flow: analyze → extract pages + OCR pages → parse → store → yield IngestionResult

Handles per-page OCR detection internally.

DocxSource

Source for DOCX documents.

Flow: render → parse → store → update bboxes → yield IngestionResult

If a DOCX converter is available, renders the document first for high-fidelity display, then parses and stores.

match_segments_to_bboxes

match_segments_to_bboxes(segments: Iterable[tuple[str, str]], *, words: list[ExtractedWord], dpi: int, anchor_words: int = 6) -> dict[str, list[dict[str, object]]]

Best-effort mapping: assign each segment (part_id, content) to PDF bboxes.

This function finds where each document part appears in the rendered PDF by matching word sequences. Used to enable highlighting in the document viewer.

Parameters:
  • segments (Iterable[tuple[str, str]]) –

    Iterable of (part_id, content) tuples.

  • words (list[ExtractedWord]) –

    Extracted words from PdfExtractor.

  • dpi (int) –

    DPI of the rendered images (for coordinate conversion).

  • anchor_words (int, default: 6) –

    Number of words to use as anchor for matching.

Returns:
  • dict[str, list[dict[str, object]]]

    {part_id: [{"page_number": int, "bbox": {"x0":..,"y0":..,"x1":..,"y1":..}}, ...]}

create_document_source

create_document_source(*, ingester: DocumentIngester, session_id: int, document: DocumentModel, docx_converter: DocxConverterBackend | None = None, db_session: Session | None = None, doc_render_dpi: int = 200) -> DocumentSource

Create the appropriate document source.

Parameters:
  • ingester (DocumentIngester) –

    The document ingester

  • session_id (int) –

    Review session ID

  • document (DocumentModel) –

    The document model

  • docx_converter (DocxConverterBackend | None, default: None) –

    Optional DOCX converter backend

  • db_session (Session | None, default: None) –

    Database session (required for DOCX)

  • doc_render_dpi (int, default: 200) –

    DPI used when extracting layout from the DOCX-to-PDF rendering

Returns: