vibe.review.document_sources
Document sources for unified ingestion streaming.
Each DocumentSource encapsulates all document-type-specific complexity, yielding progress updates followed by a final IngestionResult. The service layer has a single unified streaming method that iterates over the source, converts each progress update to a Server-Sent Events (SSE) message, and handles embeddings once it has received the result.
Flow for each source:
- MarkdownSource: parse → store → yield IngestionResult
- PdfSource: analyze → extract + OCR → parse → store → yield IngestionResult
- DocxSource: render → parse → store → update bboxes → yield IngestionResult
DocumentSource
Protocol for document sources that yield progress and results.
Each source encapsulates document-type-specific ingestion logic. The service layer iterates through the source, converting progress to SSE events and handling embeddings after the final result.
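The iteration contract described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `Progress` type, the fields on `IngestionResult`, the `stream()` method name, and the `to_sse` and `stream_ingestion` helpers are all assumptions made for the example, and the real protocol may well be asynchronous.

```python
import json
from dataclasses import dataclass
from typing import Iterator, Protocol, Union


@dataclass
class Progress:  # hypothetical progress update type
    stage: str
    message: str


@dataclass
class IngestionResult:  # hypothetical stand-in; the real fields are not documented here
    document_id: int
    part_count: int


class DocumentSource(Protocol):
    # Assumed method name; yields progress updates, then one final result.
    def stream(self) -> Iterator[Union[Progress, IngestionResult]]: ...


class FakeMarkdownSource:
    """Toy source following the markdown flow: parse -> store -> result."""

    def stream(self) -> Iterator[Union[Progress, IngestionResult]]:
        yield Progress("parse", "Parsing markdown")
        yield Progress("store", "Storing parts")
        yield IngestionResult(document_id=1, part_count=3)


def to_sse(event: str, data: dict) -> str:
    # One SSE message: an event name, a JSON data line, and a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_ingestion(source: DocumentSource) -> Iterator[str]:
    result = None
    for item in source.stream():
        if isinstance(item, IngestionResult):
            result = item
            break
        yield to_sse("progress", {"stage": item.stage, "message": item.message})
    if result is None:
        raise RuntimeError("source finished without an IngestionResult")
    # Embedding work would happen here, after the result has arrived.
    yield to_sse("complete", {"document_id": result.document_id, "parts": result.part_count})
```

Because the service loop depends only on the protocol, supporting a new document type under this design means writing a new source class and leaving the loop unchanged.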
MarkdownSource
Source for markdown/text documents.
Flow: parse → store → yield IngestionResult
PdfSource
Source for PDF documents.
Flow: analyze → extract pages + OCR pages → parse → store → yield IngestionResult
Handles per-page OCR detection internally.
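The criterion PdfSource uses to decide which pages need OCR is not documented here. One common heuristic is to treat a page as scanned when its embedded text layer is nearly empty. The sketch below illustrates that idea only; `PageAnalysis`, `split_pages_for_ocr`, and the `min_chars` threshold are hypothetical names and values, not part of this module.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PageAnalysis:  # hypothetical per-page output of the "analyze" step
    page_number: int
    text_chars: int  # characters found in the page's embedded text layer


def split_pages_for_ocr(
    pages: Iterable[PageAnalysis], min_chars: int = 20
) -> tuple[list[int], list[int]]:
    """Return (pages to extract directly, pages to send through OCR)."""
    extract: list[int] = []
    ocr: list[int] = []
    for page in pages:
        # Assumed rule: too little embedded text means the page is likely a scan.
        target = extract if page.text_chars >= min_chars else ocr
        target.append(page.page_number)
    return extract, ocr
```

For example, with pages carrying 1500, 0, and 12 embedded characters, the first page would be extracted directly and the other two routed to OCR.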
DocxSource
Source for DOCX documents.
Flow: render → parse → store → update bboxes → yield IngestionResult
If a DOCX converter is available, renders the document first for high-fidelity display, then parses and stores.
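The conditional render step can be pictured as a generator whose stages depend on whether a converter was supplied. This is a sketch under assumptions: the text does not say what happens when no converter is available, so the example assumes the render and bbox-update steps are simply skipped. `docx_stages` and its `render` callable are hypothetical.

```python
from typing import Callable, Iterator, Optional


def docx_stages(
    render: Optional[Callable[[bytes], bytes]], data: bytes
) -> Iterator[str]:
    """Yield the names of the stages that would run for one DOCX file."""
    rendered = None
    if render is not None:
        # Render first so the viewer has a high-fidelity version to display.
        rendered = render(data)
        yield "render"
    yield "parse"
    yield "store"
    if rendered is not None:
        # Bounding boxes only make sense when a rendered version exists (assumption).
        yield "update bboxes"
```

With a converter the stages are render, parse, store, update bboxes; without one, only parse and store run.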
match_segments_to_bboxes
match_segments_to_bboxes(segments: Iterable[tuple[str, str]], *, words: list[ExtractedWord], dpi: int, anchor_words: int = 6) -> dict[str, list[dict[str, object]]]
Best-effort mapping: assign each segment (part_id, content) to PDF bboxes.
This function finds where each document part appears in the rendered PDF by matching word sequences. Used to enable highlighting in the document viewer.
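To illustrate matching by word sequence, here is a simplified sketch that locates each segment by its first `anchor_words` words and returns the bounding box around the matched words. It is not the real implementation: the `Word` class is a hypothetical stand-in for `ExtractedWord`, the output dictionary keys are invented, the `dpi` parameter is left out because its role is not described here, and the real function presumably covers more than the anchor region.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Word:  # hypothetical stand-in for ExtractedWord
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def _norm(token: str) -> str:
    # Drop punctuation and case so "Hello," matches "hello".
    return re.sub(r"\W+", "", token).lower()


def match_anchor(
    segments: Iterable[tuple[str, str]], words: list[Word], anchor_words: int = 6
) -> dict[str, list[dict[str, object]]]:
    tokens = [_norm(w.text) for w in words]
    out: dict[str, list[dict[str, object]]] = {}
    for part_id, content in segments:
        anchor = [t for t in (_norm(s) for s in content.split()) if t][:anchor_words]
        if not anchor:
            continue
        n = len(anchor)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == anchor:
                hit = words[i:i + n]
                # Union of the matched word boxes, reported on the first word's page.
                out[part_id] = [{
                    "page": hit[0].page,
                    "x0": min(w.x0 for w in hit),
                    "y0": min(w.y0 for w in hit),
                    "x1": max(w.x1 for w in hit),
                    "y1": max(w.y1 for w in hit),
                }]
                break
    # Best effort: segments whose anchor is never found are left out.
    return out
```

A short anchor keeps matching cheap and tolerant of differences later in the segment, at the cost of ambiguity when several parts open with the same words; this sketch resolves that by taking the first occurrence.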
create_document_source
create_document_source(*, ingester: DocumentIngester, session_id: int, document: DocumentModel, docx_converter: DocxConverterBackend | None = None, db_session: Session | None = None, doc_render_dpi: int = 200) -> DocumentSource
Create the appropriate document source.
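How the factory decides which source is "appropriate" is not specified here. A plausible approach is to dispatch on the document's file type, which the sketch below illustrates using the filename extension. The `pick_source_kind` helper, the extension rules, and the markdown fallback are all assumptions for illustration; the real function may inspect a different attribute of `DocumentModel`.

```python
from pathlib import Path


def pick_source_kind(filename: str) -> str:
    """Return the name of the source class a file would be routed to."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "PdfSource"
    if suffix == ".docx":
        return "DocxSource"
    # Assumed fallback: everything else is treated as markdown/text.
    return "MarkdownSource"
```

Under this assumption, `report.PDF` would map to `PdfSource`, `contract.docx` to `DocxSource`, and `notes.md` to `MarkdownSource`.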