vibe.review.ingestion¶
Document ingestion for VIBE Review.
This module handles:
- Parsing documents (PDF, DOCX, Markdown, HTML) via the 4-layer parsing pipeline
- Computing embeddings
- Storing in database
The actual parsing is delegated to the parsing pipeline in vibe.review.parsing. This module focuses on database storage, embedding computation, and workflow.
IngestionResult ¶
Result of ingesting a document.
EmbeddingProgress ¶
Progress update for embedding computation.
PdfIngestionProgress ¶
Progress update for PDF ingestion.
Inherits from BaseProgress with current and total for page tracking.
Adds parts_found for PDF-specific progress info.
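The relationship described above can be pictured with a small stand-in sketch. The dataclasses below are hypothetical re-creations for illustration only. The field names `current`, `total`, and `parts_found` come from the description. The types, the default value, and the `fraction` helper are assumptions and are not part of the documented API.

```python
from dataclasses import dataclass


@dataclass
class BaseProgress:
    """Stand-in for the real base class: generic position tracking."""

    current: int
    total: int

    def fraction(self) -> float:
        # Hypothetical helper; returns 0.0 when the total is zero.
        return self.current / self.total if self.total else 0.0


@dataclass
class PdfIngestionProgress(BaseProgress):
    """Stand-in: page progress plus the number of parts found so far."""

    parts_found: int = 0


update = PdfIngestionProgress(current=3, total=12, parts_found=27)
print(f"page {update.current}/{update.total}, {update.parts_found} parts")
```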
DocumentIngester ¶
Ingest documents into the review system.
Handles the full pipeline:
1. Parse document content (via parsing pipeline)
2. Segment into parts
3. Compute embeddings
4. Store in database
__init__ ¶
__init__(embedding_provider: EmbeddingProvider | None = None, ocr_backend: OcrBackend | None = None, use_yolo_layout: bool = True, **kwargs: object) -> None
Initialize the ingester.
ingest_file ¶
ingest_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, skip_embeddings: bool = False) -> IngestionResult
Ingest a document from file path into a review session.
Supports PDF, DOCX, Markdown, and HTML files.
ingest_markdown ¶
ingest_markdown(content: str, session_id: int, filename: str, language: str | None = None, metadata: dict[str, Any] | None = None, content_type: str = 'text/markdown', skip_embeddings: bool = False) -> IngestionResult
Ingest a Markdown document into a review session.
ingest_docx_file ¶
ingest_docx_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult
Ingest a DOCX document into a review session.
ingest_pdf_file ¶
ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult
Ingest a PDF document into a review session.
iter_ingest_pdf_file ¶
iter_ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> Iterator[PdfIngestionProgress | IngestionResult]
Ingest a PDF with progress updates.
Yields progress updates during parsing, then the final result.
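Because the iterator mixes two types, a caller typically separates them with `isinstance`. The sketch below simulates that shape with stand-in classes so it runs without the library. `fake_iter_ingest`, `consume`, and the `document_id` field on the stand-in `IngestionResult` are hypothetical names chosen for illustration, not part of the documented API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple, Union


@dataclass
class PdfIngestionProgress:  # stand-in for the documented class
    current: int
    total: int
    parts_found: int


@dataclass
class IngestionResult:  # stand-in; `document_id` is a hypothetical field
    document_id: int


Event = Union[PdfIngestionProgress, IngestionResult]


def fake_iter_ingest(pages: int) -> Iterator[Event]:
    """Simulate the documented shape: progress updates, then one result."""
    for page in range(1, pages + 1):
        yield PdfIngestionProgress(current=page, total=pages, parts_found=page * 2)
    yield IngestionResult(document_id=1)


def consume(events: Iterable[Event]) -> Tuple[List[PdfIngestionProgress], Optional[IngestionResult]]:
    """Collect the progress updates and return them with the final result."""
    updates: List[PdfIngestionProgress] = []
    result: Optional[IngestionResult] = None
    for event in events:
        if isinstance(event, IngestionResult):
            result = event  # the last item yielded
        else:
            updates.append(event)  # e.g. refresh a progress bar here
    return updates, result


progress_updates, final = consume(fake_iter_ingest(pages=3))
```

With the real ingester, the same loop would presumably wrap `ingester.iter_ingest_pdf_file(...)` in place of the simulated generator.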
iter_compute_part_embeddings ¶
iter_compute_part_embeddings(document_id: int) -> Iterator[EmbeddingProgress | str]
Compute embeddings for document parts, yielding progress after each batch.
Usage
    for progress in ingester.iter_compute_part_embeddings(doc_id):
        if isinstance(progress, str):
            # Error occurred
            handle_error(progress)
        else:
            # Progress update
            update_ui(progress.current, progress.total)
get_document_parts ¶
get_document_parts(document_id: int) -> list[DocumentPartModel]
Retrieve all parts of a document.
is_windows_zone_identifier_sidecar ¶
is_windows_zone_identifier_sidecar(filename: str) -> bool
Return True when a filename is a Windows Zone.Identifier sidecar.
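For context, Windows records a download's origin in an alternate data stream named `Zone.Identifier`. When such files are copied to a filesystem without alternate data streams (for example through WSL), the stream commonly appears as a separate file named `<name>:Zone.Identifier`. The sketch below shows one way such a check could look, assuming a simple case-insensitive suffix match. The library's actual rule is not shown on this page, and `looks_like_zone_identifier_sidecar` is a hypothetical name.

```python
def looks_like_zone_identifier_sidecar(filename: str) -> bool:
    """Hypothetical check: does the name end with ':Zone.Identifier'?"""
    # Assumption: match the suffix case-insensitively.
    return filename.lower().endswith(":zone.identifier")


for name in ("report.pdf:Zone.Identifier", "report.pdf"):
    print(name, looks_like_zone_identifier_sidecar(name))
```

A filter like this is useful when walking an upload directory, so that sidecar files are skipped rather than passed to a parser.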
detect_language_from_filename ¶
detect_language_from_filename(filename: str) -> str | None
Detect language from filename by looking for a 2-letter ISO 639-1 code in the last segment before the file extension.
Segments are separated by non-alphabetic characters (dots, underscores, hyphens, etc.). For example:
- doc.en.txt -> 'en'
- contract_sv.pdf -> 'sv'
- report-de.docx -> 'de'
- katten.doc -> None (no 2-letter segment before extension)
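The rule described above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the library's implementation: `sketch_detect_language` is a hypothetical name, and the sketch accepts any 2-letter alphabetic segment without checking it against the ISO 639-1 code list, which the real function may well do.

```python
import os
import re
from typing import Optional


def sketch_detect_language(filename: str) -> Optional[str]:
    """Return the last 2-letter segment before the extension, or None."""
    # Drop any directory part and the final extension.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split on runs of non-alphabetic characters and discard empty pieces.
    segments = [s for s in re.split(r"[^A-Za-z]+", stem) if s]
    # Assumption: only the length is checked, not ISO 639-1 membership.
    if segments and len(segments[-1]) == 2:
        return segments[-1].lower()
    return None


for name in ("doc.en.txt", "contract_sv.pdf", "report-de.docx", "katten.doc"):
    print(name, "->", sketch_detect_language(name))
```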