vibe.review.ingestion¶

Document ingestion for VIBE Review.

This module handles:

- Parsing documents (PDF, DOCX, Markdown, HTML) via the 4-layer parsing pipeline
- Computing embeddings
- Storing in database

The actual parsing is delegated to the parsing pipeline in vibe.review.parsing. This module focuses on database storage, embedding computation, and workflow.

IngestionResult ¶

Result of ingesting a document.

EmbeddingProgress ¶

Progress update for embedding computation.

to_dict ¶

to_dict() -> dict[str, Any]

Serialize progress to dictionary for JSON response.

PdfIngestionProgress ¶

Progress update for PDF ingestion.

Inherits from BaseProgress with current and total for page tracking. Adds parts_found for PDF-specific progress info.

to_dict ¶

to_dict() -> dict[str, Any]

Serialize PDF progress including parts_found count.
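The relationship between the progress classes can be pictured with a small sketch. The class names below are hypothetical stand-ins rather than the module's own definitions, and the dictionary keys are assumed to mirror the field names `current`, `total`, and `parts_found` mentioned above; the real `to_dict` output may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class BaseProgressSketch:
    """Hypothetical stand-in for BaseProgress: units done vs. units expected."""

    current: int
    total: int

    def to_dict(self) -> dict[str, Any]:
        # asdict() also picks up fields declared by subclasses.
        return asdict(self)


@dataclass
class PdfProgressSketch(BaseProgressSketch):
    """Hypothetical stand-in for PdfIngestionProgress."""

    parts_found: int = 0


progress = PdfProgressSketch(current=3, total=10, parts_found=42)

# The dictionary form can be passed straight to a JSON encoder.
payload = json.dumps(progress.to_dict())
```

Keeping serialization on the base class means a subclass only has to declare its extra fields for them to appear in the JSON payload.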

DocumentIngester ¶

Ingest documents into the review system.

Handles the full pipeline:

1. Parse document content (via parsing pipeline)
2. Segment into parts
3. Compute embeddings
4. Store in database

__init__ ¶

__init__(embedding_provider: EmbeddingProvider | None = None, ocr_backend: OcrBackend | None = None, use_yolo_layout: bool = True, **kwargs: object) -> None

Initialize the ingester.

Parameters:

embedding_provider (EmbeddingProvider | None, default: None ) –

Optional provider for computing embeddings
ocr_backend (OcrBackend | None, default: None ) –

Optional OCR backend for scanned pages
use_yolo_layout (bool, default: True ) –

Whether to use YOLO model for layout detection. Set to False for faster processing (uses rule-based detection).
**kwargs (object, default: {} ) –

Additional keyword arguments, accepted for backwards compatibility and ignored

ingest_file ¶

ingest_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a document from file path into a review session.

Supports PDF, DOCX, Markdown, and HTML files.

Parameters:

file_path (str) –

Path to the document
session_id (int) –

ID of the review session this document belongs to
language (str | None, default: None ) –

Optional language code
metadata (dict[str, Any] | None, default: None ) –

Optional additional metadata
skip_embeddings (bool, default: False ) –

Skip embedding computation

Returns:	`IngestionResult` – IngestionResult
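As a usage illustration, the sketch below ingests several files into one session while deferring embedding computation. Only the `ingest_file` signature is taken from this page; the `IngesterLike` protocol, the `ingest_many` helper, and the suffix list are hypothetical, and the suffixes are an assumption about which extensions correspond to the supported formats.

```python
from typing import Any, Iterable, Optional, Protocol


class IngesterLike(Protocol):
    """Structural type mirroring the documented ingest_file signature."""

    def ingest_file(
        self,
        file_path: str,
        session_id: int,
        language: Optional[str] = None,
        metadata: Optional[dict[str, Any]] = None,
        skip_embeddings: bool = False,
    ) -> Any: ...


# Assumed extensions for PDF, DOCX, Markdown, and HTML inputs.
SUPPORTED_SUFFIXES = (".pdf", ".docx", ".md", ".html")


def ingest_many(
    ingester: IngesterLike, paths: Iterable[str], session_id: int
) -> dict[str, Any]:
    """Ingest each supported path and return the results keyed by path."""
    results: dict[str, Any] = {}
    for path in paths:
        if not path.lower().endswith(SUPPORTED_SUFFIXES):
            continue  # leave unsupported files untouched
        # Embeddings are skipped here so they can be computed in a later pass.
        results[path] = ingester.ingest_file(
            path, session_id, skip_embeddings=True
        )
    return results
```

With `skip_embeddings=True`, embeddings could then be computed per document with `iter_compute_part_embeddings`, which reports progress batch by batch.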

ingest_markdown ¶

ingest_markdown(content: str, session_id: int, filename: str, language: str | None = None, metadata: dict[str, Any] | None = None, content_type: str = 'text/markdown', skip_embeddings: bool = False) -> IngestionResult

Ingest a Markdown document into a review session.

Parameters:

content (str) –

Markdown content
session_id (int) –

ID of the review session this document belongs to
filename (str) –

Original filename
language (str | None, default: None ) –

ISO 639-1 language code (e.g., "en", "sv")
metadata (dict[str, Any] | None, default: None ) –

Additional metadata
content_type (str, default: 'text/markdown' ) –

MIME type
skip_embeddings (bool, default: False ) –

Skip embedding computation

Returns:	`IngestionResult` – IngestionResult with statistics

ingest_docx_file ¶

ingest_docx_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a DOCX document into a review session.

Parameters:

file_path (str) –

Path to DOCX file
session_id (int) –

Review session ID
language (str | None, default: None ) –

Optional language code
metadata (dict[str, Any] | None, default: None ) –

Optional metadata
filename (str | None, default: None ) –

Override filename
skip_embeddings (bool, default: False ) –

Skip embedding computation

Returns:	`IngestionResult` – IngestionResult

ingest_pdf_file ¶

ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a PDF document into a review session.

Parameters:

file_path (str) –

Path to PDF file
session_id (int) –

Review session ID
language (str | None, default: None ) –

Optional language code
metadata (dict[str, Any] | None, default: None ) –

Optional metadata
filename (str | None, default: None ) –

Override filename
skip_embeddings (bool, default: False ) –

Skip embedding computation

Returns:	`IngestionResult` – IngestionResult

iter_ingest_pdf_file ¶

iter_ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> Iterator[PdfIngestionProgress | IngestionResult]

Ingest a PDF with progress updates.

Yields progress updates during parsing, then the final result.

Parameters:

file_path (str) –

Path to PDF file
session_id (int) –

Review session ID
language (str | None, default: None ) –

Optional language code
metadata (dict[str, Any] | None, default: None ) –

Optional metadata
filename (str | None, default: None ) –

Override filename
skip_embeddings (bool, default: False ) –

Skip embedding computation

Yields:	`PdfIngestionProgress | IngestionResult` – PdfIngestionProgress during processing, then IngestionResult when complete
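Because the iterator mixes progress updates with the final result, a caller has to tell the two apart by type. The sketch below shows one way to do that. `ProgressStandIn`, `ResultStandIn`, its `document_id` field, and `fake_events` are hypothetical stand-ins so the example runs without the package; with the real classes the `isinstance` check would target `PdfIngestionProgress`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Union


@dataclass
class ProgressStandIn:
    """Hypothetical stand-in for PdfIngestionProgress."""

    current: int
    total: int
    parts_found: int


@dataclass
class ResultStandIn:
    """Hypothetical stand-in for IngestionResult (field name is assumed)."""

    document_id: int


def drain(
    events: Iterable[object],
    on_progress: Callable[[ProgressStandIn], None],
) -> Optional[object]:
    """Forward progress events to a callback and return the final result."""
    result = None
    for event in events:
        if isinstance(event, ProgressStandIn):
            on_progress(event)
        else:
            result = event  # the non-progress item is the final result
    return result


def fake_events() -> Iterator[Union[ProgressStandIn, ResultStandIn]]:
    """Simulated event stream: two page updates, then the result."""
    yield ProgressStandIn(current=1, total=2, parts_found=5)
    yield ProgressStandIn(current=2, total=2, parts_found=9)
    yield ResultStandIn(document_id=1)


seen: list[ProgressStandIn] = []
final = drain(fake_events(), seen.append)
```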

iter_compute_part_embeddings ¶

iter_compute_part_embeddings(document_id: int) -> Iterator[EmbeddingProgress | str]

Compute embeddings for document parts, yielding progress after each batch.

Yields:	`EmbeddingProgress | str` – EmbeddingProgress for each batch processed; a str error message as the final yield if embedding fails

Usage

    for progress in ingester.iter_compute_part_embeddings(doc_id):
        if isinstance(progress, str):
            # Error occurred
            handle_error(progress)
        else:
            # Progress update
            update_ui(progress.current, progress.total)

get_document_parts ¶

get_document_parts(document_id: int) -> list[DocumentPartModel]

Retrieve all parts of a document.

Parameters:	`document_id` (`int`) – Document ID

Returns:	`list[DocumentPartModel]` – List of DocumentPartModel instances (detached from session)

is_windows_zone_identifier_sidecar ¶

is_windows_zone_identifier_sidecar(filename: str) -> bool

Return True when a filename is a Windows Zone.Identifier sidecar.
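Zone.Identifier sidecars typically show up when files downloaded on Windows are copied to another filesystem, where the alternate data stream becomes a separate file named like `report.pdf:Zone.Identifier`. A minimal sketch of such a check follows. The function name is hypothetical and the suffix match is an assumption, so the module's own rule may be stricter or looser.

```python
def looks_like_zone_identifier_sidecar(filename: str) -> bool:
    """Return True for names such as 'report.pdf:Zone.Identifier'.

    Assumption: a case-insensitive suffix match is enough to identify
    the sidecar; this is not the module's actual implementation.
    """
    return filename.lower().endswith(":zone.identifier")
```

A check like this is useful before ingestion so that sidecar files in an upload folder are skipped rather than parsed as documents.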

detect_language_from_filename ¶

detect_language_from_filename(filename: str) -> str | None

Detect language from a filename by looking for a 2-letter ISO 639-1 code in the last segment before the file extension.

Segments are separated by non-alphabetic characters (dots, underscores, hyphens, etc.). For example:

- doc.en.txt -> 'en'
- contract_sv.pdf -> 'sv'
- report-de.docx -> 'de'
- katten.doc -> None (no 2-letter segment before extension)

Parameters:	`filename` (`str`) – The filename to analyze

Returns:	`str | None` – ISO 639-1 language code or None if not detected
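The rule described above can be approximated in a few lines. This is a sketch under stated assumptions, not the module's implementation: the function name is hypothetical, the result is lowercased, and any 2-letter alphabetic segment is accepted without checking it against the list of valid ISO 639-1 codes.

```python
import os
import re
from typing import Optional


def sketch_detect_language(filename: str) -> Optional[str]:
    """Return the 2-letter segment just before the extension, if any."""
    # Drop directories and the final extension: 'doc.en.txt' -> 'doc.en'.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split on runs of non-alphabetic characters and drop empty pieces.
    segments = [s for s in re.split(r"[^A-Za-z]+", stem) if s]
    if segments and len(segments[-1]) == 2:
        return segments[-1].lower()
    return None
```

Applied to the examples listed above, the sketch gives 'en', 'sv', and 'de' for the first three filenames and None for katten.doc.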
