vibe.review.ingestion

Document ingestion for VIBE Review.

This module handles:

- Parsing documents (PDF, DOCX, Markdown, HTML) via the 4-layer parsing pipeline
- Computing embeddings
- Storing in database

The actual parsing is delegated to the parsing pipeline in vibe.review.parsing. This module focuses on database storage, embedding computation, and workflow.

IngestionResult

Result of ingesting a document.

EmbeddingProgress

Progress update for embedding computation.

to_dict

to_dict() -> dict[str, Any]

Serialize progress to dictionary for JSON response.

PdfIngestionProgress

Progress update for PDF ingestion.

Inherits from BaseProgress with current and total for page tracking. Adds parts_found for PDF-specific progress info.

to_dict

to_dict() -> dict[str, Any]

Serialize PDF progress including parts_found count.
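
The relationship between the progress classes can be pictured with a small stand-in. This is a minimal sketch, not the module's actual code: only the names BaseProgress, PdfIngestionProgress, current, total, parts_found, and to_dict come from the descriptions above; the dataclass layout and the exact dictionary keys are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class BaseProgress:
    """Hypothetical stand-in; the real base class may carry more fields."""

    current: int
    total: int

    def to_dict(self) -> dict[str, Any]:
        # Assumed key names, mirroring the attribute names.
        return {"current": self.current, "total": self.total}


@dataclass
class PdfIngestionProgress(BaseProgress):
    """Adds the PDF-specific parts_found counter on top of page tracking."""

    parts_found: int = 0

    def to_dict(self) -> dict[str, Any]:
        data = super().to_dict()
        data["parts_found"] = self.parts_found
        return data


progress = PdfIngestionProgress(current=3, total=10, parts_found=42)
payload = json.dumps(progress.to_dict())  # ready to send as a JSON response body
```

Keeping to_dict on the base class and extending it in the subclass means every progress type serializes the shared counters the same way.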

DocumentIngester

Ingest documents into the review system.

Handles the full pipeline:

1. Parse document content (via parsing pipeline)
2. Segment into parts
3. Compute embeddings
4. Store in database

__init__

__init__(embedding_provider: EmbeddingProvider | None = None, ocr_backend: OcrBackend | None = None, use_yolo_layout: bool = True, **kwargs: object) -> None

Initialize the ingester.

Parameters:
  • embedding_provider (EmbeddingProvider | None, default: None ) –

    Optional provider for computing embeddings

  • ocr_backend (OcrBackend | None, default: None ) –

    Optional OCR backend for scanned pages

  • use_yolo_layout (bool, default: True ) –

    Whether to use the YOLO model for layout detection. Set to False for faster processing, which falls back to rule-based detection.

  • **kwargs (object, default: {} ) –

    Additional keyword arguments, accepted and ignored for backwards compatibility

ingest_file

ingest_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a document from file path into a review session.

Supports PDF, DOCX, Markdown, and HTML files.

Parameters:
  • file_path (str) –

    Path to the document

  • session_id (int) –

    ID of the review session this document belongs to

  • language (str | None, default: None ) –

    Optional language code

  • metadata (dict[str, Any] | None, default: None ) –

    Optional additional metadata

  • skip_embeddings (bool, default: False ) –

    Skip embedding computation

Returns:
  • IngestionResult

    Result of ingesting the document
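
Because ingest_file accepts several formats, a caller may want to check a path before submitting it. The helper below is a hypothetical pre-check, not part of this module: the function name and the extension table are assumptions, and the real pipeline may recognise a different set of suffixes.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file suffix to the four formats listed above.
_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".md": "markdown",
    ".markdown": "markdown",
    ".html": "html",
    ".htm": "html",
}


def guess_format(file_path: str) -> Optional[str]:
    """Return a format label for a supported suffix, or None otherwise."""
    return _FORMATS.get(Path(file_path).suffix.lower())
```

A caller could skip files for which guess_format returns None instead of letting ingestion fail later.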

ingest_markdown

ingest_markdown(content: str, session_id: int, filename: str, language: str | None = None, metadata: dict[str, Any] | None = None, content_type: str = 'text/markdown', skip_embeddings: bool = False) -> IngestionResult

Ingest a Markdown document into a review session.

Parameters:
  • content (str) –

    Markdown content

  • session_id (int) –

    ID of the review session this document belongs to

  • filename (str) –

    Original filename

  • language (str | None, default: None ) –

    ISO 639-1 language code (e.g., "en", "sv")

  • metadata (dict[str, Any] | None, default: None ) –

    Additional metadata

  • content_type (str, default: 'text/markdown' ) –

    MIME type

  • skip_embeddings (bool, default: False ) –

    Skip embedding computation

Returns:
  • IngestionResult

    Result of ingesting the document

ingest_docx_file

ingest_docx_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a DOCX document into a review session.

Parameters:
  • file_path (str) –

    Path to DOCX file

  • session_id (int) –

    Review session ID

  • language (str | None, default: None ) –

    Optional language code

  • metadata (dict[str, Any] | None, default: None ) –

    Optional metadata

  • filename (str | None, default: None ) –

    Override filename

  • skip_embeddings (bool, default: False ) –

    Skip embedding computation

Returns:
  • IngestionResult

    Result of ingesting the document

ingest_pdf_file

ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> IngestionResult

Ingest a PDF document into a review session.

Parameters:
  • file_path (str) –

    Path to PDF file

  • session_id (int) –

    Review session ID

  • language (str | None, default: None ) –

    Optional language code

  • metadata (dict[str, Any] | None, default: None ) –

    Optional metadata

  • filename (str | None, default: None ) –

    Override filename

  • skip_embeddings (bool, default: False ) –

    Skip embedding computation

Returns:
  • IngestionResult

    Result of ingesting the document

iter_ingest_pdf_file

iter_ingest_pdf_file(file_path: str, session_id: int, language: str | None = None, metadata: dict[str, Any] | None = None, filename: str | None = None, skip_embeddings: bool = False) -> Iterator[PdfIngestionProgress | IngestionResult]

Ingest a PDF with progress updates.

Yields progress updates during parsing, then the final result.

Parameters:
  • file_path (str) –

    Path to PDF file

  • session_id (int) –

    Review session ID

  • language (str | None, default: None ) –

    Optional language code

  • metadata (dict[str, Any] | None, default: None ) –

    Optional metadata

  • filename (str | None, default: None ) –

    Override filename

  • skip_embeddings (bool, default: False ) –

    Skip embedding computation

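Yields:
  • PdfIngestionProgress | IngestionResult

    Progress updates during parsing, followed by the final result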
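
Since the iterator mixes progress updates with the final result, a consumer has to tell the two apart by type. The following sketch shows one way to do that against stand-in classes. Everything here is illustrative: the fake generator, the consume helper, and the document_id field on IngestionResult are assumptions, as the real result fields are not documented in this section.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Union


@dataclass
class PdfIngestionProgress:  # stand-in with the documented counters
    current: int
    total: int
    parts_found: int


@dataclass
class IngestionResult:  # stand-in; document_id is a hypothetical field
    document_id: int


Event = Union[PdfIngestionProgress, IngestionResult]


def fake_iter_ingest_pdf_file(pages: int) -> Iterator[Event]:
    """Mimic the documented shape: progress per page, then one result."""
    parts = 0
    for page in range(1, pages + 1):
        parts += 2  # pretend each page contributes two parts
        yield PdfIngestionProgress(current=page, total=pages, parts_found=parts)
    yield IngestionResult(document_id=1)


def consume(events: Iterable[Event]) -> tuple[list[str], Optional[IngestionResult]]:
    """Collect progress messages and capture the final result."""
    log: list[str] = []
    result: Optional[IngestionResult] = None
    for event in events:
        if isinstance(event, IngestionResult):
            result = event
        else:
            log.append(f"page {event.current}/{event.total}, {event.parts_found} parts")
    return log, result


log, result = consume(fake_iter_ingest_pdf_file(pages=3))
```

With the real ingester, the same loop would wrap ingester.iter_ingest_pdf_file(file_path, session_id) in place of the fake generator.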

iter_compute_part_embeddings

iter_compute_part_embeddings(document_id: int) -> Iterator[EmbeddingProgress | str]

Compute embeddings for document parts, yielding progress after each batch.

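Yields:
  • EmbeddingProgress | str

    A progress update after each batch, or an error message string if an error occurred

Usage

    for progress in ingester.iter_compute_part_embeddings(doc_id):
        if isinstance(progress, str):
            # Error occurred
            handle_error(progress)
        else:
            # Progress update
            update_ui(progress.current, progress.total)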
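
The "progress after each batch" pattern can be sketched as a plain generator. This is an illustration under stated assumptions rather than the module's implementation: the function name, the batch size, the embed callable, the error message format, and the choice to stop after the first error are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence, Union


@dataclass
class EmbeddingProgress:  # stand-in exposing the counters used in the usage loop
    current: int
    total: int


def iter_embed_in_batches(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], object],
    batch_size: int = 2,
) -> Iterator[Union[EmbeddingProgress, str]]:
    """Embed texts batch by batch, yielding progress or an error string."""
    total = len(texts)
    for start in range(0, total, batch_size):
        batch = texts[start:start + batch_size]
        try:
            embed(batch)
        except Exception as exc:
            # Assumption: report the error as a string and stop iterating.
            yield f"embedding failed: {exc}"
            return
        yield EmbeddingProgress(current=min(start + batch_size, total), total=total)


def toy_embed(batch: Sequence[str]) -> list[list[float]]:
    """Toy embedding: one number per text, its length."""
    return [[float(len(text))] for text in batch]


events = list(iter_embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], toy_embed))
```

Yielding after each batch, rather than after each part, keeps the number of UI updates proportional to the number of provider calls.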

get_document_parts

get_document_parts(document_id: int) -> list[DocumentPartModel]

Retrieve all parts of a document.

Parameters:
  • document_id (int) –

    Document ID

Returns:
  • list[DocumentPartModel]

    All parts of the document

is_windows_zone_identifier_sidecar

is_windows_zone_identifier_sidecar(filename: str) -> bool

Return True when a filename is a Windows Zone.Identifier sidecar.
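
For context, files copied from Windows into a Linux or WSL filesystem are often accompanied by a sidecar named like report.pdf:Zone.Identifier, which holds download-origin metadata rather than document content. A check for that naming could look like the sketch below. The function name and the case-insensitive suffix match are assumptions; the actual function may apply different or additional rules.

```python
def looks_like_zone_identifier_sidecar(filename: str) -> bool:
    """Hypothetical check: True if the name ends with ':Zone.Identifier'."""
    return filename.lower().endswith(":zone.identifier")
```

Filtering these names out before ingestion avoids treating the metadata sidecar as a document in its own right.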

detect_language_from_filename

detect_language_from_filename(filename: str) -> str | None

Detect language from filename by looking for a 2-letter ISO 639-1 code in the last segment before the file extension.

Segments are separated by non-alphabetic characters (dots, underscores, hyphens, etc.). For example:

- doc.en.txt -> 'en'
- contract_sv.pdf -> 'sv'
- report-de.docx -> 'de'
- katten.doc -> None (no 2-letter segment before extension)

Parameters:
  • filename (str) –

    The filename to analyze

Returns:
  • str | None

    ISO 639-1 language code or None if not detected
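
The rule described above can be reproduced in a few lines. This is a sketch consistent with the four documented examples, not the module's actual code: the function name is hypothetical, segments are split on non-ASCII-letter characters, the result is lowercased, and no check is made that the two letters form a registered ISO 639-1 code.

```python
import os
import re
from typing import Optional


def guess_language_from_filename(filename: str) -> Optional[str]:
    """Return the last alphabetic segment of the stem if it has two letters."""
    stem, _extension = os.path.splitext(filename)
    # Split on runs of characters that are not ASCII letters.
    segments = re.split(r"[^A-Za-z]+", stem)
    last = segments[-1]
    if len(last) == 2:
        return last.lower()
    return None
```

Only the final segment is inspected, so a 2-letter token earlier in the name does not count as a language code.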