vibe.review.hybrid_search

Unified hybrid search for document and reference parts.

Combines BM25 keyword search with embedding similarity search, using Reciprocal Rank Fusion (RRF) for ranking. Supports both:
  • DocumentPartModel (contract parts being reviewed)
  • ReferencePartModel (regulatory reference sources)

EmbeddingDimensionMismatchWarning

Warning issued when query and stored embeddings have different dimensions.

EmbeddingDimensionMismatchError

Error raised when embedding dimensions are incompatible.

__init__

__init__(stored_dim: int, provider_dim: int, part_count: int) -> None

Initialize with stored and provider dimensions and count of affected parts.

SearchResult

A single search result with scoring information.

semantic_rank

semantic_rank: int | None

Alias for embedding_rank (semantic search).

SearchResults

Collection of search results with metadata.

semantic_count

semantic_count: int

Alias for embedding_count (semantic search).

total_candidates

total_candidates: int

Total number of candidates from BM25 and embedding search.

PartSearchStrategy

Abstract strategy for searching a specific part model.

Subclasses define model-specific behavior for content access, filtering, and BM25 search approach.

model_class

model_class: type[T]

The SQLAlchemy model class to search.

get_content

get_content(part: T) -> str

Get the text content from a part.

get_content_column

get_content_column() -> object

Get the SQLAlchemy column for content.

get_embedding_column

get_embedding_column() -> Any

Get the SQLAlchemy column for embeddings.

apply_filters

apply_filters(query: Any, **kwargs: object) -> object

Apply model-specific filters to the query.

bm25_search

bm25_search(session: Session, query_text: str, base_query: Any, limit: int, language: str | None) -> list[SearchResult[T]]

Perform BM25-style keyword search.
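
The strategy interface described above can be illustrated with a small, database-free sketch. Everything here is hypothetical: `ToyStrategy`, `ToyDocumentPart`, and `ToyDocumentStrategy` are stand-ins that filter plain Python lists rather than SQLAlchemy queries, and they cover only a subset of the methods listed in this section.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


class ToyStrategy(ABC, Generic[T]):
    """Hypothetical, simplified analogue of PartSearchStrategy."""

    @property
    @abstractmethod
    def model_class(self) -> type[T]:
        """The model class this strategy searches."""

    @abstractmethod
    def get_content(self, part: T) -> str:
        """Return the searchable text of a part."""

    @abstractmethod
    def apply_filters(self, parts: list[T], **kwargs: object) -> list[T]:
        """Apply model-specific filters (here: to a list, not a query)."""


@dataclass
class ToyDocumentPart:
    """Hypothetical stand-in for a document part row."""

    document_id: int
    content: str


class ToyDocumentStrategy(ToyStrategy[ToyDocumentPart]):
    @property
    def model_class(self) -> type[ToyDocumentPart]:
        return ToyDocumentPart

    def get_content(self, part: ToyDocumentPart) -> str:
        return part.content

    def apply_filters(
        self, parts: list[ToyDocumentPart], **kwargs: object
    ) -> list[ToyDocumentPart]:
        # Filter by document_id only when the caller provides one.
        document_id = kwargs.get("document_id")
        if document_id is None:
            return parts
        return [p for p in parts if p.document_id == document_id]


strategy = ToyDocumentStrategy()
parts = [ToyDocumentPart(1, "audit rights"), ToyDocumentPart(2, "termination")]
filtered = strategy.apply_filters(parts, document_id=2)
```

Keeping model-specific details behind one interface is what lets a single searcher class serve both document and reference parts.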

DocumentPartStrategy

Search strategy for document parts (contracts being reviewed).

model_class

model_class: type[DocumentPartModel]

Return DocumentPartModel as the searchable model class.

get_content

get_content(part: DocumentPartModel) -> str

Extract text content from a document part.

get_content_column

get_content_column() -> object

Return the content column for BM25 text search.

get_embedding_column

get_embedding_column() -> Any

Return the embedding column for vector similarity search.

apply_filters

apply_filters(query: Any, **kwargs: object) -> object

Filter query by document_id if provided.

bm25_search

bm25_search(session: Session, query_text: str, base_query: Any, limit: int, language: str | None) -> list[SearchResult[DocumentPartModel]]

Use ParadeDB BM25 search.

ReferencePartStrategy

Search strategy for reference parts (regulatory sources).

model_class

model_class: type[ReferencePartModel]

Return ReferencePartModel as the searchable model class.

get_content

get_content(part: ReferencePartModel) -> str

Extract text content from a reference part.

get_content_column

get_content_column() -> object

Return the text column for BM25 text search.

get_embedding_column

get_embedding_column() -> Any

Return the embedding column for vector similarity search.

apply_filters

apply_filters(query: Any, **kwargs: object) -> object

Filter query by language and/or source_id, joining with ReferenceSourceModel.

bm25_search

bm25_search(session: Session, query_text: str, base_query: Any, limit: int, language: str | None) -> list[SearchResult[ReferencePartModel]]

Use ParadeDB BM25 search.

HybridSearcher

Hybrid searcher combining BM25 and embedding similarity search.

Uses Reciprocal Rank Fusion (RRF) to combine results from keyword and vector search for improved retrieval quality.
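
As a rough illustration of the fusion step, the sketch below applies the commonly used weighted RRF formula, in which each ranked list contributes `weight / (rrf_k + rank)` to a candidate's score. The function `rrf_fuse` and its inputs are hypothetical; whether HybridSearcher computes scores in exactly this way is an assumption, not something this page confirms.

```python
from collections import defaultdict


def rrf_fuse(
    bm25_ids: list[int],
    embedding_ids: list[int],
    bm25_weight: float = 0.5,
    embedding_weight: float = 0.5,
    rrf_k: int = 60,
) -> list[tuple[int, float]]:
    """Fuse two ranked ID lists with weighted Reciprocal Rank Fusion.

    Hypothetical helper: ranks are 1-based, and a candidate missing
    from one list simply receives no contribution from that list.
    """
    scores: dict[int, float] = defaultdict(float)
    for rank, part_id in enumerate(bm25_ids, start=1):
        scores[part_id] += bm25_weight / (rrf_k + rank)
    for rank, part_id in enumerate(embedding_ids, start=1):
        scores[part_id] += embedding_weight / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Part 20 ranks well in both lists, so it ends up ahead of part 10.
fused = rrf_fuse(bm25_ids=[10, 20, 30], embedding_ids=[20, 40, 10])
fused_ids = [part_id for part_id, _ in fused]
```

Because only ranks enter the formula, BM25 scores and vector distances never need to be normalized onto a common scale.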

Usage

For document parts:

    searcher = HybridSearcher(session, DocumentPartStrategy())
    results = searcher.search("audit rights", document_id=1)

For reference parts:

    searcher = HybridSearcher(session, ReferencePartStrategy())
    results = searcher.search("ICT risk management", language="en")

__init__

__init__(session: Session, strategy: PartSearchStrategy[T], embedding_provider: EmbeddingProvider | None = None, rrf_k: int = 60) -> None

Initialize the searcher.

Parameters:
  • session (Session) –

    SQLAlchemy session.

  • strategy (PartSearchStrategy[T]) –

    Search strategy for the target model.

  • embedding_provider (EmbeddingProvider | None, default: None ) –

    Provider for query embeddings.

  • rrf_k (int, default: 60 ) –

    RRF smoothing constant k. Larger values narrow the score gap between adjacent ranks.

search

search(query: str, limit: int = 50, bm25_weight: float = 0.5, embedding_weight: float = 0.5, bm25_limit: int = 100, embedding_limit: int = 100, language: str | None = None, **filter_kwargs: object) -> SearchResults[T]

Perform hybrid search.

Parameters:
  • query (str) –

    Search query text.

  • limit (int, default: 50 ) –

    Number of results to return.

  • bm25_weight (float, default: 0.5 ) –

    Weight for BM25 in RRF.

  • embedding_weight (float, default: 0.5 ) –

    Weight for embedding in RRF.

  • bm25_limit (int, default: 100 ) –

    Max BM25 candidates.

  • embedding_limit (int, default: 100 ) –

    Max embedding candidates.

  • language (str | None, default: None ) –

    Language for text search.

  • **filter_kwargs (object, default: {} ) –

    Model-specific filters (document_id, source_id, etc.)

Returns:
  • SearchResults[T]

    Collection of search results with metadata.

sanitize_bm25_query

sanitize_bm25_query(query: str) -> str

Sanitize a query string for ParadeDB BM25 search.

ParadeDB/Tantivy interprets certain characters as query operators. This function escapes all special characters to enable literal text search.

Special characters that need escaping: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
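
One way to escape the characters listed above is a single regular-expression substitution that prefixes each one with a backslash. This is a minimal sketch under assumptions: `escape_bm25_query` is a hypothetical name, and escaping `&` and `|` individually (rather than only as the pairs `&&` and `||`) is a simplification that may differ from what sanitize_bm25_query actually does.

```python
import re

# Every character from the list above; & and | are escaped one at a time.
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_bm25_query(query: str) -> str:
    """Prefix each query-operator character with a backslash."""
    return _SPECIAL_CHARS.sub(r"\\\1", query)


escaped = escape_bm25_query('audit (rights): "ICT"')
print(escaped)
```

Plain words and spaces pass through unchanged, so ordinary queries are unaffected.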

get_stored_embedding_dimension

get_stored_embedding_dimension(session: Session, *, document_id: int | None = None, session_id: int | None = None) -> tuple[int | None, int]

Get the dimension of stored embeddings.

Parameters:
  • session (Session) –

    Database session.

  • document_id (int | None, default: None ) –

    Optional document ID to filter by.

  • session_id (int | None, default: None ) –

    Optional session ID to filter by (checks all docs in session).

Returns:
  • tuple[int | None, int]

    Tuple of (dimension, count) where dimension is None if no embeddings exist.

check_embedding_dimension_compatibility

check_embedding_dimension_compatibility(session: Session, provider_dim: int, *, document_id: int | None = None, session_id: int | None = None, raise_on_mismatch: bool = True) -> bool

Check if provider dimension is compatible with stored embeddings.

Parameters:
  • session (Session) –

    Database session.

  • provider_dim (int) –

    Dimension of the embedding provider.

  • document_id (int | None, default: None ) –

    Optional document ID to filter by.

  • session_id (int | None, default: None ) –

    Optional session ID to filter by.

  • raise_on_mismatch (bool, default: True ) –

    If True, raise EmbeddingDimensionMismatchError on mismatch.

Returns:
  • bool

    True if compatible (or no stored embeddings), False if mismatch.

Raises:
  • EmbeddingDimensionMismatchError

    If the dimensions differ and raise_on_mismatch is True.
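
The decision logic documented for this function can be sketched without a database. The sketch assumes the `(dimension, count)` tuple returned by get_stored_embedding_dimension is passed in directly. `DimensionMismatchError` and `is_compatible` are hypothetical stand-ins, not the module's actual classes or code.

```python
from __future__ import annotations


class DimensionMismatchError(Exception):
    """Hypothetical stand-in for EmbeddingDimensionMismatchError."""

    def __init__(self, stored_dim: int, provider_dim: int, part_count: int) -> None:
        self.stored_dim = stored_dim
        self.provider_dim = provider_dim
        self.part_count = part_count
        super().__init__(
            f"{part_count} parts store {stored_dim}-dim embeddings, "
            f"but the provider produces {provider_dim}-dim vectors"
        )


def is_compatible(
    stored: tuple[int | None, int],
    provider_dim: int,
    raise_on_mismatch: bool = True,
) -> bool:
    """Return True when no embeddings are stored or the dimensions match."""
    stored_dim, part_count = stored
    if stored_dim is None or stored_dim == provider_dim:
        return True
    if raise_on_mismatch:
        raise DimensionMismatchError(stored_dim, provider_dim, part_count)
    return False


# With raise_on_mismatch=False a mismatch is reported as False instead.
ok = is_compatible((768, 12), 1536, raise_on_mismatch=False)
```

A caller that prefers to recover rather than fail could pass `raise_on_mismatch=False` and then clear and regenerate the affected embeddings.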

clear_embeddings

clear_embeddings(session: Session, *, document_id: int | None = None, session_id: int | None = None) -> int

Clear embeddings for parts.

Parameters:
  • session (Session) –

    Database session.

  • document_id (int | None, default: None ) –

    Optional document ID to filter by.

  • session_id (int | None, default: None ) –

    Optional session ID to filter by.

Returns:
  • int

    Number of parts cleared.

search_document_parts

search_document_parts(session: Session, query: str, document_id: int, limit: int = 50, language: str | None = None, embedding_provider: EmbeddingProvider | None = None) -> SearchResults[DocumentPartModel]

Search document parts for a query.

search_reference_parts

search_reference_parts(session: Session, query: str, limit: int = 50, language: str | None = None, source_id: str | None = None, embedding_provider: EmbeddingProvider | None = None) -> SearchResults[ReferencePartModel]

Search reference parts for a query.