Skip to content

LLM Provider Architecture

Architectural reference for VIBE's LLM provider abstraction layer. Optimized for LLM consumption.

See also:

  • assistant.md - AI-assisted interview system (uses providers, streaming + tools)
  • review.md - Document review (uses providers, structured single-shot)

1. OVERVIEW

The provider system decouples caller logic from specific LLM APIs via a single base class. All providers implement the same LLMProvider interface; callers pick the surface that matches their workload.

Key Insight: The same LLMProvider base class serves two distinct workloads:

  • Assistant: Streaming + tool-calling via stream_generate() (chat with tool calls, dev replay).
  • Review: Structured JSON single-shot via generate_structured() (relevance filtering, compliance aggregation, question-answer suggestion). Wraps the per-attempt _generate_structured_once in a retry loop with truncation-aware max_tokens growth and parse-failure classification.

Earlier revisions had a separate vibe/review/llm.py::BaseLLMClient for the Review side; that module has been removed and Review now goes through LLMProvider. Mock providers for tests live alongside their callers — vibe/providers/llm/mock.py for the assistant streaming path, vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider for the structured-output path.

2. BASE PROVIDER

Location: vibe/providers/llm/base.py

LLMProvider abstract class defines:

  • stream_generate() -- Returns Generator[StreamChunk] (assistant streaming surface)
  • generate_structured() -- Single-shot JSON-schema-constrained call (review surface; see "Structured Output Surface" below)
  • _generate_structured_once() -- Per-attempt method subclasses override; generate_structured() wraps this in retry/truncation logic
  • get_capabilities() -- Returns ProviderCapabilities (frozen dataclass: structured_output, streaming, tools, streaming_tools, chat)
  • Message converter (via MessageConverter subclass)
  • Session recording/playback (see Section 5)

ProviderConfig dataclass:

  • model, temperature, max_tokens, timeout, api_key, base_url
  • tools_config: bool | None -- Override provider's default tool support
  • get_effective_tools_enabled() resolves: explicit config > provider capability default

ProviderWithConfig dataclass (returned by ProviderFactory.create()):

  • provider: LLMProvider, endpoint_config: dict, endpoint_name: str

2.1 Structured Output Surface

Used by VIBE Review for relevance filtering, compliance aggregation, and question-answer suggestion.

StructuredOutput dataclass — result of one generate_structured call:

  • data: Any — Parsed JSON object conforming to the requested schema (or None when the response was truncated/malformed; callers should consult finish_reason and raw_content in that case).
  • model: str, usage: UsageStats — bookkeeping.
  • finish_reason: str | None — provider stop reason ("stop" / "length" / "tool_calls" / etc.). The retry layer uses "length" to detect max_tokens truncation.
  • raw_content: str | None — raw response body, kept for diagnostics.
  • reasoning: str | None — chain-of-thought text emitted by reasoning models (gpt-oss, o-series, Claude extended thinking) when the provider exposes it. Captured but not acted on by the base layer; surface in dev UI.

Exceptions (all inherit StructuredOutputError):

  • StructuredOutputTruncatedError — Raised when finish_reason == "length" or the parser hit unbalanced braces. The retry loop doubles max_tokens (capped at 32 768) and retries.
  • StructuredOutputParseError — Raised when the response was 200 but couldn't be parsed (e.g., a reverse-proxy HTML page leaked through, or the model emitted prose instead of JSON). Retried as-is on the first occurrence; on a free-form attempt, escalates to schema-mode on the next try.

Retry & fallback behaviour of generate_structured:

  • Default 2 retries (3 total attempts). Initial backoff 1 s, capped at 10 s.
  • Free-form first, schema-constrained on fallback. First attempt sends no response_format; vLLM's grammar path takes requests off the speculative-decoding + prefix-cache fast lanes, so unconstrained calls finish substantially faster when the model gets the JSON shape right on its own. On any parse/truncation failure, subsequent attempts re-issue with response_format: json_schema.
  • Truncation-aware max_tokens growth. On StructuredOutputTruncatedError (and after escalating to schema-mode if not already), the loop doubles current_max_tokens up to _STRUCTURED_MAX_TOKENS_HARD_CAP = 32_768.
  • Failure classification. _raise_classified_parse_failure(output) distinguishes truncated (raise StructuredOutputTruncatedError) from parse-failed (raise StructuredOutputParseError) by inspecting finish_reason and the trailing character of raw_content. _is_structured_retryable(exc) classifies httpx/openai exceptions: 429/5xx and connection/timeout errors are retryable; everything else (auth, schema validation, 4xx other than 429) propagates immediately.
  • Reasoning prefix. When a reasoning effort is in effect, a Reasoning: <level>\n\n prefix is prepended to the system prompt — this is the only knob Berget's vllm router honours at full strength on gpt-oss models. Provider-specific extras (OpenAI's top-level reasoning_effort, Berget's extra_body.reasoning) still apply on top.

3. AVAILABLE PROVIDERS

Provider Location Notes
OpenAI vibe/providers/llm/openai.py GPT models
Gemini vibe/providers/llm/gemini.py Thinking mode support
Anthropic vibe/providers/llm/anthropic.py Claude models
Ollama vibe/providers/llm/ollama.py Local models
Mistral vibe/providers/llm/mistral.py Mistral models
Mock vibe/providers/llm/mock.py Testing with configurable responses
SystemProxyProvider vibe/providers/llm/system_proxy_provider.py Wraps any provider, emits system questions as tool calls before delegating

SystemProxyProvider: Composition pattern -- on sequence 1 checks for pending system questions and emits as ask_question tool calls. On sequence 2+ delegates to real provider. Chunks marked proxy_generated=True.

4. MESSAGE CONVERSION

Location: vibe/providers/llm/message_converter.py

Each provider has a different message format. The MessageConverter base class uses @singledispatchmethod for type-based dispatch.

Converters:

  • InternalFormatConverter -- Uses message_to_dict() for internal format
  • IdentityConverter -- Returns Message objects unchanged (MockProvider)
  • Provider-specific: OpenAIChatConverter, AnthropicMessageConverter, etc. (in respective modules)

Principle: Message classes remain pure data containers. Each provider defines its own converter without touching Message classes.

5. CONFIGURATION & REPLAY

Endpoints defined in config.yml:

llm_endpoints:
  gpt4:
    provider: "vibe.providers.llm.openai.OpenAIProvider"
    config:
      model: "gpt-4-turbo"
      api_key: "${OPENAI_API_KEY}"
      tools: true

Dev Mode Features:

  • Endpoint switching via ?endpoint=...
  • Tools toggle via ?tools=0
  • Recording via ?record=name (saves JSONL to data/logs/assistant/)
  • Playback via ?playback=name (no API calls)

Replay System: Records JSONL with request/response entries. Playback config:

config:
  playback_from_file: "data/logs/assistant/llm_20241201.jsonl"
  playback_session_id: "abc123"     # Optional filter
  playback_sequence: 2              # Optional specific turn
  playback_delay_ms: 50             # Simulate streaming delay

Each provider overrides _recorded_payload_to_native() to convert recorded JSON back to SDK-specific objects.

6. FILE LOCATION INDEX

What Where
Base class vibe/providers/llm/base.py::LLMProvider
ProviderConfig vibe/providers/llm/base.py::ProviderConfig
ProviderCapabilities vibe/providers/llm/base.py::ProviderCapabilities
Structured output result vibe/providers/llm/base.py::StructuredOutput
Structured output errors vibe/providers/llm/base.py::StructuredOutputError, StructuredOutputTruncatedError, StructuredOutputParseError
Structured retry helpers vibe/providers/llm/base.py::_raise_classified_parse_failure, _is_structured_retryable
Structured public surface vibe/providers/llm/base.py::LLMProvider.generate_structured, _generate_structured_once, describe_structured_request
Stream chunks vibe/providers/llm/types.py::ChunkType, StreamChunk (re-exported from base.py)
Message converter vibe/providers/llm/message_converter.py::MessageConverter
Implementations vibe/providers/llm/{openai,gemini,anthropic,ollama,mistral,mock}.py
System proxy vibe/providers/llm/system_proxy_provider.py
Tool definitions vibe/providers/llm/tools.py
Provider factory vibe/assistant/services/provider_factory.py::ProviderFactory
Mock provider (review) vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider

Document Version: 1.1 Last Updated: 2026-04-28 Notes: Documented the structured-output surface (generate_structured, _generate_structured_once, StructuredOutput, StructuredOutputError/Truncated/Parse, retry/truncation/free-form-fallback behaviour). Removed stale "Review uses a separate vibe/review/llm.py client" framing — Review now goes through LLMProvider. Added MockReviewLLMProvider row.