LLM Provider Architecture

Architectural reference for VIBE's LLM provider abstraction layer. Optimized for LLM consumption.

See also:

assistant.md - AI-assisted interview system (uses providers, streaming + tools)
review.md - Document review (uses providers, structured single-shot)

1. OVERVIEW

The provider system decouples caller logic from specific LLM APIs via a single base class. All providers implement the same LLMProvider interface; callers pick the surface that matches their workload.

Key Insight: The same LLMProvider base class serves two distinct workloads:

Assistant: Streaming + tool-calling via stream_generate() (chat with tool calls, dev replay).
Review: Structured JSON single-shot via generate_structured() (relevance filtering, compliance aggregation, question-answer suggestion). Wraps the per-attempt _generate_structured_once in a retry loop with truncation-aware max_tokens growth and parse-failure classification.

Earlier revisions had a separate vibe/review/llm.py::BaseLLMClient for the Review side; that module has been removed and Review now goes through LLMProvider. Mock providers for tests live alongside their callers — vibe/providers/llm/mock.py for the assistant streaming path, vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider for the structured-output path.

2. BASE PROVIDER

Location: vibe/providers/llm/base.py

LLMProvider abstract class defines:

stream_generate() -- Returns Generator[StreamChunk] (assistant streaming surface)
generate_structured() -- Single-shot JSON-schema-constrained call (review surface; see "Structured Output Surface" below)
_generate_structured_once() -- Per-attempt method subclasses override; generate_structured() wraps this in retry/truncation logic
get_capabilities() -- Returns ProviderCapabilities (frozen dataclass: structured_output, streaming, tools, streaming_tools, chat)
Message converter (via MessageConverter subclass)
Session recording/playback (see Section 5)

ProviderConfig dataclass:

model, temperature, max_tokens, timeout, api_key, base_url
tools_config: bool | None -- Override provider's default tool support
get_effective_tools_enabled() resolves: explicit config > provider capability default
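
The resolution order above (explicit config first, provider capability default second) can be sketched as follows. This is a minimal illustration, not the code in base.py: the field names follow the lists above, but every default value and the `capabilities` parameter on `get_effective_tools_enabled()` are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderCapabilities:
    # Field names are taken from the text; the defaults are assumptions.
    structured_output: bool = False
    streaming: bool = True
    tools: bool = False
    streaming_tools: bool = False
    chat: bool = True


@dataclass
class ProviderConfig:
    model: str
    temperature: float = 0.0          # assumed default
    max_tokens: int = 1024            # assumed default
    timeout: float = 60.0             # assumed default
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    tools_config: Optional[bool] = None  # None means "defer to the provider"

    def get_effective_tools_enabled(self, capabilities: ProviderCapabilities) -> bool:
        # Assumed signature: the real method may obtain capabilities differently.
        if self.tools_config is not None:
            return self.tools_config      # explicit config wins
        return capabilities.tools         # otherwise use the capability default
```

With this shape, `tools_config=False` switches tools off even for a provider that supports them, while leaving it unset inherits whatever the provider reports.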

ProviderWithConfig dataclass (returned by ProviderFactory.create()):

provider: LLMProvider, endpoint_config: dict, endpoint_name: str

2.1 Structured Output Surface

Used by VIBE Review for relevance filtering, compliance aggregation, and question-answer suggestion.

StructuredOutput dataclass — result of one generate_structured call:

data: Any — Parsed JSON object conforming to the requested schema (or None when the response was truncated/malformed; callers should consult finish_reason and raw_content in that case).
model: str, usage: UsageStats — bookkeeping.
finish_reason: str | None — provider stop reason ("stop" / "length" / "tool_calls" / etc.). The retry layer uses "length" to detect max_tokens truncation.
raw_content: str | None — raw response body, kept for diagnostics.
reasoning: str | None — chain-of-thought text emitted by reasoning models (gpt-oss, o-series, Claude extended thinking) when the provider exposes it. Captured but not acted on by the base layer; surface in dev UI.
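
As a rough picture of the result type, the fields above could be laid out like this. The `UsageStats` fields and the `describe()` helper are hypothetical (the text names only the type), and the real dataclass in base.py may differ in defaults and ordering.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class UsageStats:
    # Hypothetical fields; the source only names the type.
    prompt_tokens: int = 0
    completion_tokens: int = 0


@dataclass
class StructuredOutput:
    data: Any                             # parsed JSON, or None on truncation/malformed output
    model: str
    usage: UsageStats
    finish_reason: Optional[str] = None   # e.g. "stop", "length", "tool_calls"
    raw_content: Optional[str] = None     # kept for diagnostics
    reasoning: Optional[str] = None       # chain-of-thought text, if exposed


def describe(output: StructuredOutput) -> str:
    """Hypothetical caller-side helper: summarize the state of a result."""
    if output.data is not None:
        return "ok"
    if output.finish_reason == "length":
        return "truncated"
    return "malformed"
```

A caller that receives `data=None` would then look at `finish_reason` and `raw_content` before deciding whether to retry or report the failure.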

Exceptions (all inherit StructuredOutputError):

StructuredOutputTruncatedError -- Raised when finish_reason == "length" or the parser hit unbalanced braces. The retry loop doubles max_tokens (capped at 32,768) and retries.
StructuredOutputParseError — Raised when the response was 200 but couldn't be parsed (e.g., a reverse-proxy HTML page leaked through, or the model emitted prose instead of JSON). Retried as-is on the first occurrence; on a free-form attempt, escalates to schema-mode on the next try.
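
The hierarchy and the truncated-versus-parse-failed distinction might look roughly like the sketch below. The class names come from the text; `classify_failure()` is a hypothetical stand-in for the real classifier, and its brace count is a deliberately crude heuristic that ignores braces inside JSON strings.

```python
from typing import Optional


class StructuredOutputError(Exception):
    """Base class for structured-output failures."""


class StructuredOutputTruncatedError(StructuredOutputError):
    """The response was cut off before the JSON was complete."""


class StructuredOutputParseError(StructuredOutputError):
    """The response arrived but could not be parsed as JSON."""


def classify_failure(finish_reason: Optional[str],
                     raw_content: Optional[str]) -> StructuredOutputError:
    # Hypothetical helper: returns the exception a caller would raise.
    text = (raw_content or "").strip()
    if finish_reason == "length":
        return StructuredOutputTruncatedError("finish_reason=length")
    # Crude heuristic: more opening than closing braces suggests a cut-off object.
    if text.startswith("{") and text.count("{") > text.count("}"):
        return StructuredOutputTruncatedError("unbalanced braces")
    return StructuredOutputParseError("response is not valid JSON")
```

Because both errors share a base class, callers that do not care about the distinction can catch `StructuredOutputError` alone.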

Retry & fallback behaviour of generate_structured:

Default 2 retries (3 total attempts). Initial backoff 1 s, capped at 10 s.
Free-form first, schema-constrained on fallback. First attempt sends no response_format; vLLM's grammar path takes requests off the speculative-decoding + prefix-cache fast lanes, so unconstrained calls finish substantially faster when the model gets the JSON shape right on its own. On any parse/truncation failure, subsequent attempts re-issue with response_format: json_schema.
Truncation-aware max_tokens growth. On StructuredOutputTruncatedError, the loop escalates to schema mode if it has not already done so and doubles current_max_tokens, up to _STRUCTURED_MAX_TOKENS_HARD_CAP = 32_768.
Failure classification. _raise_classified_parse_failure(output) distinguishes truncated (raise StructuredOutputTruncatedError) from parse-failed (raise StructuredOutputParseError) by inspecting finish_reason and the trailing character of raw_content. _is_structured_retryable(exc) classifies httpx/openai exceptions: 429/5xx and connection/timeout errors are retryable; everything else (auth, schema validation, 4xx other than 429) propagates immediately.
Reasoning prefix. When a reasoning effort is in effect, a Reasoning: <level>\n\n prefix is prepended to the system prompt. This is the only knob Berget's vLLM router honours at full strength on gpt-oss models. Provider-specific extras (OpenAI's top-level reasoning_effort, Berget's extra_body.reasoning) still apply on top.
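
The retry behaviour listed above can be condensed into a small loop. This is a sketch under stated assumptions rather than the code in base.py: `generate_with_retry`, the `attempt` callable, and the two stand-in exception classes are hypothetical, and the doubling backoff schedule is assumed (the text gives only the 1 s start and the 10 s cap).

```python
import time
from typing import Callable

HARD_CAP = 32_768  # mirrors _STRUCTURED_MAX_TOKENS_HARD_CAP from the text


class TruncatedError(Exception):
    """Stand-in for StructuredOutputTruncatedError."""


class ParseError(Exception):
    """Stand-in for StructuredOutputParseError."""


def generate_with_retry(
    attempt: Callable[[int, bool], dict],   # (max_tokens, use_schema) -> parsed JSON
    max_tokens: int,
    retries: int = 2,                       # 3 attempts in total
    initial_backoff: float = 1.0,
    max_backoff: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    use_schema = False                      # first attempt is free-form
    backoff = initial_backoff
    for attempt_no in range(retries + 1):
        try:
            return attempt(max_tokens, use_schema)
        except (TruncatedError, ParseError) as exc:
            if attempt_no == retries:
                raise                       # out of attempts: propagate the last error
            if isinstance(exc, TruncatedError):
                max_tokens = min(max_tokens * 2, HARD_CAP)
            use_schema = True               # later attempts are schema-constrained
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)  # assumed doubling schedule
    raise RuntimeError("unreachable")
```

Injecting `sleep` keeps the loop testable without real delays; a production version would also need the retryable-HTTP-error branch described under failure classification.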

3. AVAILABLE PROVIDERS

Provider	Location	Notes
OpenAI	`vibe/providers/llm/openai.py`	GPT models
Gemini	`vibe/providers/llm/gemini.py`	Thinking mode support
Anthropic	`vibe/providers/llm/anthropic.py`	Claude models
Ollama	`vibe/providers/llm/ollama.py`	Local models
Mistral	`vibe/providers/llm/mistral.py`	Mistral models
Mock	`vibe/providers/llm/mock.py`	Testing with configurable responses
SystemProxyProvider	`vibe/providers/llm/system_proxy_provider.py`	Wraps any provider, emits system questions as tool calls before delegating

SystemProxyProvider: Composition pattern -- on sequence 1 it checks for pending system questions and emits them as ask_question tool calls. On sequence 2+ it delegates to the real provider. Chunks it generates itself are marked proxy_generated=True.
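
The composition pattern can be illustrated with a stripped-down wrapper. Everything here is hypothetical scaffolding: `Chunk` is a simplified stand-in for StreamChunk, `EchoProvider` replaces a real provider, the `sequence` argument is passed explicitly for clarity, and delegating on sequence 1 when nothing is pending is an assumption.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Chunk:
    # Hypothetical, simplified stand-in for StreamChunk.
    kind: str
    payload: str
    proxy_generated: bool = False


class EchoProvider:
    """Stand-in for a real provider."""

    def stream_generate(self, prompt: str) -> Iterator[Chunk]:
        yield Chunk("text", f"echo: {prompt}")


class SystemProxy:
    """Wraps another provider and answers the first turn itself when questions are pending."""

    def __init__(self, inner: EchoProvider, pending_questions: List[str]):
        self._inner = inner
        self._pending = list(pending_questions)

    def stream_generate(self, prompt: str, sequence: int) -> Iterator[Chunk]:
        if sequence == 1 and self._pending:
            # Emit each pending system question as a proxy-generated tool call.
            for question in self._pending:
                yield Chunk("tool_call", f"ask_question:{question}", proxy_generated=True)
            self._pending.clear()
            return
        # Otherwise hand the turn to the wrapped provider unchanged.
        yield from self._inner.stream_generate(prompt)
```

Because the proxy holds the real provider rather than subclassing it, the same wrapper works for any provider in the table above.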

4. MESSAGE CONVERSION

Location: vibe/providers/llm/message_converter.py

Each provider has a different message format. The MessageConverter base class uses @singledispatchmethod for type-based dispatch.

Converters:

InternalFormatConverter -- Uses message_to_dict() for internal format
IdentityConverter -- Returns Message objects unchanged (MockProvider)
Provider-specific: OpenAIChatConverter, AnthropicMessageConverter, etc. (in respective modules)

Principle: Message classes remain pure data containers. Each provider defines its own converter without touching Message classes.
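
A small sketch of type-based dispatch with `functools.singledispatchmethod` shows how a converter can grow new message types without the message classes knowing about it. The message classes and `ChatStyleConverter` below are hypothetical; only the dispatch mechanism is taken from the text.

```python
from dataclasses import dataclass
from functools import singledispatchmethod


@dataclass
class UserMessage:            # hypothetical message type: pure data
    text: str


@dataclass
class ToolResultMessage:      # hypothetical message type: pure data
    tool_call_id: str
    content: str


class ChatStyleConverter:
    """Hypothetical converter targeting a role/content dict format."""

    @singledispatchmethod
    def convert(self, message):
        # Fallback for types with no registered handler.
        raise TypeError(f"unsupported message type: {type(message).__name__}")

    @convert.register
    def _(self, message: UserMessage):
        return {"role": "user", "content": message.text}

    @convert.register
    def _(self, message: ToolResultMessage):
        return {
            "role": "tool",
            "tool_call_id": message.tool_call_id,
            "content": message.content,
        }
```

Dispatch happens on the type of the first argument after `self`, so supporting another provider means writing another converter class, not editing the message dataclasses.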

5. CONFIGURATION & REPLAY

Endpoints defined in config.yml:

llm_endpoints:
  gpt4:
    provider: "vibe.providers.llm.openai.OpenAIProvider"
    config:
      model: "gpt-4-turbo"
      api_key: "${OPENAI_API_KEY}"
      tools: true
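
One plausible way to turn such an endpoint entry into a class and a resolved config is sketched below. Both helpers are hypothetical and are not taken from ProviderFactory; the sketch assumes the provider value is an importable dotted path and that `${NAME}` placeholders refer to environment variables.

```python
import importlib
import os


def resolve_provider_class(dotted_path: str):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def expand_env(config: dict) -> dict:
    """Replace ${NAME} placeholders in string values with environment values."""
    # os.path.expandvars leaves unknown variable names untouched.
    return {
        key: os.path.expandvars(value) if isinstance(value, str) else value
        for key, value in config.items()
    }
```

Non-string values such as `tools: true` pass through unchanged, so the expanded dict can be handed to the provider's config object as is.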

Dev Mode Features:

Endpoint switching via ?endpoint=...
Tools toggle via ?tools=0
Recording via ?record=name (saves JSONL to data/logs/assistant/)
Playback via ?playback=name (no API calls)

Replay System: Records JSONL with request/response entries. Playback config:

config:
  playback_from_file: "data/logs/assistant/llm_20241201.jsonl"
  playback_session_id: "abc123"     # Optional filter
  playback_sequence: 2              # Optional specific turn
  playback_delay_ms: 50             # Simulate streaming delay

Each provider overrides _recorded_payload_to_native() to convert recorded JSON back to SDK-specific objects.
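
Selecting entries from a recording by session and turn might look like the following. The JSONL field names (`session_id`, `sequence`, `type`) are assumptions chosen to match the playback options above; the actual recording schema is not given in this document.

```python
import json
from typing import Iterable, Iterator, Optional


def select_entries(
    lines: Iterable[str],
    session_id: Optional[str] = None,
    sequence: Optional[int] = None,
) -> Iterator[dict]:
    """Yield recorded entries matching the optional session and sequence filters."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines in the log
        entry = json.loads(line)
        if session_id is not None and entry.get("session_id") != session_id:
            continue
        if sequence is not None and entry.get("sequence") != sequence:
            continue
        yield entry
```

Passing an open file object as `lines` streams the log lazily, which keeps memory flat for long recordings.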

6. FILE LOCATION INDEX

What	Where
Base class	`vibe/providers/llm/base.py::LLMProvider`
ProviderConfig	`vibe/providers/llm/base.py::ProviderConfig`
ProviderCapabilities	`vibe/providers/llm/base.py::ProviderCapabilities`
Structured output result	`vibe/providers/llm/base.py::StructuredOutput`
Structured output errors	`vibe/providers/llm/base.py::StructuredOutputError, StructuredOutputTruncatedError, StructuredOutputParseError`
Structured retry helpers	`vibe/providers/llm/base.py::_raise_classified_parse_failure, _is_structured_retryable`
Structured public surface	`vibe/providers/llm/base.py::LLMProvider.generate_structured, _generate_structured_once, describe_structured_request`
Stream chunks	`vibe/providers/llm/types.py::ChunkType, StreamChunk` (re-exported from `base.py`)
Message converter	`vibe/providers/llm/message_converter.py::MessageConverter`
Implementations	`vibe/providers/llm/{openai,gemini,anthropic,ollama,mistral,mock}.py`
System proxy	`vibe/providers/llm/system_proxy_provider.py`
Tool definitions	`vibe/providers/llm/tools.py`
Provider factory	`vibe/assistant/services/provider_factory.py::ProviderFactory`
Mock provider (review)	`vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider`

Document Version: 1.1
Last Updated: 2026-04-28
Notes: Documented the structured-output surface (generate_structured, _generate_structured_once, StructuredOutput, StructuredOutputError/Truncated/Parse, retry/truncation/free-form-fallback behaviour). Removed stale "Review uses a separate vibe/review/llm.py client" framing; Review now goes through LLMProvider. Added MockReviewLLMProvider row.