LLM Provider Architecture¶
Architectural reference for VIBE's LLM provider abstraction layer. Optimized for LLM consumption.
See also:
assistant.md- AI-assisted interview system (uses providers, streaming + tools)review.md- Document review (uses providers, structured single-shot)
1. OVERVIEW¶
The provider system decouples caller logic from specific LLM APIs via a single base class. All providers implement the same LLMProvider interface; callers pick the surface that matches their workload.
Key Insight: The same LLMProvider base class serves two distinct workloads:
- Assistant: Streaming + tool-calling via
stream_generate()(chat with tool calls, dev replay). - Review: Structured JSON single-shot via
generate_structured()(relevance filtering, compliance aggregation, question-answer suggestion). Wraps the per-attempt_generate_structured_oncein a retry loop with truncation-awaremax_tokensgrowth and parse-failure classification.
Earlier revisions had a separate vibe/review/llm.py::BaseLLMClient for the Review side; that module has been removed and Review now goes through LLMProvider. Mock providers for tests live alongside their callers — vibe/providers/llm/mock.py for the assistant streaming path, vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider for the structured-output path.
2. BASE PROVIDER¶
Location: vibe/providers/llm/base.py
LLMProvider abstract class defines:
stream_generate()-- ReturnsGenerator[StreamChunk](assistant streaming surface)generate_structured()-- Single-shot JSON-schema-constrained call (review surface; see "Structured Output Surface" below)_generate_structured_once()-- Per-attempt method subclasses override;generate_structured()wraps this in retry/truncation logicget_capabilities()-- ReturnsProviderCapabilities(frozen dataclass:structured_output,streaming,tools,streaming_tools,chat)- Message converter (via
MessageConvertersubclass) - Session recording/playback (see Section 5)
ProviderConfig dataclass:
model,temperature,max_tokens,timeout,api_key,base_urltools_config: bool | None-- Override provider's default tool supportget_effective_tools_enabled()resolves: explicit config > provider capability default
ProviderWithConfig dataclass (returned by ProviderFactory.create()):
provider: LLMProvider,endpoint_config: dict,endpoint_name: str
2.1 Structured Output Surface¶
Used by VIBE Review for relevance filtering, compliance aggregation, and question-answer suggestion.
StructuredOutput dataclass — result of one generate_structured call:
data: Any— Parsed JSON object conforming to the requested schema (orNonewhen the response was truncated/malformed; callers should consultfinish_reasonandraw_contentin that case).model: str,usage: UsageStats— bookkeeping.finish_reason: str | None— provider stop reason ("stop"/"length"/"tool_calls"/ etc.). The retry layer uses"length"to detectmax_tokenstruncation.raw_content: str | None— raw response body, kept for diagnostics.reasoning: str | None— chain-of-thought text emitted by reasoning models (gpt-oss, o-series, Claude extended thinking) when the provider exposes it. Captured but not acted on by the base layer; surface in dev UI.
Exceptions (all inherit StructuredOutputError):
StructuredOutputTruncatedError— Raised whenfinish_reason == "length"or the parser hit unbalanced braces. The retry loop doublesmax_tokens(capped at 32 768) and retries.StructuredOutputParseError— Raised when the response was 200 but couldn't be parsed (e.g., a reverse-proxy HTML page leaked through, or the model emitted prose instead of JSON). Retried as-is on the first occurrence; on a free-form attempt, escalates to schema-mode on the next try.
Retry & fallback behaviour of generate_structured:
- Default 2 retries (3 total attempts). Initial backoff 1 s, capped at 10 s.
- Free-form first, schema-constrained on fallback. First attempt sends no
response_format; vLLM's grammar path takes requests off the speculative-decoding + prefix-cache fast lanes, so unconstrained calls finish substantially faster when the model gets the JSON shape right on its own. On any parse/truncation failure, subsequent attempts re-issue withresponse_format: json_schema. - Truncation-aware
max_tokensgrowth. OnStructuredOutputTruncatedError(and after escalating to schema-mode if not already), the loop doublescurrent_max_tokensup to_STRUCTURED_MAX_TOKENS_HARD_CAP = 32_768. - Failure classification.
_raise_classified_parse_failure(output)distinguishes truncated (raiseStructuredOutputTruncatedError) from parse-failed (raiseStructuredOutputParseError) by inspectingfinish_reasonand the trailing character ofraw_content._is_structured_retryable(exc)classifies httpx/openai exceptions: 429/5xx and connection/timeout errors are retryable; everything else (auth, schema validation, 4xx other than 429) propagates immediately. - Reasoning prefix. When a reasoning effort is in effect, a
Reasoning: <level>\n\nprefix is prepended to the system prompt — this is the only knob Berget's vllm router honours at full strength on gpt-oss models. Provider-specific extras (OpenAI's top-levelreasoning_effort, Berget'sextra_body.reasoning) still apply on top.
3. AVAILABLE PROVIDERS¶
| Provider | Location | Notes |
|---|---|---|
| OpenAI | vibe/providers/llm/openai.py |
GPT models |
| Gemini | vibe/providers/llm/gemini.py |
Thinking mode support |
| Anthropic | vibe/providers/llm/anthropic.py |
Claude models |
| Ollama | vibe/providers/llm/ollama.py |
Local models |
| Mistral | vibe/providers/llm/mistral.py |
Mistral models |
| Mock | vibe/providers/llm/mock.py |
Testing with configurable responses |
| SystemProxyProvider | vibe/providers/llm/system_proxy_provider.py |
Wraps any provider, emits system questions as tool calls before delegating |
SystemProxyProvider: Composition pattern -- on sequence 1 checks for pending system questions and emits as ask_question tool calls. On sequence 2+ delegates to real provider. Chunks marked proxy_generated=True.
4. MESSAGE CONVERSION¶
Location: vibe/providers/llm/message_converter.py
Each provider has a different message format. The MessageConverter base class uses @singledispatchmethod for type-based dispatch.
Converters:
InternalFormatConverter-- Usesmessage_to_dict()for internal formatIdentityConverter-- Returns Message objects unchanged (MockProvider)- Provider-specific:
OpenAIChatConverter,AnthropicMessageConverter, etc. (in respective modules)
Principle: Message classes remain pure data containers. Each provider defines its own converter without touching Message classes.
5. CONFIGURATION & REPLAY¶
Endpoints defined in config.yml:
llm_endpoints:
gpt4:
provider: "vibe.providers.llm.openai.OpenAIProvider"
config:
model: "gpt-4-turbo"
api_key: "${OPENAI_API_KEY}"
tools: true
Dev Mode Features:
- Endpoint switching via
?endpoint=... - Tools toggle via
?tools=0 - Recording via
?record=name(saves JSONL todata/logs/assistant/) - Playback via
?playback=name(no API calls)
Replay System: Records JSONL with request/response entries. Playback config:
config:
playback_from_file: "data/logs/assistant/llm_20241201.jsonl"
playback_session_id: "abc123" # Optional filter
playback_sequence: 2 # Optional specific turn
playback_delay_ms: 50 # Simulate streaming delay
Each provider overrides _recorded_payload_to_native() to convert recorded JSON back to SDK-specific objects.
6. FILE LOCATION INDEX¶
| What | Where |
|---|---|
| Base class | vibe/providers/llm/base.py::LLMProvider |
| ProviderConfig | vibe/providers/llm/base.py::ProviderConfig |
| ProviderCapabilities | vibe/providers/llm/base.py::ProviderCapabilities |
| Structured output result | vibe/providers/llm/base.py::StructuredOutput |
| Structured output errors | vibe/providers/llm/base.py::StructuredOutputError, StructuredOutputTruncatedError, StructuredOutputParseError |
| Structured retry helpers | vibe/providers/llm/base.py::_raise_classified_parse_failure, _is_structured_retryable |
| Structured public surface | vibe/providers/llm/base.py::LLMProvider.generate_structured, _generate_structured_once, describe_structured_request |
| Stream chunks | vibe/providers/llm/types.py::ChunkType, StreamChunk (re-exported from base.py) |
| Message converter | vibe/providers/llm/message_converter.py::MessageConverter |
| Implementations | vibe/providers/llm/{openai,gemini,anthropic,ollama,mistral,mock}.py |
| System proxy | vibe/providers/llm/system_proxy_provider.py |
| Tool definitions | vibe/providers/llm/tools.py |
| Provider factory | vibe/assistant/services/provider_factory.py::ProviderFactory |
| Mock provider (review) | vibe/review/services/mock_llm_provider.py::MockReviewLLMProvider |
Document Version: 1.1
Last Updated: 2026-04-28
Notes: Documented the structured-output surface (generate_structured, _generate_structured_once, StructuredOutput, StructuredOutputError/Truncated/Parse, retry/truncation/free-form-fallback behaviour). Removed stale "Review uses a separate vibe/review/llm.py client" framing — Review now goes through LLMProvider. Added MockReviewLLMProvider row.