# LLM Provider Configuration
Membrain acts as a unified gateway that routes requests to multiple LLM backends. This guide covers how the provider abstraction works, how to configure each supported provider, and how intelligent routing selects the best model for each request.
## Overview
Every LLM backend in Membrain implements the `BaseProvider` abstract class
defined in `src/membrain/providers/base.py`. The interface is intentionally
small:

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    @abstractmethod
    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        ...

    async def stream_chat(self, model, messages, temperature=None, max_tokens=None, **kwargs):
        # Default: wraps chat() as a single chunk. Override for true SSE streaming.
        ...

    async def close(self):
        # Release HTTP clients, connections, etc.
        ...
```
Key design decisions:

- All providers accept and return the OpenAI-compatible format defined in
  `src/membrain/schemas.py` (`ChatCompletionRequest` / `ChatCompletionResponse`).
  Providers that talk to non-OpenAI APIs (e.g., Anthropic) translate internally.
- `stream_chat()` has a default implementation that wraps the non-streaming
  `chat()` response as a single SSE chunk. Providers with native streaming
  (OpenAI) override this for true token-by-token delivery.
- `close()` is called during shutdown to release resources (httpx clients, etc.).
Clients send standard OpenAI-format requests to Membrain's
`/v1/chat/completions` endpoint. The gateway resolves the `model` field to a
provider, forwards the request, and returns the normalized response.
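For example, with plain `httpx` (assuming the gateway runs at `localhost:8000`, as in the curl examples later in this guide):

```python
import asyncio

import httpx

async def main():
    # Standard OpenAI-format request, sent to the Membrain gateway instead
    # of directly to a provider.
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        resp = await client.post(
            "/v1/chat/completions",
            json={
                "model": "claude-sonnet-4-20250514",
                "messages": [{"role": "user", "content": "Hello"}],
            },
        )
        print(resp.json()["choices"][0]["message"]["content"])

asyncio.run(main())
```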
## Provider Setup

### Anthropic

Source: `src/membrain/providers/anthropic.py`
The Anthropic provider calls the Anthropic Messages API
directly via httpx, translating between the OpenAI format and Anthropic's
native format.
Environment variables:

| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | Yes | Your Anthropic API key |
| `UPSTREAM_ANTHROPIC_URL` | No | Override the upstream URL (default: `https://api.anthropic.com`). Useful when running behind another proxy. |
Supported models:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `claude-opus-4-20250514` | $15.00 | $75.00 |
| `claude-sonnet-4-20250514` | $3.00 | $15.00 |
| `claude-haiku-4-5-20251001` | $0.80 | $4.00 |
Request translation:

Anthropic's Messages API differs from OpenAI's format in several ways. The provider handles this automatically:

- System messages are extracted from the messages array and sent as the
  top-level `system` field (Anthropic does not accept system messages inline).
- Content blocks in the response (`[{"type": "text", "text": "..."}]`) are
  concatenated into a single `content` string.
- Token usage fields are mapped from Anthropic's `input_tokens` /
  `output_tokens` to the OpenAI-compatible `prompt_tokens` / `completion_tokens`.
- `max_tokens` defaults to 4096 if not specified in the request (Anthropic
  requires this field).
```python
# Internal translation (simplified)
# System message is extracted and promoted to top-level field:
payload = {
    "model": request.model,
    "messages": [msg for msg in request.messages if msg.role != "system"],
    "max_tokens": request.max_tokens or 4096,
}
if system_message:
    payload["system"] = system_message
```
Streaming: Uses the default `stream_chat()` wrapper (single-chunk). The
Anthropic provider does not currently implement native SSE streaming.
### OpenAI

Source: `src/membrain/providers/openai.py`

The OpenAI provider talks to the standard `/v1/chat/completions` endpoint. Since
Membrain already uses the OpenAI format internally, requests are forwarded with
minimal transformation.
Environment variable:

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
Supported models:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `gpt-4o` | $2.50 | $10.00 |
| `gpt-4o-mini` | $0.15 | $0.60 |
| `gpt-4-turbo` | $10.00 | $30.00 |
| `o1` | $15.00 | $60.00 |
| `o3-mini` | $1.10 | $4.40 |
Streaming: The OpenAI provider implements native SSE streaming via
`stream_chat()`. It sends `"stream": True` in the request payload and yields
parsed JSON chunks as they arrive from the OpenAI SSE endpoint:

```python
async with self._client.stream("POST", "/chat/completions", json=body) as response:
    async for line in response.aiter_lines():
        if line.startswith("data: ") and line != "data: [DONE]":
            yield json.loads(line[6:])
```
Request handling: The request payload is sent as-is via
`request.model_dump(exclude_none=True)` -- no translation is needed since the
internal schema matches the OpenAI API format.
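In sketch form, the non-streaming path is essentially a pass-through (error handling omitted; an approximation, not the verbatim method):

```python
async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
    body = request.model_dump(exclude_none=True)  # already OpenAI-shaped
    resp = await self._client.post("/chat/completions", json=body)
    resp.raise_for_status()
    # The response body is already in Membrain's (OpenAI) schema.
    return ChatCompletionResponse(**resp.json())
```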
### Claude CLI

Source: `src/membrain/providers/claude_cli.py`

The Claude CLI provider shells out to the locally installed `claude` binary
(from Claude Code). It is automatically registered as a fallback when no
Anthropic API key is configured.
Prerequisites:

- The `claude` CLI must be installed and available on `$PATH`.
- No API key is required in Membrain's config (the CLI uses its own authentication).
Environment variable:

| Variable | Required | Description |
|---|---|---|
| (none) | -- | The CLI uses its own auth. Membrain sets `default_provider` and `default_model` in config. |
Config settings:

| Setting | Default | Description |
|---|---|---|
| `default_provider` | `"claude_cli"` | Default provider when no API keys are set |
| `default_model` | `"sonnet"` | Default model alias for the CLI |
Supported model aliases: `sonnet`, `opus`, `haiku`, `auto`, `default`
How it works:

- Chat messages are concatenated into a single prompt string. System messages
  are wrapped as `[System: ...]`, assistant messages as
  `[Previous assistant response: ...]`.
- The CLI is invoked as a subprocess (see the sketch after this list).
- The JSON output is parsed. The CLI returns `result`, `cost_usd`, `model`,
  `duration_ms`, etc.
- Token counts are estimated (the CLI does not report exact token usage):
  approximately `len(text) // 4`.
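A sketch of that subprocess call is below. The flag names (`-p`, `--model`, `--output-format json`) are assumptions based on the Claude Code CLI, not verbatim from `claude_cli.py`:

```python
import asyncio
import json

async def run_claude(prompt: str, model: str = "sonnet") -> dict:
    # Invoke the locally installed claude binary in non-interactive mode
    # and ask for machine-readable JSON output.
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model, "--output-format", "json",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    # Parsed fields include result, cost_usd, model, duration_ms, ...
    return json.loads(stdout)
```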
When to use it:
- Development/testing without API keys -- just have the Claude CLI installed.
- Local-first workflows where you want to leverage your existing Claude subscription without managing API keys.
- As an automatic fallback when the Anthropic API key is not configured.
### Ollama

Source: `src/membrain/providers/ollama.py`

The Ollama provider connects to a local Ollama instance for
running open-source LLMs entirely on your own hardware. Ollama exposes an
OpenAI-compatible `/v1/chat/completions` endpoint, so the implementation closely
mirrors the OpenAI provider.
Environment variable:

| Variable | Required | Description |
|---|---|---|
| `OLLAMA_URL` | No | Ollama server URL (default: `http://localhost:11434`) |
Setup:

1. Install Ollama: https://ollama.ai
2. Pull a model: `ollama pull llama3`
3. Start the server: `ollama serve`
4. Configure Membrain: set `OLLAMA_URL` if your Ollama server is not at the
   default `http://localhost:11434`.
Model naming convention:

Use the `ollama/` prefix when requesting Ollama models through Membrain. The
provider automatically strips this prefix before forwarding to the Ollama server:

```python
# Request to Membrain:
{"model": "ollama/llama3", "messages": [...]}

# Forwarded to Ollama as:
{"model": "llama3", "messages": [...]}
```
Privacy routing:

Ollama is the designated provider for privacy-sensitive requests. When a
client sends the `X-Membrain-Private: true` header, the router forces all
traffic to Ollama, ensuring no data leaves the local machine. See
Routing Configuration for details.
Pricing: Ollama models have no API cost (they run locally). They are not
included in the pricing table, and `estimate_cost()` returns $0.00 for them.
### LiteLLM

Source: `src/membrain/providers/litellm.py`

The LiteLLM provider uses the LiteLLM
library to access 100+ LLM backends through a unified interface. It is
auto-registered when the `litellm` package is detected at startup.
Installation:

Install the optional dependency with `pip install litellm`. The import is
deferred to `__init__`, so the rest of the Membrain codebase never
hard-depends on litellm. If the package is not installed, the provider is
simply not registered.
Configuration:

LiteLLM reads provider credentials from standard environment variables
(`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, etc.). Consult the
LiteLLM docs for the full list.

You can also pass an API key override directly:
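One way such an override can reach the backend: `litellm.acompletion()` accepts a per-call `api_key` argument that takes precedence over the environment variable (a sketch of the mechanism; how Membrain's provider surfaces the override may differ):

```python
import asyncio

import litellm

async def main():
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Hello"}],
        api_key="sk-ant-...",  # placeholder; overrides ANTHROPIC_API_KEY for this call
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```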
Supported backends include:
- OpenAI, Azure OpenAI
- Anthropic
- Google (Gemini / PaLM)
- Cohere, AI21, Replicate
- Hugging Face Inference
- AWS Bedrock, Google Vertex AI
- And many more (100+ backends)
How it works:

LiteLLM uses model name prefixes to determine which backend to call. For example:

| Model string | Backend |
|---|---|
| `gpt-4o` | OpenAI |
| `claude-sonnet-4-20250514` | Anthropic |
| `gemini/gemini-pro` | Google Gemini |
| `bedrock/anthropic.claude-3-sonnet` | AWS Bedrock |
| `azure/gpt-4` | Azure OpenAI |
The provider calls `litellm.acompletion()`, which returns an OpenAI-compatible
`ModelResponse` object that is then normalized to Membrain's schema.
Fallback role:
In the provider registry, LiteLLM acts as a catch-all fallback for models that don't match any specific provider prefix. The resolution order for unknown model prefixes is: LiteLLM -> OpenAI -> Claude CLI.
## Routing Configuration

Source: `src/membrain/routing/router.py`

Membrain's `Router` class provides intelligent model selection based on quality
tier, cost constraints, privacy requirements, and provider availability. Routing
is activated when the client requests `model: "auto"`.
### Tier-Based Routing

Set the tier via the `X-Membrain-Tier` header:

| Tier | Models | Use Case |
|---|---|---|
| `fast` | `gpt-4o-mini`, `claude-haiku-4-5-20251001` | Low-latency, cost-efficient tasks |
| `balanced` | `gpt-4o`, `claude-sonnet-4-20250514` | Default -- good quality/cost balance |
| `best` | `o1`, `claude-opus-4-20250514` | Maximum quality, complex reasoning |
Example request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Tier: fast" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```
### Cost-Constrained Routing

Set a maximum cost per 1K tokens via the `X-Membrain-Max-Cost` header. The
router finds the cheapest available model that fits within the budget:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Max-Cost: 0.001" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```
The cost calculation uses the average of input and output prices:
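The ordering below can be reproduced with a few lines (the helper name here is illustrative, not the actual function in the router):

```python
# Average of input and output price (USD per 1M tokens), converted to per-1K.
def avg_cost_per_1k(input_price: float, output_price: float) -> float:
    return (input_price + output_price) / 2 / 1000

print(avg_cost_per_1k(0.15, 0.60))   # gpt-4o-mini -> 0.000375
print(avg_cost_per_1k(2.50, 10.00))  # gpt-4o      -> 0.00625
```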
With the current pricing table, the cost ordering (cheapest first) is:

| Model | Avg cost per 1K tokens |
|---|---|
| `gpt-4o-mini` | $0.000375 |
| `claude-haiku-4-5-20251001` | $0.0024 |
| `o3-mini` | $0.00275 |
| `gpt-4o` | $0.00625 |
| `claude-sonnet-4-20250514` | $0.009 |
| `gpt-4-turbo` | $0.02 |
| `o1` | $0.0375 |
| `claude-opus-4-20250514` | $0.045 |

With the `X-Membrain-Max-Cost: 0.001` example above, only `gpt-4o-mini` ($0.000375 per 1K tokens) fits the budget, so the router selects it.
### Privacy Routing

Send the `X-Membrain-Private: true` header to force all traffic to local-only
models. No data will leave your machine:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Private: true" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Sensitive data here"}]}'
```
When private mode is active:

- If Ollama is registered, the router selects `ollama/llama3`.
- If Ollama is not available, it falls back to `local/default`.
- Cloud providers (OpenAI, Anthropic) are never used.
### Fallback Chains

When `model: "auto"` is used, the router builds an ordered fallback chain from
all models in the selected tier. If the primary model's provider fails, the
gateway tries the next model in the chain.

Fallback ordering is latency-aware: models are sorted by their provider's p50
latency (fastest first). The router tracks latency samples per provider (up to
100 recent samples) and computes p50 and p99 statistics.

```python
router = Router()

# After some requests have been processed:
chain = router.resolve_with_fallback(
    requested_model="auto",
    available_providers=["openai", "anthropic"],
    tier="balanced",
)
# Returns e.g.: ["gpt-4o", "claude-sonnet-4-20250514"]
# (ordered by provider latency)
```
### Latency Tracking

The router records per-provider latency after each request:

```python
router.record_latency("openai", duration_ms=320.5)
router.record_latency("anthropic", duration_ms=450.2)

stats = router.get_latency_stats("openai")
# {"p50": 320.5, "p99": 320.5, "count": 1}
```
Latency data is kept in memory (last 100 samples per provider) and used to order fallback chains so that faster providers are tried first.
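One plausible shape for that computation, using nearest-rank percentiles (an illustration, not the router's verbatim code):

```python
def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile over the retained samples.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct))
    return ordered[idx]

samples = [320.5]  # with one sample, p50 == p99 == 320.5, as above
print(percentile(samples, 0.50), percentile(samples, 0.99))
```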
## Model Resolution

Source: `src/membrain/providers/registry.py`

When a request arrives, `get_provider(model)` in the registry maps the model
name to a provider instance using prefix matching:

| Model prefix | Provider | Fallback |
|---|---|---|
| `ollama/` or `ollama:` | Ollama | -- |
| `gpt`, `o1`, `o3` | OpenAI | LiteLLM |
| `claude` | Anthropic | Claude CLI |
| `auto`, `default`, `sonnet`, `opus`, `haiku` | Claude CLI | Anthropic |
| (anything else) | LiteLLM | OpenAI -> Claude CLI |
The full resolution logic:

```python
def get_provider(model: str) -> BaseProvider:
    if model.startswith("ollama/") or model.startswith("ollama:"):
        return _providers.get("ollama")
    elif model.startswith("gpt") or model.startswith("o1") or model.startswith("o3"):
        return _providers.get("openai") or _providers.get("litellm")
    elif model.startswith("claude"):
        return _providers.get("anthropic") or _providers.get("claude_cli")
    elif model in ("auto", "default", "sonnet", "opus", "haiku"):
        return _providers.get("claude_cli") or _providers.get("anthropic")
    else:
        return _providers.get("litellm") or _providers.get("openai") or _providers.get("claude_cli")
```
## Provider Registration

Source: `src/membrain/providers/registry.py`

Providers are initialized at startup by `init_providers()`, which is called with
API keys and configuration from the `Settings` object:
```python
init_providers(
    openai_api_key=settings.openai_api_key,
    anthropic_api_key=settings.anthropic_api_key,
    ollama_url=settings.ollama_url,
)
```
Auto-registration rules:

| Provider | Registered when... |
|---|---|
| OpenAI | `OPENAI_API_KEY` is set and looks valid (>10 chars, not a placeholder containing `...`) |
| Anthropic | `ANTHROPIC_API_KEY` is set and looks valid |
| Claude CLI | Anthropic API key is not configured (automatic fallback) |
| Ollama | `OLLAMA_URL` is set |
| LiteLLM | The `litellm` Python package is importable |
You can check which providers are active at runtime:

```python
from membrain.providers.registry import list_providers

print(list_providers())  # e.g. ["openai", "anthropic", "ollama", "litellm"]
```
## Adding a Custom Provider

To add a new LLM backend, implement `BaseProvider` and register it in the
registry.

### Step 1: Create the provider class

Create a new file at `src/membrain/providers/my_provider.py`:
```python
import httpx

from membrain.providers.base import BaseProvider
from membrain.schemas import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatChoice,
    ChatMessage,
    Usage,
)


class MyProvider(BaseProvider):
    """Provider for the My LLM API."""

    def __init__(self, api_key: str):
        self._client = httpx.AsyncClient(
            base_url="https://api.myllm.com/v1",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=120.0,
        )

    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        # Translate request to your API's format
        payload = {
            "model": request.model,
            "messages": [
                {"role": m.role, "content": m.content}
                for m in request.messages
            ],
        }
        if request.temperature is not None:
            payload["temperature"] = request.temperature
        if request.max_tokens is not None:
            payload["max_tokens"] = request.max_tokens

        resp = await self._client.post("/chat/completions", json=payload)
        resp.raise_for_status()
        data = resp.json()

        # Normalize the response to Membrain's schema
        return ChatCompletionResponse(
            id=data["id"],
            created=data["created"],
            model=data["model"],
            choices=[
                ChatChoice(
                    index=c["index"],
                    message=ChatMessage(**c["message"]),
                    finish_reason=c.get("finish_reason"),
                )
                for c in data["choices"]
            ],
            usage=Usage(**data["usage"]),
        )

    async def close(self):
        await self._client.aclose()
```
### Step 2: Register the provider

Edit `src/membrain/providers/registry.py` to import and register your provider:

```python
from membrain.providers.my_provider import MyProvider

def init_providers(
    openai_api_key=None,
    anthropic_api_key=None,
    ollama_url=None,
    my_api_key=None,  # Add your key parameter
):
    # ... existing providers ...

    if _is_real_key(my_api_key):
        _providers["my_provider"] = MyProvider(api_key=my_api_key)
```
### Step 3: Add model resolution

In the `get_provider()` function, add prefix matching for your models, for
example as sketched below.
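A minimal branch, assuming your models share the `myllm` prefix (insert it before the final `else` in `get_provider()`):

```python
# In get_provider():
elif model.startswith("myllm"):
    return _providers.get("my_provider")
```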
### Step 4: Add configuration

Add an environment variable to `src/membrain/config.py`, for example:
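A sketch, assuming `Settings` follows the pydantic-settings convention in which a `my_api_key` field is populated from the `MY_API_KEY` environment variable:

```python
from pydantic_settings import BaseSettings  # assumption: pydantic-settings config

class Settings(BaseSettings):
    # ... existing fields ...
    my_api_key: str | None = None  # read from MY_API_KEY
```

Then pass `settings.my_api_key` through to `init_providers()` as in Step 2.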
### Step 5 (optional): Add pricing

Add your models to `src/membrain/cost/pricing.py`:

```python
MODEL_PRICING = {
    # ... existing models ...
    "myllm-standard": (1.00, 2.00),  # $1.00/1M input, $2.00/1M output
}
```
### Step 6 (optional): Add to routing tiers

If your model should be available via tier-based routing, add it to the
`TIER_MODELS` dict in `src/membrain/routing/router.py`:

```python
TIER_MODELS = {
    "fast": ["gpt-4o-mini", "claude-haiku-4-5-20251001", "myllm-fast"],
    "balanced": ["gpt-4o", "claude-sonnet-4-20250514", "myllm-standard"],
    "best": ["o1", "claude-opus-4-20250514", "myllm-pro"],
}
```
## Pricing Reference

All prices are in USD per 1 million tokens. Source:
`src/membrain/cost/pricing.py`.
### OpenAI Models

| Model | Input | Output |
|---|---|---|
| `gpt-4o` | $2.50 | $10.00 |
| `gpt-4o-mini` | $0.15 | $0.60 |
| `gpt-4-turbo` | $10.00 | $30.00 |
| `o1` | $15.00 | $60.00 |
| `o3-mini` | $1.10 | $4.40 |
### Anthropic Models

| Model | Input | Output |
|---|---|---|
| `claude-opus-4-20250514` | $15.00 | $75.00 |
| `claude-sonnet-4-20250514` | $3.00 | $15.00 |
| `claude-haiku-4-5-20251001` | $0.80 | $4.00 |
### Local / Other Models
| Provider | Cost |
|---|---|
| Ollama (any model) | Free (runs locally) |
| Claude CLI | Billed via your Claude subscription |
| LiteLLM | Varies by backend; consult LiteLLM docs |
### Cost Estimation

Use `estimate_cost()` to calculate the cost of a request:

```python
from membrain.cost.pricing import estimate_cost

cost = estimate_cost(
    model="gpt-4o",
    prompt_tokens=1000,
    completion_tokens=500,
)
# cost = (1000 * 2.50 + 500 * 10.00) / 1_000_000 = $0.0075
```
If a model is not in the pricing table, `estimate_cost()` returns `0.0`.
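Internally, the lookup presumably amounts to something like this sketch (assuming the `MODEL_PRICING` tuple layout shown in Step 5; not the verbatim implementation):

```python
def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    if model not in MODEL_PRICING:
        return 0.0  # unknown (and local) models cost nothing
    input_price, output_price = MODEL_PRICING[model]  # USD per 1M tokens
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```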