# LLM Provider Configuration
Membrain acts as a unified gateway that routes requests to multiple LLM backends. This guide covers how the provider abstraction works, how to configure each supported provider, and how intelligent routing selects the best model for each request.
## Overview
Every LLM backend in Membrain implements the `BaseProvider` abstract class
defined in `src/membrain/providers/base.py`. The interface is intentionally
small:

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    @abstractmethod
    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        ...

    async def stream_chat(self, model, messages, temperature=None, max_tokens=None, **kwargs):
        # Default: wraps chat() as a single chunk. Override for true SSE streaming.
        ...

    async def close(self):
        # Release HTTP clients, connections, etc.
        ...
```
Key design decisions:

- All providers accept and return the OpenAI-compatible format defined in
  `src/membrain/schemas.py` (`ChatCompletionRequest` / `ChatCompletionResponse`).
  Providers that talk to non-OpenAI APIs (e.g., Anthropic) translate internally.
- `stream_chat()` has a default implementation that wraps the non-streaming
  `chat()` response as a single SSE chunk. Providers with native streaming
  (OpenAI) override this for true token-by-token delivery.
- `close()` is called during shutdown to release resources (httpx clients, etc.).
Clients send standard OpenAI-format requests to Membrain's
`/v1/chat/completions` endpoint. The gateway resolves the `model` field to a
provider, forwards the request, and returns the normalized response.
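For example, with plain `httpx` (assuming the gateway runs at `localhost:8000`, as in the curl examples later in this guide):

```python
import asyncio

import httpx

async def main():
    # Standard OpenAI-format request, sent to the Membrain gateway instead
    # of directly to a provider.
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        resp = await client.post(
            "/v1/chat/completions",
            json={
                "model": "claude-sonnet-4-20250514",
                "messages": [{"role": "user", "content": "Hello"}],
            },
        )
        print(resp.json()["choices"][0]["message"]["content"])

asyncio.run(main())
```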
## Provider Setup

### Anthropic

Source: `src/membrain/providers/anthropic.py`
The Anthropic provider calls the Anthropic Messages API
directly via httpx, translating between the OpenAI format and Anthropic's
native format.
Environment variables:

| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | Yes | Your Anthropic API key |
| `UPSTREAM_ANTHROPIC_URL` | No | Override the upstream URL (default: `https://api.anthropic.com`). Useful when running behind another proxy. |
Supported models:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `claude-opus-4-20250514` | $15.00 | $75.00 |
| `claude-sonnet-4-20250514` | $3.00 | $15.00 |
| `claude-haiku-4-5-20251001` | $0.80 | $4.00 |
Request translation:

Anthropic's Messages API differs from OpenAI's format in several ways. The provider handles this automatically:

- System messages are extracted from the messages array and sent as the
  top-level `system` field (Anthropic does not accept system messages inline).
- Content blocks in the response (`[{"type": "text", "text": "..."}]`) are
  concatenated into a single `content` string.
- Token usage fields are mapped from Anthropic's `input_tokens` /
  `output_tokens` to the OpenAI-compatible `prompt_tokens` / `completion_tokens`.
- `max_tokens` defaults to 4096 if not specified in the request (Anthropic
  requires this field).
```python
# Internal translation (simplified)
# System message is extracted and promoted to top-level field:
payload = {
    "model": request.model,
    "messages": [msg for msg in request.messages if msg.role != "system"],
    "max_tokens": request.max_tokens or 4096,
}
if system_message:
    payload["system"] = system_message
```
Streaming: Uses the default `stream_chat()` wrapper (single-chunk). The
Anthropic provider does not currently implement native SSE streaming.
### OpenAI

Source: `src/membrain/providers/openai.py`

The OpenAI provider talks to the standard `/v1/chat/completions` endpoint. Since
Membrain already uses the OpenAI format internally, requests are forwarded with
minimal transformation.
Environment variable:

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
Supported models:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| `gpt-4o` | $2.50 | $10.00 |
| `gpt-4o-mini` | $0.15 | $0.60 |
| `gpt-4-turbo` | $10.00 | $30.00 |
| `o1` | $15.00 | $60.00 |
| `o3-mini` | $1.10 | $4.40 |
Streaming: The OpenAI provider implements native SSE streaming via
`stream_chat()`. It sends `"stream": True` in the request payload and yields
parsed JSON chunks as they arrive from the OpenAI SSE endpoint:

```python
async with self._client.stream("POST", "/chat/completions", json=body) as response:
    async for line in response.aiter_lines():
        if line.startswith("data: ") and line != "data: [DONE]":
            yield json.loads(line[6:])
```
Request handling: The request payload is sent as-is via
`request.model_dump(exclude_none=True)` -- no translation is needed since the
internal schema matches the OpenAI API format.
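In sketch form, the non-streaming path is essentially a pass-through (error handling omitted; an approximation, not the verbatim method):

```python
async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
    body = request.model_dump(exclude_none=True)  # already OpenAI-shaped
    resp = await self._client.post("/chat/completions", json=body)
    resp.raise_for_status()
    # The response body is already in Membrain's (OpenAI) schema.
    return ChatCompletionResponse(**resp.json())
```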
### Claude CLI

Source: `src/membrain/providers/claude_cli.py`

The Claude CLI provider shells out to the locally installed `claude` binary
(from Claude Code). It is automatically registered as a fallback when no
Anthropic API key is configured.
Prerequisites:

- The `claude` CLI must be installed and available on `$PATH`.
- No API key is required in Membrain's config (the CLI uses its own authentication).
Environment variable:

| Variable | Required | Description |
|---|---|---|
| (none) | -- | The CLI uses its own auth. Membrain sets `default_provider` and `default_model` in config. |
Config settings:

| Setting | Default | Description |
|---|---|---|
| `default_provider` | `"claude_cli"` | Default provider when no API keys are set |
| `default_model` | `"sonnet"` | Default model alias for the CLI |
Supported model aliases: `sonnet`, `opus`, `haiku`, `auto`, `default`
How it works:

- Chat messages are concatenated into a single prompt string. System messages
  are wrapped as `[System: ...]`, assistant messages as
  `[Previous assistant response: ...]`.
- The CLI is invoked as a subprocess (see the sketch after this list).
- The JSON output is parsed. The CLI returns `result`, `cost_usd`, `model`,
  `duration_ms`, etc.
- Token counts are estimated (the CLI does not report exact token usage):
  approximately `len(text) // 4`.
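A sketch of that subprocess call is below. The flag names (`-p`, `--model`, `--output-format json`) are assumptions based on the Claude Code CLI, not verbatim from `claude_cli.py`:

```python
import asyncio
import json

async def run_claude(prompt: str, model: str = "sonnet") -> dict:
    # Invoke the locally installed claude binary in non-interactive mode
    # and ask for machine-readable JSON output.
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model, "--output-format", "json",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    # Parsed fields include result, cost_usd, model, duration_ms, ...
    return json.loads(stdout)
```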
When to use it:
- Development/testing without API keys -- just have the Claude CLI installed.
- Local-first workflows where you want to leverage your existing Claude subscription without managing API keys.
- As an automatic fallback when the Anthropic API key is not configured.
### Ollama

Source: `src/membrain/providers/ollama.py`

The Ollama provider connects to a local Ollama instance for
running open-source LLMs entirely on your own hardware. Ollama exposes an
OpenAI-compatible `/v1/chat/completions` endpoint, so the implementation closely
mirrors the OpenAI provider.
Environment variable:

| Variable | Required | Description |
|---|---|---|
| `OLLAMA_URL` | No | Ollama server URL (default: `http://localhost:11434`) |
Setup:

1. Install Ollama: https://ollama.ai
2. Pull a model: `ollama pull llama3`
3. Start the server: `ollama serve`
4. Configure Membrain: set `OLLAMA_URL` if your Ollama server is not at the
   default `http://localhost:11434`.
Model naming convention:

Use the `ollama/` prefix when requesting Ollama models through Membrain. The
provider automatically strips this prefix before forwarding to the Ollama server:

```python
# Request to Membrain:
{"model": "ollama/llama3", "messages": [...]}

# Forwarded to Ollama as:
{"model": "llama3", "messages": [...]}
```
Privacy routing:

Ollama is the designated provider for privacy-sensitive requests. When a
client sends the `X-Membrain-Private: true` header, the router forces all
traffic to Ollama, ensuring no data leaves the local machine. See
Routing Configuration for details.
Pricing: Ollama models have no API cost (they run locally). They are not
included in the pricing table, and `estimate_cost()` returns $0.00 for them.
### LiteLLM

Source: `src/membrain/providers/litellm.py`

The LiteLLM provider uses the LiteLLM
library to access 100+ LLM backends through a unified interface. It is
auto-registered when the `litellm` package is detected at startup.
Installation:

Install the optional dependency with `pip install litellm`. The import is
deferred to `__init__`, so the rest of the Membrain codebase never
hard-depends on litellm. If the package is not installed, the provider is
simply not registered.
Configuration:

LiteLLM reads provider credentials from standard environment variables
(`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `COHERE_API_KEY`, etc.). Consult the
LiteLLM docs for the full list.

You can also pass an API key override directly:
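One way such an override can reach the backend: `litellm.acompletion()` accepts a per-call `api_key` argument that takes precedence over the environment variable (a sketch of the mechanism; how Membrain's provider surfaces the override may differ):

```python
import asyncio

import litellm

async def main():
    response = await litellm.acompletion(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Hello"}],
        api_key="sk-ant-...",  # placeholder; overrides ANTHROPIC_API_KEY for this call
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```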
Supported backends include:
- OpenAI, Azure OpenAI
- Anthropic
- Google (Gemini / PaLM)
- Cohere, AI21, Replicate
- Hugging Face Inference
- AWS Bedrock, Google Vertex AI
- And many more (100+ backends)
How it works:

LiteLLM uses model name prefixes to determine which backend to call. For example:

| Model string | Backend |
|---|---|
| `gpt-4o` | OpenAI |
| `claude-sonnet-4-20250514` | Anthropic |
| `gemini/gemini-pro` | Google Gemini |
| `bedrock/anthropic.claude-3-sonnet` | AWS Bedrock |
| `azure/gpt-4` | Azure OpenAI |
The provider calls `litellm.acompletion()`, which returns an OpenAI-compatible
`ModelResponse` object that is then normalized to Membrain's schema.
Fallback role:
In the provider registry, LiteLLM acts as a catch-all fallback for models that don't match any specific provider prefix. The resolution order for unknown model prefixes is: LiteLLM -> OpenAI -> Claude CLI.
## Routing Configuration

Source: `src/membrain/routing/router.py`

Membrain's `Router` class provides intelligent model selection based on quality
tier, cost constraints, privacy requirements, and provider availability. Routing
is activated when the client requests `model: "auto"`.
### Tier-Based Routing

Set the tier via the `X-Membrain-Tier` header:

| Tier | Models | Use Case |
|---|---|---|
| `fast` | `gpt-4o-mini`, `claude-haiku-4-5-20251001` | Low-latency, cost-efficient tasks |
| `balanced` | `gpt-4o`, `claude-sonnet-4-20250514` | Default -- good quality/cost balance |
| `best` | `o1`, `claude-opus-4-20250514` | Maximum quality, complex reasoning |
Example request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Tier: fast" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```
### Cost-Constrained Routing

Set a maximum cost per 1K tokens via the `X-Membrain-Max-Cost` header. The
router finds the cheapest available model that fits within the budget:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Max-Cost: 0.001" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```
The cost calculation uses the average of input and output prices:
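The ordering below can be reproduced with a few lines (the helper name here is illustrative, not the actual function in the router):

```python
# Average of input and output price (USD per 1M tokens), converted to per-1K.
def avg_cost_per_1k(input_price: float, output_price: float) -> float:
    return (input_price + output_price) / 2 / 1000

print(avg_cost_per_1k(0.15, 0.60))   # gpt-4o-mini -> 0.000375
print(avg_cost_per_1k(2.50, 10.00))  # gpt-4o      -> 0.00625
```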
With the current pricing table, the cost ordering (cheapest first) is:

| Model | Avg cost per 1K tokens |
|---|---|
| `gpt-4o-mini` | $0.000375 |
| `claude-haiku-4-5-20251001` | $0.0024 |
| `o3-mini` | $0.00275 |
| `gpt-4o` | $0.00625 |
| `claude-sonnet-4-20250514` | $0.009 |
| `gpt-4-turbo` | $0.02 |
| `o1` | $0.0375 |
| `claude-opus-4-20250514` | $0.045 |

With the `X-Membrain-Max-Cost: 0.001` example above, only `gpt-4o-mini` ($0.000375 per 1K tokens) fits the budget, so the router selects it.
### Privacy Routing

Send the `X-Membrain-Private: true` header to force all traffic to local-only
models. No data will leave your machine:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Private: true" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Sensitive data here"}]}'
```
When private mode is active:

- If Ollama is registered, the router selects `ollama/llama3`.
- If Ollama is not available, it falls back to `local/default`.
- Cloud providers (OpenAI, Anthropic) are never used.
### Fallback Chains

When `model: "auto"` is used, the router builds an ordered fallback chain from
all models in the selected tier. If the primary model's provider fails, the
gateway tries the next model in the chain.

Fallback ordering is latency-aware: models are sorted by their provider's p50
latency (fastest first). The router tracks latency samples per provider (up to
100 recent samples) and computes p50 and p99 statistics.

```python
router = Router()

# After some requests have been processed:
chain = router.resolve_with_fallback(
    requested_model="auto",
    available_providers=["openai", "anthropic"],
    tier="balanced",
)
# Returns e.g.: ["gpt-4o", "claude-sonnet-4-20250514"]
# (ordered by provider latency)
```
### Latency Tracking

The router records per-provider latency after each request:

```python
router.record_latency("openai", duration_ms=320.5)
router.record_latency("anthropic", duration_ms=450.2)

stats = router.get_latency_stats("openai")
# {"p50": 320.5, "p99": 320.5, "count": 1}
```
Latency data is kept in memory (last 100 samples per provider) and used to order fallback chains so that faster providers are tried first.
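One plausible shape for that computation, using nearest-rank percentiles (an illustration, not the router's verbatim code):

```python
def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile over the retained samples.
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct))
    return ordered[idx]

samples = [320.5]  # with one sample, p50 == p99 == 320.5, as above
print(percentile(samples, 0.50), percentile(samples, 0.99))
```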
## Model Resolution

Source: `src/membrain/providers/registry.py`

When a request arrives, `get_provider(model)` in the registry maps the model
name to a provider instance using prefix matching:

| Model prefix | Provider | Fallback |
|---|---|---|
| `ollama/` or `ollama:` | Ollama | -- |
| `gpt`, `o1`, `o3` | OpenAI | LiteLLM |
| `claude` | Anthropic | Claude CLI |
| `auto`, `default`, `sonnet`, `opus`, `haiku` | Claude CLI | Anthropic |
| (anything else) | LiteLLM | OpenAI -> Claude CLI |
The full resolution logic:

```python
def get_provider(model: str) -> BaseProvider:
    if model.startswith("ollama/") or model.startswith("ollama:"):
        return _providers.get("ollama")
    elif model.startswith("gpt") or model.startswith("o1") or model.startswith("o3"):
        return _providers.get("openai") or _providers.get("litellm")
    elif model.startswith("claude"):
        return _providers.get("anthropic") or _providers.get("claude_cli")
    elif model in ("auto", "default", "sonnet", "opus", "haiku"):
        return _providers.get("claude_cli") or _providers.get("anthropic")
    else:
        return _providers.get("litellm") or _providers.get("openai") or _providers.get("claude_cli")
```
## Provider Registration

Source: `src/membrain/providers/registry.py`

Providers are initialized at startup by `init_providers()`, which is called with
API keys and configuration from the `Settings` object:
```python
init_providers(
    openai_api_key=settings.openai_api_key,
    anthropic_api_key=settings.anthropic_api_key,
    ollama_url=settings.ollama_url,
)
```
Auto-registration rules:

| Provider | Registered when... |
|---|---|
| OpenAI | `OPENAI_API_KEY` is set and looks valid (>10 chars, not a placeholder containing `...`) |
| Anthropic | `ANTHROPIC_API_KEY` is set and looks valid |
| Claude CLI | Anthropic API key is not configured (automatic fallback) |
| Ollama | `OLLAMA_URL` is set |
| LiteLLM | The `litellm` Python package is importable |
You can check which providers are active at runtime:

```python
from membrain.providers.registry import list_providers

print(list_providers())  # e.g. ["openai", "anthropic", "ollama", "litellm"]
```
## Adding a Custom Provider

To add a new LLM backend, implement `BaseProvider` and register it in the
registry.

### Step 1: Create the provider class

Create a new file at `src/membrain/providers/my_provider.py`:
```python
import httpx

from membrain.providers.base import BaseProvider
from membrain.schemas import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatChoice,
    ChatMessage,
    Usage,
)


class MyProvider(BaseProvider):
    """Provider for the My LLM API."""

    def __init__(self, api_key: str):
        self._client = httpx.AsyncClient(
            base_url="https://api.myllm.com/v1",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=120.0,
        )

    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        # Translate request to your API's format
        payload = {
            "model": request.model,
            "messages": [
                {"role": m.role, "content": m.content}
                for m in request.messages
            ],
        }
        if request.temperature is not None:
            payload["temperature"] = request.temperature
        if request.max_tokens is not None:
            payload["max_tokens"] = request.max_tokens

        resp = await self._client.post("/chat/completions", json=payload)
        resp.raise_for_status()
        data = resp.json()

        # Normalize the response to Membrain's schema
        return ChatCompletionResponse(
            id=data["id"],
            created=data["created"],
            model=data["model"],
            choices=[
                ChatChoice(
                    index=c["index"],
                    message=ChatMessage(**c["message"]),
                    finish_reason=c.get("finish_reason"),
                )
                for c in data["choices"]
            ],
            usage=Usage(**data["usage"]),
        )

    async def close(self):
        await self._client.aclose()
```
### Step 2: Register the provider

Edit `src/membrain/providers/registry.py` to import and register your provider:

```python
from membrain.providers.my_provider import MyProvider

def init_providers(
    openai_api_key=None,
    anthropic_api_key=None,
    ollama_url=None,
    my_api_key=None,  # Add your key parameter
):
    # ... existing providers ...

    if _is_real_key(my_api_key):
        _providers["my_provider"] = MyProvider(api_key=my_api_key)
```
### Step 3: Add model resolution

In the `get_provider()` function, add prefix matching for your models, for
example as sketched below.
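A minimal branch, assuming your models share the `myllm` prefix (insert it before the final `else` in `get_provider()`):

```python
# In get_provider():
elif model.startswith("myllm"):
    return _providers.get("my_provider")
```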
### Step 4: Add configuration

Add an environment variable to `src/membrain/config.py`, for example:
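A sketch, assuming `Settings` follows the pydantic-settings convention in which a `my_api_key` field is populated from the `MY_API_KEY` environment variable:

```python
from pydantic_settings import BaseSettings  # assumption: pydantic-settings config

class Settings(BaseSettings):
    # ... existing fields ...
    my_api_key: str | None = None  # read from MY_API_KEY
```

Then pass `settings.my_api_key` through to `init_providers()` as in Step 2.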
### Step 5 (optional): Add pricing

Add your models to `src/membrain/cost/pricing.py`:

```python
MODEL_PRICING = {
    # ... existing models ...
    "myllm-standard": (1.00, 2.00),  # $1.00/1M input, $2.00/1M output
}
```
### Step 6 (optional): Add to routing tiers

If your model should be available via tier-based routing, add it to the
`TIER_MODELS` dict in `src/membrain/routing/router.py`:

```python
TIER_MODELS = {
    "fast": ["gpt-4o-mini", "claude-haiku-4-5-20251001", "myllm-fast"],
    "balanced": ["gpt-4o", "claude-sonnet-4-20250514", "myllm-standard"],
    "best": ["o1", "claude-opus-4-20250514", "myllm-pro"],
}
```
## Pricing Reference

All prices are in USD per 1 million tokens. Source:
`src/membrain/cost/pricing.py`.
### OpenAI Models

| Model | Input | Output |
|---|---|---|
| `gpt-4o` | $2.50 | $10.00 |
| `gpt-4o-mini` | $0.15 | $0.60 |
| `gpt-4-turbo` | $10.00 | $30.00 |
| `o1` | $15.00 | $60.00 |
| `o3-mini` | $1.10 | $4.40 |
### Anthropic Models

| Model | Input | Output |
|---|---|---|
| `claude-opus-4-20250514` | $15.00 | $75.00 |
| `claude-sonnet-4-20250514` | $3.00 | $15.00 |
| `claude-haiku-4-5-20251001` | $0.80 | $4.00 |
### Local / Other Models
| Provider | Cost |
|---|---|
| Ollama (any model) | Free (runs locally) |
| Claude CLI | Billed via your Claude subscription |
| LiteLLM | Varies by backend; consult LiteLLM docs |
### Cost Estimation

Use `estimate_cost()` to calculate the cost of a request:

```python
from membrain.cost.pricing import estimate_cost

cost = estimate_cost(
    model="gpt-4o",
    prompt_tokens=1000,
    completion_tokens=500,
)
# cost = (1000 * 2.50 + 500 * 10.00) / 1_000_000 = $0.0075
```
If a model is not in the pricing table, `estimate_cost()` returns `0.0`.
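Internally, the lookup presumably amounts to something like this sketch (assuming the `MODEL_PRICING` tuple layout shown in Step 5; not the verbatim implementation):

```python
def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    if model not in MODEL_PRICING:
        return 0.0  # unknown (and local) models cost nothing
    input_price, output_price = MODEL_PRICING[model]  # USD per 1M tokens
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```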