LLM Provider Configuration

Membrain acts as a unified gateway that routes requests to multiple LLM backends. This guide covers how the provider abstraction works, how to configure each supported provider, and how intelligent routing selects the best model for each request.


Overview

Every LLM backend in Membrain implements the BaseProvider abstract class defined in src/membrain/providers/base.py. The interface is intentionally small:

class BaseProvider(ABC):
    @abstractmethod
    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        ...

    async def stream_chat(self, model, messages, temperature=None, max_tokens=None, **kwargs):
        # Default: wraps chat() as a single chunk. Override for true SSE streaming.
        ...

    async def close(self):
        # Release HTTP clients, connections, etc.
        ...

Key design decisions:

  • All providers accept and return the OpenAI-compatible format defined in src/membrain/schemas.py (ChatCompletionRequest / ChatCompletionResponse). Providers that talk to non-OpenAI APIs (e.g., Anthropic) translate internally.
  • stream_chat() has a default implementation that wraps the non-streaming chat() response as a single SSE chunk. Providers with native streaming (OpenAI) override this for true token-by-token delivery.
  • close() is called during shutdown to release resources (httpx clients, etc.).

Clients send standard OpenAI-format requests to Membrain's /v1/chat/completions endpoint. The gateway resolves the model field to a provider, forwards the request, and returns the normalized response.
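
For example, any OpenAI-compatible HTTP client can point at the gateway. A minimal httpx sketch, assuming Membrain is running locally on port 8000 (the same address used in the curl examples later in this guide):

import httpx

# Send a standard OpenAI-format request to Membrain; the gateway resolves the
# model name to a provider and returns a normalized response.
response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "claude-sonnet-4-20250514",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60.0,
)
print(response.json()["choices"][0]["message"]["content"])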


Provider Setup

Anthropic

Source: src/membrain/providers/anthropic.py

The Anthropic provider calls the Anthropic Messages API directly via httpx, translating between the OpenAI format and Anthropic's native format.

Environment variables:

Variable Required Description
ANTHROPIC_API_KEY Yes Your Anthropic API key
UPSTREAM_ANTHROPIC_URL No Override the upstream URL (default: https://api.anthropic.com). Useful when running behind another proxy.

Supported models:

Model Input (per 1M tokens) Output (per 1M tokens)
claude-opus-4-20250514 $15.00 $75.00
claude-sonnet-4-20250514 $3.00 $15.00
claude-haiku-4-5-20251001 $0.80 $4.00

Request translation:

Anthropic's Messages API differs from OpenAI's format in several ways. The provider handles this automatically:

  1. System messages are extracted from the messages array and sent as the top-level system field (Anthropic does not accept system messages inline).
  2. Content blocks in the response ([{"type": "text", "text": "..."}]) are concatenated into a single content string.
  3. Token usage fields are mapped from Anthropic's input_tokens / output_tokens to the OpenAI-compatible prompt_tokens / completion_tokens.
  4. Default max_tokens is set to 4096 if not specified in the request (Anthropic requires this field).

# Internal translation (simplified)
# System message is extracted and promoted to top-level field:
payload = {
    "model": request.model,
    "messages": [msg for msg in request.messages if msg.role != "system"],
    "max_tokens": request.max_tokens or 4096,
}
if system_message:
    payload["system"] = system_message

Streaming: Uses the default stream_chat() wrapper (single-chunk). The Anthropic provider does not currently implement native SSE streaming.


OpenAI

Source: src/membrain/providers/openai.py

The OpenAI provider talks to the standard /v1/chat/completions endpoint. Since Membrain already uses the OpenAI format internally, requests are forwarded with minimal transformation.

Environment variable:

Variable Required Description
OPENAI_API_KEY Yes Your OpenAI API key

Supported models:

Model Input (per 1M tokens) Output (per 1M tokens)
gpt-4o $2.50 $10.00
gpt-4o-mini $0.15 $0.60
gpt-4-turbo $10.00 $30.00
o1 $15.00 $60.00
o3-mini $1.10 $4.40

Streaming: The OpenAI provider implements native SSE streaming via stream_chat(). It sends "stream": True in the request payload and yields parsed JSON chunks as they arrive from the OpenAI SSE endpoint:

async with self._client.stream("POST", "/chat/completions", json=body) as response:
    async for line in response.aiter_lines():
        if line.startswith("data: ") and line != "data: [DONE]":
            yield json.loads(line[6:])

Request handling: The request payload is sent as-is via request.model_dump(exclude_none=True) -- no translation is needed since the internal schema matches the OpenAI API format.
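
A simplified sketch of that path (the surrounding class and method layout here is assumed, not copied from the source):

# Simplified sketch of the OpenAI provider's chat() (illustrative)
body = request.model_dump(exclude_none=True)
resp = await self._client.post("/chat/completions", json=body)
resp.raise_for_status()
# The response already matches Membrain's schema, so it validates directly:
return ChatCompletionResponse(**resp.json())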


Claude CLI

Source: src/membrain/providers/claude_cli.py

The Claude CLI provider shells out to the locally installed claude binary (from Claude Code). It is automatically registered as a fallback when no Anthropic API key is configured.

Prerequisites:

  • The claude CLI must be installed and available on $PATH.
  • No API key is required in Membrain's config (the CLI uses its own authentication).

Environment variable:

Variable Required Description
(none) -- The CLI uses its own auth. Membrain sets default_provider and default_model in config.

Config settings:

Setting Default Description
default_provider "claude_cli" Default provider when no API keys are set
default_model "sonnet" Default model alias for the CLI

Supported model aliases: sonnet, opus, haiku, auto, default

How it works:

  1. Chat messages are concatenated into a single prompt string. System messages are wrapped as [System: ...], assistant messages as [Previous assistant response: ...].
  2. The CLI is invoked as a subprocess:
    claude -p "<prompt>" --output-format json --model <model> --no-session-persistence
    
  3. The JSON output is parsed. The CLI returns result, cost_usd, model, duration_ms, etc.
  4. Token counts are estimated (the CLI does not report exact token usage): approximately len(text) // 4.
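
A minimal sketch of steps 1-4, assuming Python's asyncio subprocess API (hypothetical helper, not the exact source):

import asyncio
import json

async def run_claude_cli(prompt: str, model: str = "sonnet") -> dict:
    # Step 2: invoke the CLI as a subprocess
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt,
        "--output-format", "json",
        "--model", model,
        "--no-session-persistence",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()

    # Step 3: parse the JSON output (result, cost_usd, model, duration_ms, ...)
    data = json.loads(stdout)

    # Step 4: estimate token counts (the CLI does not report exact usage)
    data["estimated_tokens"] = len(data["result"]) // 4
    return data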

When to use it:

  • Development/testing without API keys -- just have the Claude CLI installed.
  • Local-first workflows where you want to leverage your existing Claude subscription without managing API keys.
  • As an automatic fallback when the Anthropic API key is not configured.

Ollama

Source: src/membrain/providers/ollama.py

The Ollama provider connects to a local Ollama instance for running open-source LLMs entirely on your own hardware. Ollama exposes an OpenAI-compatible /v1/chat/completions endpoint, so the implementation closely mirrors the OpenAI provider.

Environment variable:

Variable Required Description
OLLAMA_URL No Ollama server URL (default: http://localhost:11434)

Setup:

  1. Install Ollama: https://ollama.ai
  2. Pull a model: ollama pull llama3
  3. Start the server: ollama serve
  4. Configure Membrain:
    export OLLAMA_URL=http://localhost:11434
    

Model naming convention:

Use the ollama/ prefix when requesting Ollama models through Membrain. The provider automatically strips this prefix before forwarding to the Ollama server:

# Request to Membrain:
{"model": "ollama/llama3", "messages": [...]}

# Forwarded to Ollama as:
{"model": "llama3", "messages": [...]}

Privacy routing:

Ollama is the designated provider for privacy-sensitive requests. When a client sends the X-Membrain-Private: true header, the router forces all traffic to Ollama, ensuring no data leaves the local machine. See Routing Configuration for details.

Pricing: Ollama models have no API cost (they run locally). They are not included in the pricing table and estimate_cost() returns $0.00 for them.


LiteLLM

Source: src/membrain/providers/litellm.py

The LiteLLM provider uses the LiteLLM library to access 100+ LLM backends through a unified interface. It is auto-registered when the litellm package is detected at startup.

Installation:

pip install membrain[litellm]

The import is deferred to __init__, so the rest of the Membrain codebase never hard-depends on litellm. If the package is not installed, the provider is simply not registered.

Configuration:

LiteLLM reads provider credentials from standard environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, COHERE_API_KEY, etc.). Consult the LiteLLM docs for the full list.

You can also pass an API key override directly:

LiteLLMProvider(api_key="sk-...")

Supported backends include:

  • OpenAI, Azure OpenAI
  • Anthropic
  • Google (Gemini / PaLM)
  • Cohere, AI21, Replicate
  • Hugging Face Inference
  • AWS Bedrock, Google Vertex AI
  • And many more (100+ backends)

How it works:

LiteLLM uses model name prefixes to determine which backend to call. For example:

Model string Backend
gpt-4o OpenAI
claude-sonnet-4-20250514 Anthropic
gemini/gemini-pro Google
bedrock/anthropic.claude-3-sonnet AWS Bedrock
azure/gpt-4 Azure OpenAI

The provider calls litellm.acompletion(), which returns an OpenAI-compatible ModelResponse object that is then normalized to Membrain's schema.
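
A simplified sketch of that call path (the surrounding method structure is assumed, and the sketch assumes a pydantic-style ModelResponse; not copied from the source):

import litellm

async def chat_via_litellm(request):
    response = await litellm.acompletion(
        model=request.model,  # the prefix (e.g. "gemini/", "bedrock/") selects the backend
        messages=[m.model_dump() for m in request.messages],
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )
    # ModelResponse is already OpenAI-shaped; the exact normalization differs in the source
    return ChatCompletionResponse(**response.model_dump())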

Fallback role:

In the provider registry, LiteLLM acts as a catch-all fallback for models that don't match any specific provider prefix. The resolution order for unknown model prefixes is: LiteLLM -> OpenAI -> Claude CLI.


Routing Configuration

Source: src/membrain/routing/router.py

Membrain's Router class provides intelligent model selection based on quality tier, cost constraints, privacy requirements, and provider availability. Routing is activated when the client requests model: "auto".

Tier-Based Routing

Set the tier via the X-Membrain-Tier header:

Tier Models Use Case
fast gpt-4o-mini, claude-haiku-4-5-20251001 Low-latency, cost-efficient tasks
balanced gpt-4o, claude-sonnet-4-20250514 Default -- good quality/cost balance
best o1, claude-opus-4-20250514 Maximum quality, complex reasoning

Example request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Membrain-Tier: fast" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

Cost-Constrained Routing

Set a maximum cost per 1K tokens via the X-Membrain-Max-Cost header. The router finds the cheapest available model that fits within the budget:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "X-Membrain-Max-Cost: 0.001" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

The cost calculation uses the average of input and output prices:

avg_cost_1k = (input_price * 500 + output_price * 500) / 1_000_000
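
For example, gpt-4o-mini ($0.15 input / $0.60 output per 1M tokens) works out to:

avg_cost_1k = (0.15 * 500 + 0.60 * 500) / 1_000_000 = $0.000375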

With the current pricing table, the cost ordering (cheapest first) is:

Model Avg cost per 1K tokens
gpt-4o-mini $0.000375
claude-haiku-4-5-20251001 $0.0024
o3-mini $0.00275
gpt-4o $0.00625
claude-sonnet-4-20250514 $0.009
gpt-4-turbo $0.02
o1 $0.0375
claude-opus-4-20250514 $0.045

Privacy Routing

Send the X-Membrain-Private: true header to force all traffic to local-only models. No data will leave your machine:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "X-Membrain-Private: true" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Sensitive data here"}]}'

When private mode is active:

  • If Ollama is registered, the router selects ollama/llama3.
  • If Ollama is not available, it falls back to local/default.
  • Cloud providers (OpenAI, Anthropic) are never used.

Fallback Chains

When model: "auto" is used, the router builds an ordered fallback chain from all models in the selected tier. If the primary model's provider fails, the gateway tries the next model in the chain.

Fallback ordering is latency-aware: models are sorted by their provider's p50 latency (fastest first). The router tracks latency samples per provider (up to 100 recent samples) and computes p50 and p99 statistics.

router = Router()

# After some requests have been processed:
chain = router.resolve_with_fallback(
    requested_model="auto",
    available_providers=["openai", "anthropic"],
    tier="balanced",
)
# Returns e.g.: ["gpt-4o", "claude-sonnet-4-20250514"]
# (ordered by provider latency)

Latency Tracking

The router records per-provider latency after each request:

router.record_latency("openai", duration_ms=320.5)
router.record_latency("anthropic", duration_ms=450.2)

stats = router.get_latency_stats("openai")
# {"p50": 320.5, "p99": 320.5, "count": 1}

Latency data is kept in memory (last 100 samples per provider) and used to order fallback chains so that faster providers are tried first.
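
A minimal sketch of such a tracker, assuming a deque of recent samples per provider (illustrative, not the exact Router implementation):

from collections import defaultdict, deque

class LatencyTracker:
    def __init__(self, max_samples: int = 100):
        # Keep only the most recent samples per provider
        self._samples: dict[str, deque[float]] = defaultdict(
            lambda: deque(maxlen=max_samples)
        )

    def record(self, provider: str, duration_ms: float) -> None:
        self._samples[provider].append(duration_ms)

    def stats(self, provider: str) -> dict:
        samples = sorted(self._samples[provider])
        if not samples:
            return {"p50": None, "p99": None, "count": 0}
        p50 = samples[int(0.50 * (len(samples) - 1))]
        p99 = samples[int(0.99 * (len(samples) - 1))]
        return {"p50": p50, "p99": p99, "count": len(samples)}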


Model Resolution

Source: src/membrain/providers/registry.py

When a request arrives, get_provider(model) in the registry maps the model name to a provider instance using prefix matching:

Model prefix Provider Fallback
ollama/ or ollama: Ollama --
gpt, o1, o3 OpenAI LiteLLM
claude Anthropic Claude CLI
auto, default, sonnet, opus, haiku Claude CLI Anthropic
(anything else) LiteLLM OpenAI -> Claude CLI

The full resolution logic:

def get_provider(model: str) -> BaseProvider:
    if model.startswith("ollama/") or model.startswith("ollama:"):
        return _providers.get("ollama")
    elif model.startswith("gpt") or model.startswith("o1") or model.startswith("o3"):
        return _providers.get("openai") or _providers.get("litellm")
    elif model.startswith("claude"):
        return _providers.get("anthropic") or _providers.get("claude_cli")
    elif model in ("auto", "default", "sonnet", "opus", "haiku"):
        return _providers.get("claude_cli") or _providers.get("anthropic")
    else:
        return _providers.get("litellm") or _providers.get("openai") or _providers.get("claude_cli")

Provider Registration

Source: src/membrain/providers/registry.py

Providers are initialized at startup by init_providers(), which is called with API keys and configuration from the Settings object:

init_providers(
    openai_api_key=settings.openai_api_key,
    anthropic_api_key=settings.anthropic_api_key,
    ollama_url=settings.ollama_url,
)

Auto-registration rules:

Provider Registered when...
OpenAI OPENAI_API_KEY is set and looks valid (>10 chars, no ...)
Anthropic ANTHROPIC_API_KEY is set and looks valid
Claude CLI Anthropic API key is not configured (automatic fallback)
Ollama OLLAMA_URL is set
LiteLLM The litellm Python package is importable
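
The "looks valid" check corresponds roughly to the following (a sketch based on the rules above; the actual _is_real_key() helper may differ):

def _is_real_key(key: str | None) -> bool:
    # A key is usable if it is set, longer than 10 characters,
    # and not an obvious placeholder containing "..."
    return bool(key) and len(key) > 10 and "..." not in key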

You can check which providers are active at runtime:

from membrain.providers.registry import list_providers
print(list_providers())  # e.g. ["openai", "anthropic", "ollama", "litellm"]

Adding a Custom Provider

To add a new LLM backend, implement BaseProvider and register it in the registry.

Step 1: Create the provider class

Create a new file at src/membrain/providers/my_provider.py:

import httpx
from membrain.providers.base import BaseProvider
from membrain.schemas import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatChoice,
    ChatMessage,
    Usage,
)


class MyProvider(BaseProvider):
    """Provider for the My LLM API."""

    def __init__(self, api_key: str):
        self._client = httpx.AsyncClient(
            base_url="https://api.myllm.com/v1",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=120.0,
        )

    async def chat(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        # Translate request to your API's format
        payload = {
            "model": request.model,
            "messages": [
                {"role": m.role, "content": m.content}
                for m in request.messages
            ],
        }
        if request.temperature is not None:
            payload["temperature"] = request.temperature
        if request.max_tokens is not None:
            payload["max_tokens"] = request.max_tokens

        resp = await self._client.post("/chat/completions", json=payload)
        resp.raise_for_status()
        data = resp.json()

        # Normalize the response to Membrain's schema
        return ChatCompletionResponse(
            id=data["id"],
            created=data["created"],
            model=data["model"],
            choices=[
                ChatChoice(
                    index=c["index"],
                    message=ChatMessage(**c["message"]),
                    finish_reason=c.get("finish_reason"),
                )
                for c in data["choices"]
            ],
            usage=Usage(**data["usage"]),
        )

    async def close(self):
        await self._client.aclose()
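
If your backend exposes an SSE endpoint, you can also override stream_chat() rather than relying on the single-chunk default. A minimal sketch modeled on the OpenAI provider snippet earlier in this guide (hypothetical; requires import json and must be adapted to your API's event format):

    async def stream_chat(self, model, messages, temperature=None, max_tokens=None, **kwargs):
        # Optional override: yield parsed SSE chunks instead of a single wrapped response
        body = {"model": model, "messages": messages, "stream": True}
        if temperature is not None:
            body["temperature"] = temperature
        if max_tokens is not None:
            body["max_tokens"] = max_tokens
        async with self._client.stream("POST", "/chat/completions", json=body) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: ") and line != "data: [DONE]":
                    yield json.loads(line[6:])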

Step 2: Register the provider

Edit src/membrain/providers/registry.py to import and register your provider:

from membrain.providers.my_provider import MyProvider

def init_providers(
    openai_api_key=None,
    anthropic_api_key=None,
    ollama_url=None,
    my_api_key=None,      # Add your key parameter
):
    # ... existing providers ...

    if _is_real_key(my_api_key):
        _providers["my_provider"] = MyProvider(api_key=my_api_key)

Step 3: Add model resolution

In the get_provider() function, add prefix matching for your models:

elif model.startswith("myllm"):
    return _providers.get("my_provider")

Step 4: Add configuration

Add an environment variable to src/membrain/config.py:

class Settings(BaseSettings):
    # ... existing fields ...
    my_api_key: str | None = None

Step 5 (optional): Add pricing

Add your models to src/membrain/cost/pricing.py:

MODEL_PRICING = {
    # ... existing models ...
    "myllm-standard": (1.00, 2.00),  # $1.00/1M input, $2.00/1M output
}

Step 6 (optional): Add to routing tiers

If your model should be available via tier-based routing, add it to the TIER_MODELS dict in src/membrain/routing/router.py:

TIER_MODELS = {
    "fast": ["gpt-4o-mini", "claude-haiku-4-5-20251001", "myllm-fast"],
    "balanced": ["gpt-4o", "claude-sonnet-4-20250514", "myllm-standard"],
    "best": ["o1", "claude-opus-4-20250514", "myllm-pro"],
}

Pricing Reference

All prices are in USD per 1 million tokens. Source: src/membrain/cost/pricing.py.

OpenAI Models

Model Input Output
gpt-4o $2.50 $10.00
gpt-4o-mini $0.15 $0.60
gpt-4-turbo $10.00 $30.00
o1 $15.00 $60.00
o3-mini $1.10 $4.40

Anthropic Models

Model Input Output
claude-opus-4-20250514 $15.00 $75.00
claude-sonnet-4-20250514 $3.00 $15.00
claude-haiku-4-5-20251001 $0.80 $4.00

Local / Other Models

Provider Cost
Ollama (any model) Free (runs locally)
Claude CLI Billed via your Claude subscription
LiteLLM Varies by backend; consult LiteLLM docs

Cost Estimation

Use estimate_cost() to calculate the cost of a request:

from membrain.cost.pricing import estimate_cost

cost = estimate_cost(
    model="gpt-4o",
    prompt_tokens=1000,
    completion_tokens=500,
)
# cost = (1000 * 2.50 + 500 * 10.00) / 1_000_000 = $0.0075

If a model is not in the pricing table, estimate_cost() returns 0.0.
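
Internally, the lookup is roughly the following sketch (simplified; assumes the (input, output) per-1M-token tuples in the MODEL_PRICING dict shown earlier, not the exact source):

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models (e.g. local Ollama models) cost nothing
    if model not in MODEL_PRICING:
        return 0.0
    input_price, output_price = MODEL_PRICING[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000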