Meridian
Private LLM inference gateway
Meridian is a self-hosted LLM inference gateway. It sits between your application and your GPU fleet — routing requests by capability, managing priority queues, and scaling GPU instances on demand. A single Go binary with an OpenAI-compatible API. No third-party code in the data path.
Applications declare what they need — "reasoning", "fast", "long-context" — not which model to use. The gateway resolves the best available backend by capability, load, latency, and cost. Swap models or providers without changing application code.
Your models run on your GPUs: on-premise hardware or EU-headquartered cloud providers (Hetzner, OVHcloud, Scaleway, Genesis Cloud). No inference traffic touches US-jurisdiction infrastructure. No CLOUD Act exposure.
Meridian is the inference layer behind LumaVista, our AI research platform — and works equally well as a standalone gateway for any application that needs private, routed LLM inference.
Capabilities
Capability-Based Routing
Agents declare what they need — "reasoning", "fast", "long-context" — not which model to use. The gateway matches requests to the best available backend by capability, load, latency, and cost. Swap models without changing application code.
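As a rough illustration of capability-based resolution, the sketch below filters backends by a declared capability and picks the one with the lowest combined score. The `Backend` type, the `Resolve` function, and the scoring weights are all hypothetical and chosen for readability; they are not Meridian's actual types or tuning.

```go
package main

// Backend is a hypothetical description of one inference backend.
type Backend struct {
	Name         string
	Capabilities []string // e.g. "reasoning", "fast", "long-context"
	Load         float64  // fraction of capacity in use, 0.0 to 1.0
	LatencyMs    float64  // recent typical latency in milliseconds
	CostPerMTok  float64  // cost per million tokens
}

// hasCapability reports whether the backend advertises the capability.
func hasCapability(b Backend, capability string) bool {
	for _, have := range b.Capabilities {
		if have == capability {
			return true
		}
	}
	return false
}

// score combines load, latency, and cost; lower is better.
// The weights are illustrative assumptions only.
func score(b Backend) float64 {
	return 100*b.Load + 0.1*b.LatencyMs + 10*b.CostPerMTok
}

// Resolve returns the lowest-scoring backend that offers the capability.
// The boolean is false when no backend matches.
func Resolve(backends []Backend, capability string) (Backend, bool) {
	var best Backend
	found := false
	for _, b := range backends {
		if !hasCapability(b, capability) {
			continue
		}
		if !found || score(b) < score(best) {
			best, found = b, true
		}
	}
	return best, found
}
```

Because callers pass only a capability string, adding or replacing a backend changes the slice handed to `Resolve`, not the calling code.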
Three-Tier Priority Queue
Critical requests (real-time chat) get served first. Normal work (background processing) follows. Low-priority batch jobs fill remaining capacity. Weighted fair queuing with aging prevents starvation. Subscription tiers control concurrency, not priority.
GPU Fleet Auto-Scaling
Always-on baseline GPUs handle steady traffic. When demand spikes, the scaler provisions burst instances from EU cloud providers. Cooling-down instances backfill with batch work until their billing hour expires. Budget guards prevent runaway costs.
Complete Data Sovereignty
No third-party proxy, no external telemetry, no inference API that sees your prompts. Your models run on your GPUs — on-premise or at EU-headquartered providers with zero US CLOUD Act exposure. The gateway is a single Go binary you deploy and control.
GPU Fleet Dashboard
Real-time visibility into every GPU instance — utilization, temperature, throughput, cost rate, health status. Embedded admin UI with live queue depth, scaling timeline, billing breakdown, and per-tenant usage. Dynamic configuration without restarts.
Prometheus + Webhooks
Native Prometheus metrics for long-term analytics — request latency, token throughput, queue depth, GPU utilization, cost tracking. Configurable webhook alerts for Slack, PagerDuty, or any endpoint. Budget thresholds, health alerts, scaling notifications — all customizable at runtime.
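To make the metrics side concrete, here is a small sketch that renders a queue-depth gauge in the Prometheus text exposition format using only the standard library. The metric name `meridian_queue_depth` and the `tier` label are assumptions for illustration; the names Meridian actually exports may differ, and a production exporter would more likely use the official Prometheus client library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// RenderQueueDepth formats per-tier queue depths as a Prometheus gauge.
// Tier names are assumed to be plain ASCII, so Go's %q quoting matches
// the label escaping rules of the exposition format.
func RenderQueueDepth(depth map[string]int) string {
	// Sort tier names so the output is stable between scrapes.
	tiers := make([]string, 0, len(depth))
	for t := range depth {
		tiers = append(tiers, t)
	}
	sort.Strings(tiers)

	var sb strings.Builder
	sb.WriteString("# HELP meridian_queue_depth Requests waiting per priority tier.\n")
	sb.WriteString("# TYPE meridian_queue_depth gauge\n")
	for _, t := range tiers {
		fmt.Fprintf(&sb, "meridian_queue_depth{tier=%q} %d\n", t, depth[t])
	}
	return sb.String()
}
```

Serving this string from an HTTP handler on a path such as `/metrics` would let a Prometheus server scrape it.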
Technical specifications
| Language | Go |
| API compatibility | OpenAI chat/completions (streaming + non-streaming) |
| Supported engines | vLLM, SGLang, TensorRT-LLM, Ollama, any OpenAI-compatible |
| Protocol | HTTP/1.1 + SSE, gRPC (planned) |
| Deployment | Embedded Go library, standalone Docker image, managed SaaS (planned) |
| Observability | Prometheus metrics, webhook alerts, embedded dashboard |
| Scaling providers | Hetzner, OVHcloud, Scaleway, Genesis Cloud |
| Authentication | API key per tenant, mTLS between gateway and backends |
| Min. requirements | Single-core, 128 MB RAM (gateway only, excl. inference engines) |
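Since the API follows the OpenAI chat/completions shape, a client request can be built with the Go standard library alone. The sketch below assumes the gateway is reachable at a base URL you supply, that the capability name is passed in the `model` field, and that the tenant API key is sent as a Bearer token on the `/v1/chat/completions` path; all three are assumptions based on OpenAI conventions rather than documented Meridian behavior.

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// Message is one chat turn in the OpenAI chat/completions format.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// ChatRequest is the minimal request body for a chat completion.
type ChatRequest struct {
	Model    string    `json:"model"` // assumed to carry the capability name
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

// NewChatRequest builds a non-streaming chat completion request.
// baseURL and apiKey are supplied by the caller, e.g. "http://localhost:8080".
func NewChatRequest(baseURL, apiKey, capability, prompt string) (*http.Request, error) {
	body, err := json.Marshal(ChatRequest{
		Model:    capability,
		Messages: []Message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}
```

The returned request can be sent with `http.DefaultClient.Do(req)`; an existing OpenAI-compatible client pointed at the gateway's base URL would produce an equivalent request.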
Deployment modes
Embedded Library
Import as a Go module. Zero network overhead. The gateway runs in-process alongside your application.
go get lumavista.eu/meridian
Standalone Service
OpenAI-compatible API. Drop-in replacement for LiteLLM, OpenRouter, or any inference proxy. Single Docker image.
docker run meridian
Managed SaaS
We run it for you on EU infrastructure. Multi-tenant with per-key isolation. Pay per token plus platform fee.
Coming soon