Coming Soon EU Sovereign

Meridian

Private LLM inference gateway

Meridian is a self-hosted LLM inference gateway. It sits between your application and your GPU fleet — routing requests by capability, managing priority queues, and scaling GPU instances on demand. A single Go binary with an OpenAI-compatible API. No third-party code in the data path.

Applications declare what they need — "reasoning", "fast", "long-context" — not which model to use. The gateway resolves the best available backend by capability, load, latency, and cost. Swap models or providers without changing application code.

Your models run on your GPUs: on-premise hardware or EU-headquartered cloud providers (Hetzner, OVHcloud, Scaleway, Genesis Cloud). No inference traffic touches US-jurisdiction infrastructure. No CLOUD Act exposure.

Meridian is the inference layer behind LumaVista, our AI research platform — and works equally well as a standalone gateway for any application that needs private, routed LLM inference.

Capabilities

Capability-Based Routing

Applications declare what they need ("reasoning", "fast", "long-context"), not which model to use. The gateway matches each request to the best available backend by capability, load, latency, and cost. Swap models without changing application code.
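
To make the idea concrete, here is a minimal sketch of capability-based resolution in Go. It is not Meridian's implementation: the Backend and Weights types, the field names, and the scoring formula are all assumptions chosen for illustration. It filters backends by health and declared capability, then picks the one with the lowest combined load, latency, and cost penalty.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend describes one inference backend. All fields are hypothetical.
type Backend struct {
	Name         string
	Capabilities map[string]bool
	Healthy      bool
	Load         float64 // 0.0 (idle) to 1.0 (saturated)
	LatencyMs    float64 // recent median latency in milliseconds
	CostPerMTok  float64 // cost per million tokens
}

// Weights controls how much each signal contributes to the penalty score.
type Weights struct{ Load, Latency, Cost float64 }

// ErrNoBackend is returned when no healthy backend offers the capability.
var ErrNoBackend = errors.New("no healthy backend offers the capability")

// score returns a penalty for a backend; lower is better.
// The linear formula is an assumption, not a documented algorithm.
func score(b Backend, w Weights) float64 {
	return w.Load*b.Load + w.Latency*b.LatencyMs/1000 + w.Cost*b.CostPerMTok
}

// Resolve picks the healthy backend that declares the requested
// capability and has the lowest penalty score.
func Resolve(capability string, backends []Backend, w Weights) (Backend, error) {
	var best Backend
	found := false
	for _, b := range backends {
		if !b.Healthy || !b.Capabilities[capability] {
			continue
		}
		if !found || score(b, w) < score(best, w) {
			best, found = b, true
		}
	}
	if !found {
		return Backend{}, ErrNoBackend
	}
	return best, nil
}

func main() {
	backends := []Backend{
		{Name: "large-a", Capabilities: map[string]bool{"reasoning": true, "long-context": true},
			Healthy: true, Load: 0.9, LatencyMs: 800, CostPerMTok: 2.0},
		{Name: "large-b", Capabilities: map[string]bool{"reasoning": true},
			Healthy: true, Load: 0.2, LatencyMs: 600, CostPerMTok: 2.0},
		{Name: "small-a", Capabilities: map[string]bool{"fast": true},
			Healthy: true, Load: 0.1, LatencyMs: 80, CostPerMTok: 0.2},
	}
	w := Weights{Load: 1, Latency: 1, Cost: 0.1}
	for _, c := range []string{"reasoning", "fast", "vision"} {
		b, err := Resolve(c, backends, w)
		if err != nil {
			fmt.Printf("%s: %v\n", c, err)
			continue
		}
		fmt.Printf("%s -> %s\n", c, b.Name)
	}
}
```

Because the application only names a capability, replacing "large-b" with a different model is a change to the backend list, not to the caller.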

Three-Tier Priority Queue

Critical requests (real-time chat) get served first. Normal work (background processing) follows. Low-priority batch jobs fill remaining capacity. Weighted fair queuing with aging prevents starvation. Subscription tiers control concurrency, not priority.
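
The effect of aging can be illustrated with a small Go sketch. This is a simplified aging-based priority pick, not a full weighted fair queuing implementation and not Meridian's code: the tier weights, the aging rate, and the use of logical ticks instead of wall-clock time are assumptions. Each tier keeps a FIFO, and the head with the highest base weight plus waiting bonus is served next, so a long-waiting batch job eventually overtakes fresh critical traffic.

```go
package main

import "fmt"

// Tier is the priority class of a request.
type Tier int

const (
	Critical Tier = iota
	Normal
	Low
)

// baseWeight is the starting priority per tier (assumed values).
var baseWeight = map[Tier]float64{Critical: 100, Normal: 10, Low: 1}

// agingPerTick is the priority a request gains per tick spent waiting (assumed).
const agingPerTick = 1.0

// Request is a queued unit of work. Enqueued is a logical tick.
type Request struct {
	ID       string
	Tier     Tier
	Enqueued int
}

// Queue holds one FIFO per tier.
type Queue struct {
	tiers [3][]Request
}

// Push appends a request to the FIFO of its tier.
func (q *Queue) Push(r Request) { q.tiers[r.Tier] = append(q.tiers[r.Tier], r) }

// Pop removes and returns the tier head with the highest effective
// priority (base weight plus aging bonus) at logical time now.
func (q *Queue) Pop(now int) (Request, bool) {
	bestTier, bestScore, found := Critical, 0.0, false
	for t := Critical; t <= Low; t++ {
		if len(q.tiers[t]) == 0 {
			continue
		}
		head := q.tiers[t][0]
		s := baseWeight[t] + agingPerTick*float64(now-head.Enqueued)
		if !found || s > bestScore {
			bestTier, bestScore, found = t, s, true
		}
	}
	if !found {
		return Request{}, false
	}
	r := q.tiers[bestTier][0]
	q.tiers[bestTier] = q.tiers[bestTier][1:]
	return r, true
}

func main() {
	q := &Queue{}
	q.Push(Request{ID: "batch", Tier: Low, Enqueued: 0})
	q.Push(Request{ID: "chat-1", Tier: Critical, Enqueued: 50})
	q.Push(Request{ID: "chat-2", Tier: Critical, Enqueued: 150})

	r, _ := q.Pop(60) // batch scores 61, chat-1 scores 110
	fmt.Println(r.ID) // chat-1
	r, _ = q.Pop(200) // batch scores 201, chat-2 scores 150
	fmt.Println(r.ID) // batch
}
```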

GPU Fleet Auto-Scaling

Always-on baseline GPUs handle steady traffic. When demand spikes, the scaler provisions burst instances from EU cloud providers. Cooling-down instances backfill with batch work until their billing hour expires. Budget guards prevent runaway costs.
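
A rough sketch of the two decisions described above, written in Go under stated assumptions: the Fleet and Budget types, the per-GPU capacity model, hourly billing, and the two-minute drain margin are all hypothetical and are not taken from Meridian's scaler. BurstToAdd sizes a burst to cover excess demand but never beyond what the remaining budget can pay for one billing hour; KeepForBackfill keeps a cooling-down instance available for batch work until shortly before its paid hour ends.

```go
package main

import (
	"fmt"
	"time"
)

// Fleet is a simplified view of current GPU capacity (hypothetical fields).
type Fleet struct {
	BaselineGPUs   int
	BurstGPUs      int
	PerGPUCapacity int     // concurrent requests one GPU can serve
	BurstHourlyEUR float64 // hourly price of one burst instance
}

// Budget tracks spend against a limit for the current period.
type Budget struct {
	LimitEUR float64
	SpentEUR float64
}

// drainMargin is how long before the billing hour ends an instance
// stops taking backfill work (assumed value).
const drainMargin = 2 * time.Minute

// BurstToAdd returns how many burst instances to provision for the given
// demand, capped by what the remaining budget covers for one billing hour.
func BurstToAdd(f Fleet, demand int, b Budget) int {
	capacity := (f.BaselineGPUs + f.BurstGPUs) * f.PerGPUCapacity
	if f.PerGPUCapacity <= 0 || demand <= capacity {
		return 0
	}
	// Ceiling division: instances needed to absorb the excess demand.
	needed := (demand - capacity + f.PerGPUCapacity - 1) / f.PerGPUCapacity
	remaining := b.LimitEUR - b.SpentEUR
	if remaining <= 0 || f.BurstHourlyEUR <= 0 {
		return 0 // budget guard: nothing left to spend, or price unknown
	}
	affordable := int(remaining / f.BurstHourlyEUR)
	if needed > affordable {
		needed = affordable
	}
	return needed
}

// KeepForBackfill reports whether a cooling-down instance should stay up
// for batch work, that is, whether its paid billing hour has meaningful
// time left.
func KeepForBackfill(startedAt, now time.Time) bool {
	elapsed := now.Sub(startedAt)
	hourEnd := elapsed.Truncate(time.Hour) + time.Hour
	return hourEnd-elapsed > drainMargin
}

func main() {
	f := Fleet{BaselineGPUs: 2, PerGPUCapacity: 8, BurstHourlyEUR: 3}
	fmt.Println(BurstToAdd(f, 40, Budget{LimitEUR: 100, SpentEUR: 0}))  // 3
	fmt.Println(BurstToAdd(f, 40, Budget{LimitEUR: 100, SpentEUR: 95})) // 1

	start := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	fmt.Println(KeepForBackfill(start, start.Add(70*time.Minute)))  // true
	fmt.Println(KeepForBackfill(start, start.Add(119*time.Minute))) // false
}
```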

Complete Data Sovereignty

No third-party proxy, no external telemetry, no inference API that sees your prompts. Your models run on your GPUs — on-premise or at EU-headquartered providers with zero US CLOUD Act exposure. The gateway is a single Go binary you deploy and control.

GPU Fleet Dashboard

Real-time visibility into every GPU instance — utilization, temperature, throughput, cost rate, health status. Embedded admin UI with live queue depth, scaling timeline, billing breakdown, and per-tenant usage. Dynamic configuration without restarts.

Prometheus + Webhooks

Native Prometheus metrics for long-term analytics — request latency, token throughput, queue depth, GPU utilization, cost tracking. Configurable webhook alerts for Slack, PagerDuty, or any endpoint. Budget thresholds, health alerts, scaling notifications — all customizable at runtime.

Technical specifications

Language: Go
API compatibility: OpenAI chat/completions (streaming + non-streaming)
Supported engines: vLLM, SGLang, TensorRT-LLM, Ollama, any OpenAI-compatible
Protocol: HTTP/1.1 + SSE, gRPC (planned)
Deployment: Embedded Go library, standalone Docker image, managed SaaS (planned)
Observability: Prometheus metrics, webhook alerts, embedded dashboard
Scaling providers: Hetzner, OVHcloud, Scaleway, Genesis Cloud
Authentication: API key per tenant, mTLS between gateway and backends
Min. requirements: Single-core, 128 MB RAM (gateway only, excl. inference engines)

Deployment modes

Embedded Library

Import as a Go module. The gateway runs in-process alongside your application, so there is no network hop between the two.

go get lumavista.eu/meridian

Standalone Service

OpenAI-compatible API. Drop-in replacement for LiteLLM, OpenRouter, or any inference proxy. Single Docker image.

docker run meridian
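
As an illustration of what calling the standalone service could look like from Go, the sketch below builds an OpenAI-style chat/completions request using only the standard library. Several details are assumptions rather than documented behavior: the listen address http://localhost:8080, the /v1/chat/completions path and Bearer header (both borrowed from the OpenAI convention), and above all the idea that the capability name travels in the "model" field.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Message is one chat turn in the OpenAI chat/completions format.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// ChatRequest is a minimal chat/completions request body.
type ChatRequest struct {
	Model    string    `json:"model"` // assumption: carries the capability name
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

// NewChatRequest builds a POST against an OpenAI-style endpoint.
// baseURL and the path suffix are assumptions about the deployment.
func NewChatRequest(baseURL, apiKey, capability, prompt string) (*http.Request, error) {
	body, err := json.Marshal(ChatRequest{
		Model:    capability,
		Messages: []Message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey) // per-tenant API key
	return req, nil
}

func main() {
	req, err := NewChatRequest("http://localhost:8080", "example-tenant-key",
		"reasoning", "Summarize this report.")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running gateway:
	//   resp, err := http.DefaultClient.Do(req)
}
```

The request is only constructed here, not sent, so the sketch runs without a gateway present.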

Managed SaaS

We run it for you on EU infrastructure. Multi-tenant with per-key isolation. Pay per token plus a platform fee.

Coming soon
