The Full Stack for Running AI on Private Data

You got the green light. The board approved an AI initiative, the budget landed, and now it’s on you to make it happen — on your own infrastructure, with your own data, without sending a single prompt to someone else’s API. Simple enough, right?

You spin up a GPU instance, load an open-weight model, and get inference running in an afternoon. It works. Then legal asks how you’re handling GDPR. Security wants to know about network isolation. Your CISO asks about key management. Ops asks what happens when the GPU node goes down at 2 AM. And suddenly “run a model” turned into “build and operate a nine-layer infrastructure stack.”

That stack is what this article is about. Not the theory — the actual architecture. What you need, why you need it, and which tools to evaluate. If you’re the person responsible for making private AI work in production, this is your checklist.

For the business case and risk framing behind each of these layers, see What It Actually Takes to Run AI on Your Own Data.

1. Inference hosting

Everything starts here. If you can’t serve model predictions reliably, nothing else matters.

The core decision is which inference engine to run. vLLM has become the default for most production deployments — it handles continuous batching, PagedAttention for efficient memory use, and supports most open-weight model architectures out of the box. Text Generation Inference (TGI) from Hugging Face is the main alternative, with tighter integration into the Hugging Face ecosystem but less flexibility on batching strategies.

For hardware, you’re looking at NVIDIA A100s (80GB) as the workhorse or H100s if you need the throughput. The model you choose determines the GPU footprint. A 70B-parameter model like Llama 3 70B needs roughly 140GB of VRAM at full precision — that’s two A100s minimum. Quantization changes the math: GPTQ or AWQ can compress a 70B model to fit on a single 80GB card with acceptable quality loss for most enterprise use cases.

Orchestration depends on your scale. Kubernetes with the NVIDIA GPU Operator handles scheduling, health checks, and multi-node inference cleanly. If latency is your primary constraint and you’re running a small number of models, bare-metal deployments avoid the container overhead entirely.

Don’t skip model selection. Open-weight models (Llama 3, Mistral, Qwen 2.5) give you full control — no data leaves your network, no usage telemetry phones home, and you can fine-tune on your domain data. API-only models (GPT-4, Claude) are faster to deploy but fundamentally incompatible with “data never leaves your building.”

Open-weight models give you full control — no data leaves your network, no usage telemetry phones home, and you can fine-tune on your domain data. API-only models are fundamentally incompatible with keeping data in your building.

2. Cloud instance management

You’ve got inference running. Now you need to keep it running without burning through your GPU budget in a month.

GPU instances are expensive. An 8xA100 node costs roughly €23 per hour on-demand — that’s €16,500 per month if you leave it running. Most enterprise AI workloads don’t need 24/7 inference at peak capacity. You need autoscaling that matches capacity to actual demand: scale up during business hours, scale down at night, and handle batch processing jobs on spot instances where interruptions are tolerable.

Infrastructure-as-code isn’t optional at this scale. Terraform or Pulumi should define every instance, network, and storage volume. When your CISO asks “can you reproduce this environment in our DR region,” the answer needs to be “yes, in 20 minutes,” not “let me check what we configured manually.”

The multi-cloud vs. single-cloud vs. on-prem decision matters more here than in traditional infrastructure. GPU availability is genuinely constrained. You might run primary inference on-prem, burst to AWS for peak demand, and keep a warm standby in GCP. That’s three sets of networking, identity, and monitoring to maintain — which is why most organizations start single-cloud and add complexity only when they hit real capacity ceilings.

Cold start mitigation is the hidden cost. Loading a 70B model into GPU memory takes 2-5 minutes depending on storage throughput. If your autoscaler spins up a new node, users wait. Pre-warming strategies — keeping a minimal replica alive, caching model weights on local NVMe — are worth the baseline cost.

3. Network protection

Your inference cluster is now a high-value target. It has access to your private data, it runs on expensive hardware, and if compromised, an attacker gets both your data and a free GPU cluster.

Start with zero-trust networking. Every service-to-service connection uses mTLS — mutual TLS, where both sides present certificates. Not just “the client authenticates to the server,” but “every hop proves its identity.” If your API gateway talks to the inference engine, both sides verify certificates before a single byte moves.

Network segmentation is the next layer. Your inference cluster should live in its own network segment, isolated from your corporate network and your production databases. The only paths in are through your API gateway. The only paths out are to your logging and monitoring stack. If someone compromises a developer’s laptop, they shouldn’t be able to reach the inference cluster at all.

Your API gateway — Traefik, Kong, or NGINX — handles rate limiting, request validation, and payload size limits. A single unbounded request can tie up a GPU for minutes. Rate limiting per user and per API key prevents both abuse and accidental denial-of-service from a runaway integration.

For hybrid deployments where on-prem connects to cloud, you need private links or site-to-site VPN tunnels. Public internet should never carry inference traffic containing private data — even encrypted, the metadata exposure isn’t worth the risk.

DDoS mitigation sits at the edge. Even if your inference cluster is internal, your API gateway has a public surface. Cloudflare, AWS Shield, or on-prem scrubbing appliances handle volumetric attacks before they reach your infrastructure.

A secure mesh cocoon surrounding a cluster of bright nodes — controlled golden pathways are the only way through the protective boundary

4. Data encryption

This is the compliance baseline. Without encryption at rest and in transit, you fail GDPR Article 32, HIPAA Security Rule, SOC 2 Type II, and ISO 27001 before you even get to the interesting questions.

Encryption in transit means TLS 1.3 on every connection — API clients to gateway, gateway to inference, inference to storage, logging pipelines, monitoring endpoints, everything. Certificate rotation should be automated. If you’re manually rotating certificates, you will forget, and you’ll discover it when production breaks.

Encryption at rest means AES-256 for all stored data — model weights, prompt logs, response caches, fine-tuning datasets. But here’s the part people miss: you also need to encrypt prompts and responses before they hit disk. If your inference engine writes a swap file or a debug log, that log contains your users’ private data in cleartext unless you’ve handled it at the application layer.

Key management is where the complexity lives. HashiCorp Vault is the most common choice for on-prem deployments — it handles key generation, rotation, and access policies. AWS KMS works if you’re single-cloud. For regulated industries (banking, healthcare, government), you may need a hardware security module (HSM) that stores keys in tamper-resistant hardware.

Per-tenant key management is worth the effort. If each customer’s data is encrypted with a separate key, a compromised key exposes one tenant, not all of them. It also makes data deletion provable — destroy the key, and the data is cryptographically irrecoverable. Your GDPR Article 17 (right to erasure) compliance just got a lot simpler.

Per-tenant encryption makes data deletion provable — destroy the key, and the data is cryptographically irrecoverable. GDPR Article 17 compliance just got a lot simpler.

5. Access control

Encryption protects data at the storage layer. Access control determines who reaches that layer in the first place.

Start with RBAC — role-based access control — with least-privilege defaults. Engineers who deploy models shouldn’t automatically see production query logs. Data scientists who fine-tune models don’t need access to the inference API keys. Define roles by function, not by seniority.

Identity integration means connecting to your existing corporate IdP via SAML or OIDC. Your employees shouldn’t need a separate set of credentials for the AI platform. Single sign-on means your existing MFA, password policies, and offboarding flows apply automatically. When someone leaves the company, their AI platform access dies with their corporate account.

API key management gets overlooked until it doesn’t. Every integration — your CRM connector, your document pipeline, your internal chatbot — needs its own API key with scoped permissions. Keys should rotate on a schedule (90 days is common). And you need a way to revoke a specific key instantly without breaking every other integration — which means key-per-integration, not a shared master key.

Audit logging is the piece that makes all of this provable. Every query to every model should log: who made the request (user identity), what model they hit, what data they sent (or a hash of it), when, and from where. These logs must be immutable — written to append-only storage that no administrator can quietly edit. When your SOC 2 auditor asks “can you show me who accessed customer data through the AI system last Tuesday,” you need to answer in minutes, not weeks.

Per-model access policies add a final layer. Your general-purpose assistant model might be available to everyone. Your model fine-tuned on financial data should only be accessible to the finance team. Policy-as-code (Open Policy Agent or Cedar) lets you define these rules declaratively and audit them as part of your CI/CD pipeline.

Translucent amber gates of varying sizes filtering golden silhouettes — selective access that knows who belongs

6. External service connectors and filters

Your AI system doesn’t live in a vacuum. It connects to email, CRM, document stores, databases, ticketing systems — and every one of those connections is a potential path for data to leak in or out.

Inbound filtering means every piece of external content passes through a safety layer before it reaches your AI system. An email integration that feeds customer messages to your AI assistant should strip PII, validate content types, and reject anything that looks like a prompt injection attempt before the model ever sees it. This isn’t a nice-to-have — one unfiltered connector is one path for an attacker to manipulate your model’s behavior.

Outbound guards are the mirror image. When your AI system generates a response that goes to an external service — an automated email reply, a CRM note, a Slack message — that output passes through a filter that checks for data leakage. Did the model accidentally include another customer’s information in its response? Did it surface internal system details that shouldn’t be externally visible? Outbound filtering catches these before they become incidents.

Behavioral analysis sits on top of both. You’re watching patterns, not just individual requests. A user who suddenly queries every customer record alphabetically isn’t doing normal work — they’re extracting data. A connector that starts receiving payloads ten times larger than usual might be under attack. Anomaly detection on query patterns, volume, and content gives you early warning before the damage is done.

The architecture that makes this work is a pluggable adapter pattern with mandatory filter middleware. Every connector — regardless of what it connects to — passes through the same inbound/outbound filter chain. New connectors inherit the security posture automatically. No shortcuts, no “we’ll add filtering later.”

Every connector passes through the same inbound/outbound filter chain. New connectors inherit the security posture automatically. No shortcuts, no deferred filtering.

7. Fraud and abuse detection

Filters catch known bad patterns. Fraud detection catches the patterns you didn’t anticipate.

Query pattern analysis is your first line. You’re looking for behaviors that don’t match legitimate use: systematic enumeration (querying every employee’s salary one by one), bulk extraction (downloading an entire knowledge base through repeated prompts), and privilege probing (testing what the model will reveal about other users’ data).

Rate limiting needs to be smarter than requests-per-second. Per-user and per-role limits prevent a compromised account from doing disproportionate damage. A customer support agent making 50 queries per hour is normal. The same agent making 5,000 queries at 3 AM is not — even if each individual request looks legitimate.

Cost anomaly detection is the financial backstop. GPU time isn’t free, and unusual consumption spikes can indicate either abuse (someone running a side business on your inference cluster) or compromise (an attacker using your GPUs for their own workloads). Alert on per-user and per-project cost deviations from baseline.

Output watermarking adds traceability. If your model’s outputs end up somewhere they shouldn’t — posted publicly, shared with competitors, used in litigation — watermarking lets you trace the output back to the specific user, session, and timestamp that generated it. Invisible watermarks embedded in the text structure don’t affect readability but survive copy-paste and light editing.

Golden sentinel orbs hovering above a river of data threads, one glowing brighter as it detects an anomalous pattern below

8. Bias detection and explainability

Running AI safely on private data isn’t just about security and uptime. You also need to prove that the system is fair, and you need to show your work.

Bias monitoring means checking model outputs for demographic disparities. If your AI-powered hiring screener recommends interviews for 80% of male candidates and 40% of female candidates with identical qualifications, you have a problem that no amount of encryption will fix. Demographic parity checks, equalized odds metrics, and regular bias audits should run on production outputs, not just test data.

Explainability is the ability to answer “why did the model say that?” SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) are the standard tools — they decompose a model’s output into the contribution of each input feature. For transformer-based language models, attention visualization can show which parts of the input the model focused on.

The EU AI Act makes this mandatory for high-risk systems — now effective December 2, 2027 for Annex III systems, after the May 2026 Digital Omnibus postponed the high-risk requirements. If your AI system influences hiring, credit, insurance, or legal decisions, you need documented risk classification, human oversight mechanisms, and explainability reporting. Article 14 requires human-in-the-loop capabilities. Article 13 requires transparency that lets users understand how the system works.

GDPR (Article 22) and CCPA both give individuals the right to an explanation when automated decisions significantly affect them. If a customer asks why your AI denied their loan application, “the model said so” isn’t a compliant answer. You need to produce a meaningful explanation that references specific input factors.

The audit trail ties it together. Every decision the model makes should be traceable: the inputs that went in, the model version that processed them, the output that came out, and the explanation that was generated. This chain of evidence is what makes compliance provable rather than aspirational.

9. SLA and reliability

If AI powers customer-facing workflows, it needs production-grade reliability. Not “best effort.” Not “usually works.” Documented SLA with measurable targets.

Start with latency targets. Inference latency varies dramatically by model size and hardware — a 7B model on an H100 can hit sub-100ms p50, while a 70B model on A100s might be 2-3 seconds p50. Set p50, p95, and p99 targets per model, and build your autoscaling around the p95. Users tolerate occasional slow responses; they don’t tolerate consistent ones.

High availability means no single point of failure. Multi-replica inference across availability zones, health checks that remove unhealthy nodes in seconds (not minutes), and automatic failover that doesn’t require human intervention. Your load balancer should know the difference between “GPU out of memory” (restart the pod) and “node hardware failure” (migrate the workload).

Your monitoring stack — Prometheus for metrics, Grafana for dashboards — needs model-specific metrics beyond standard infrastructure telemetry. Tokens per second, queue depth, batch utilization, GPU memory pressure, cache hit rates. These are the signals that predict problems before users notice them. A dropping tokens-per-second metric with stable request volume means your model is degrading — maybe a memory leak, maybe thermal throttling.

Disaster recovery for AI infrastructure has a unique wrinkle: model weights. A 70B model’s weights are ~140GB. If your primary inference cluster dies and you need to restore from backup, you’re bottlenecked by storage throughput, not compute availability. Keep model weights in object storage (S3, MinIO) with cross-region replication. Keep your entire infrastructure configuration in version control so you can rebuild the environment from scratch.

Capacity planning means load testing with realistic prompt distributions, not synthetic benchmarks. Your production traffic has long prompts and short ones, simple questions and complex chain-of-thought reasoning. A benchmark that measures throughput on uniform 100-token prompts will dramatically overestimate your real-world capacity.

What to do now

Audit your current stack against these nine layers. Most organizations have three or four covered and are hoping nobody asks about the other five.
Classify your data by sensitivity tier. Not everything needs the same level of protection, and trying to apply maximum security uniformly will blow your budget.
Run a threat model on your AI-specific attack surface. Traditional threat models miss prompt injection, model extraction, and inference-time data leakage.
Establish encryption and access control first. These are the compliance baseline — without them, you’re exposed before you even get to the interesting problems.
Build monitoring before you need it. The worst time to set up observability is during an incident.
Plan for explainability from day one. Retrofitting audit trails and bias monitoring onto a running system is painful and expensive.
Consider managed on-prem. Building and operating all nine layers in-house requires deep expertise across security, ML infrastructure, networking, and compliance. LumaVista packages the entire stack — inference hosting through SLA monitoring — and deploys it on your infrastructure. Your data never leaves your network, and you don’t need to hire nine different specialists to keep it running. Learn more about LumaVista on-premise deployment →