Skip to content
· 10 min read

How to Build a Research Agent Stack with Open-Source Models

By LumaVista Team

You wouldn’t hire one person to do strategy, research, analysis, writing, and fact-checking. You’d build a team. Each person would have different strengths, different costs, and a different role. Some need to think deeply. Others need to move fast.

AI research agents work the same way. The model that’s brilliant at planning a research strategy is terrible at quickly evaluating search results. The model that writes gorgeous prose burns through tokens at 10x the cost when you use it for simple reranking. And the model with a 10-million-token context window? It’s overkill for generating search queries.

If you’re building a self-hosted research pipeline — whether for legal analysis, market research, or scientific literature review — the question isn’t “which model should I use?” It’s “which models, plural, and where?”

Using one model for everything is like hiring one person to do strategy, research, analysis, writing, and fact-checking. You need a team of specialists, not a single generalist.

Six roles, six different needs

A typical research agent pipeline has distinct stages, and each stage has radically different requirements. Here’s what actually matters for each one:

The Planner breaks a research question into subtasks, decides what to search for, and coordinates the whole operation. It needs strong instruction following, structured output, and the ability to decompose complex goals. Speed doesn’t matter much — the planner runs once at the start. Quality matters enormously, because a bad plan wastes every downstream step.

The Searcher generates queries, evaluates whether results are relevant, and decides when to dig deeper. It needs to be fast — really fast. In a typical research run, the searcher makes dozens of calls. Latency here directly multiplies into total pipeline time. It doesn’t need frontier reasoning; it needs good judgment at speed.

The Reasoner does the hard thinking. Given a pile of search results and context, it analyzes, compares, finds contradictions, and draws conclusions. This is where you want your heaviest model — the one with the highest scores on graduate-level science and math benchmarks. It runs fewer times but each call matters more.

The Writer synthesizes everything into a coherent report. It needs fluent prose, good structure, and the ability to handle citations without hallucinating sources. Reasoning models are actually bad at this — they produce verbose chain-of-thought output instead of clean paragraphs.

The Aggregator combines results from multiple parallel research threads, deduplicates findings, and resolves conflicts. Context window size is critical here — it needs to hold all the intermediate results simultaneously. A 128K context model might work for small jobs; bigger pipelines need 256K or more.

The Reranker scores relevance — is this search result actually useful for answering this question? It needs to be tiny, fast, and cheap. You’re calling it hundreds of times per research run. A 14B model is plenty.

Six distinct roles arranged in a flowing pipeline — small fast nodes on the edges, large glowing cores in the center, connected by warm golden threads against a dark background

The model picks

Here’s where it gets concrete. For each role, these are the open-weight models that actually work well — tested, benchmarked, and available for self-hosting today.

Planner: Qwen 3.5 or Mistral Large 3

Your planner needs to be smart, not fast. Two models stand out:

Qwen 3.5 (397B total, 17B active) has a hybrid architecture that toggles between thinking and non-thinking modes. When it needs to deeply decompose a research question, it engages chain-of-thought reasoning. When it needs to output a structured JSON plan, it switches to direct mode. It scored 91.3% on AIME 2026 — among the best open models for complex reasoning. Apache 2.0 licensed.

Mistral Large 3 (675B total, 41B active) is the EU sovereignty pick. Built by a French company, Apache 2.0 licensed, with strong instruction following and structured output. It doesn’t match Qwen 3.5 on raw reasoning benchmarks, but it’s reliable and predictable — which matters more for planning than peak intelligence.

Budget option: Llama 3.3 70B at INT4 quantization (~43 GB VRAM). It matches the planning capability of models 5x its size because planning is more about instruction following than raw intelligence.

Searcher: Mistral Small 3.2 or Qwen 3.5-9B

Speed is everything here. Your searcher runs 50-100 times per research session. Every 100ms of latency adds up.

Mistral Small 3.2 (24B dense) hits 190 tokens per second — among the fastest open models. It fits on a single GPU at INT4 (~14 GB), has strong function calling for tool use, and handles query generation reliably. EU-origin, Apache 2.0. This is the default choice.

Qwen 3.5-9B is surprisingly capable for its size. It matches GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5) despite being 13x smaller. At 9B parameters, it runs on consumer hardware and leaves VRAM headroom for other models.

Reasoner: DeepSeek R1-0528 or GLM-4.7

This is where you spend your compute budget. The reasoner’s quality directly determines your output quality.

DeepSeek R1-0528 is purpose-built for reasoning. It scored 87.5% on AIME 2025 and 81.0% on GPQA Diamond (graduate-level science). The 0528 update cut hallucination by 45-50% compared to the original R1. The catch: it’s from a Chinese lab, and each response averages 23K tokens of chain-of-thought. Budget for the token cost.

GLM-4.7 (355B total, 32B active) has the highest single-benchmark scores among open models: 95.7% AIME, 86.0% GPQA, 73.8% SWE-bench. It’s MIT-licensed. The downside is a smaller community and sparser self-hosting documentation compared to DeepSeek or Qwen.

Budget option: DeepSeek R1-Distill-32B at INT4 (~24 GB). It’s a distilled version of the full R1, and it performs comparably to OpenAI’s o1-mini on reasoning tasks. Runs on a single RTX 4090.

Three luminous orbs of different sizes floating in warm amber space — the smallest sharp and fast, the medium one detailed and structured, the largest deep and radiant with internal complexity

Writer: Mistral Large 3 or Llama 3.3 70B

Writing quality is subjective, but some models consistently produce cleaner prose than others.

Mistral Large 3 writes well. The French AI tradition has always prioritized language quality, and it shows — Mistral models produce natural, well-structured text with good paragraph flow. Its 260K context window handles long reports without truncation.

Llama 3.3 70B benefits from the largest fine-tuning ecosystem in open-source AI. There are writing-optimized fine-tunes available, and the base model already produces clean, direct prose. It can share VRAM with your planner since they rarely run simultaneously.

Important: Don’t use reasoning models (R1, R1-0528) for writing. They produce verbose chain-of-thought output — “Let me think about this step by step…” — instead of the clean paragraphs your report needs.

The reasoner will consume 10-50x more tokens than the searcher per session. When budgets are tight, the reasoner is where model trade-offs matter most.

Aggregator: Llama 4 Scout or Maverick

The aggregator’s job is to hold everything in memory at once and synthesize it. Context window size is the constraint.

Llama 4 Scout (109B total, 17B active) has a 10-million-token context window — industry-leading. In practice, you won’t use anywhere near that, but having headroom means you never have to worry about truncating intermediate results. It fits on a single H100 at INT4 (~65 GB).

Llama 4 Maverick (400B total, 17B active) offers 1M context with stronger reasoning than Scout. If your aggregation needs to resolve contradictions between sources (not just concatenate them), Maverick’s extra expert diversity pays off.

Budget option: Llama 3.3 70B handles aggregation fine for small-to-medium research tasks. Its 128K context is sufficient when you’re combining 5-10 research threads rather than 50.

Reranker: Phi-4 or Mistral Small 3.2

Reranking is the one role where tiny models win. You’re calling the reranker hundreds of times — every search result gets scored. Cost per call matters more than peak intelligence.

Phi-4 (14B dense) fits in 7 GB at INT4. It has strong reasoning for its size (93.7% GSM8K) and MIT license. Microsoft-origin, clean sovereignty profile. This is the default choice.

For pure relevance scoring, consider dedicated cross-encoder models like BGE Reranker or Jina Reranker instead of generative LLMs. They’re faster and better calibrated for the specific task of “is this result relevant to this query?”

Three budget tiers

Not everyone has a rack of H100s. Here’s how to build a research agent stack at three different budget levels:

Tier 1: EU sovereign (2x H100 nodes, ~€46K/month)

RoleModelVRAMWhy
PlannerMistral Large 3~340 GB INT4EU-origin, strong planning
SearcherMistral Small 3.2~14 GB INT4Fast, EU-origin
ReasonerQwen 3 235B-A22B~120 GB INT4Best reasoning at medium sovereignty risk
WriterMistral Large 3(shared)Strong prose, 260K context
AggregatorLlama 4 Scout~65 GB INT410M context window
RerankerMistral Small 3.2(shared)Fast inference

Every model touching user intent (planner, writer) is EU-origin. The reasoner accepts medium sovereignty risk for quality — the model weights are static artifacts running on your hardware, with no data flowing back to China.

Three models covering six roles on a single H100 — the budget tier proves you do not need a rack of GPUs to run a capable research pipeline.

Tier 2: Max quality (3-4x H100 nodes, ~€92K/month)

RoleModelWhy
PlannerQwen 3.5 397B-A17BBest planning with thinking mode
SearcherMistral Small 3.2Speed over quality for search
ReasonerGLM-4.7 or DeepSeek R1-0528Highest reasoning benchmarks
WriterMistral Large 3Best prose quality
AggregatorLlama 4 Maverick1M context + strong reasoning
RerankerQwen 3.5-9BBest quality/speed ratio

This configuration prioritizes output quality over sovereignty concerns. GLM-4.7 hits 95.7% on AIME — you’re not getting that from any EU-origin model yet.

Tier 3: Budget (1x H100 or 2x RTX 4090, ~€4.6K/month)

RoleModelVRAM
Planner + Writer + AggregatorLlama 3.3 70B INT4~43 GB
Searcher + RerankerMistral Small 3.2 INT4~14 GB
ReasonerDeepSeek R1-Distill-32B INT4~24 GB

Three models covering six roles. The planner, writer, and aggregator share Llama 3.3 70B (they don’t run concurrently). The searcher and reranker share Mistral Small 3.2. Total unique VRAM: ~81 GB. Fits on a single H100 with room to spare, or two RTX 4090s with careful scheduling.

A constellation of warm golden nodes at three different scales — a dense cluster of many small bright points, a medium arrangement of fewer larger spheres, and a single massive luminous core — representing the three budget tiers

The inference engine question

The model is only half the equation. The inference engine — the software that loads the model and serves predictions — makes the difference between “works in a notebook” and “handles production traffic.”

vLLM is the default choice. It has the broadest model support, PagedAttention for memory efficiency, and continuous batching for high concurrency. If you’re running multiple models and need them all to coexist on the same GPU cluster, start here.

SGLang is better for MoE models — particularly DeepSeek’s Multi-head Latent Attention architecture. If your stack includes DeepSeek V3, R1, or Qwen models with mixture-of-experts, SGLang’s RadixAttention and prefix caching give you better throughput.

Ollama is for single-GPU setups running small models. If your reranker is Phi-4 on a consumer GPU, Ollama is the easiest path from download to API endpoint. Don’t use it for multi-GPU production serving.

TensorRT-LLM is the NVIDIA-specific option. If you’re running H100s or B200s and want maximum hardware utilization, TensorRT’s kernel fusion and FP8 support extract performance that generic engines can’t match. The trade-off is vendor lock-in and a steeper learning curve.

For most research agent stacks: vLLM for the big models, Ollama for the small ones. Add SGLang if you’re running DeepSeek MoE models.

What to do now

  1. Pick your budget tier. Be honest about your GPU budget and operational capacity. Tier 3 with three models covers most research use cases adequately.

  2. Start with one model, not six. Run Llama 3.3 70B across all roles first. Measure where it’s too slow (searcher), where it’s not smart enough (reasoner), and where it’s fine. Then specialize.

  3. Benchmark your actual workload. Synthetic benchmarks (MMLU, AIME, HumanEval) measure general capability, not your specific use case. Run your research questions through the candidate models and compare outputs.

  4. Quantize aggressively. INT4 quantization cuts VRAM by 4x with minimal quality loss for most roles. Only your reasoner — where every percentage point matters — might justify FP8 or FP16.

  5. Set up vLLM first. Install it, load your first model, and get a working API endpoint. Everything else builds on top of a functioning inference server.

  6. Share models across roles. Your planner and writer don’t run at the same time. Neither do your searcher and reranker. Put them on the same model and save half your VRAM.

  7. Monitor token costs per role. The reasoner will consume 10-50x more tokens than the searcher per research session. If your budget is tight, the reasoner is where you make model trade-offs.

  8. Watch the landscape. Llama 4 Behemoth, GLM-5, and the OpenEuroLLM initiative are all expected in the next few months. The model you pick today might not be the best choice in six months — design your pipeline to swap models without rewriting code.