How to choose between serverless vs. dedicated containers for LLM hosting

Compare serverless vs. dedicated containers for LLM hosting. Learn how sandboxes complete the architecture when agents generate and run code.

10 min

You've deployed a coding agent that generates working Python scripts in development. Users start testing it and the experience breaks. The LLM takes a second to generate the first token. The execution environment needs three more seconds to spin up. So by the time the user sees output, four seconds have passed and the interaction feels broken.

The problem isn't the model. It's the infrastructure underneath it. Coding agents need two distinct layers working together: an inference layer that runs the LLM and an execution layer that runs the code the LLM generates. Most engineering teams spend weeks optimizing inference hosting and overlook the execution side entirely.

This guide compares serverless and dedicated infrastructure for hosting coding agent LLMs. It also covers the execution layer that completes the architecture when agents generate and run code in production.

How coding agents use LLMs differently

Coding agents don't use LLMs the same way chatbots do. A chatbot sends a prompt, gets a response, and displays it. A coding agent sends a prompt, gets code back, and executes that code in an isolated environment. Then it reads the result, and often sends another prompt based on what happened.

This generate-execute-observe-iterate loop creates two infrastructure requirements. The inference layer handles token generation. The execution layer runs the code safely and returns results. And both layers add latency to every interaction.
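
The loop can be sketched in a few lines. This is a minimal illustration, not a production pattern: `generate` and `execute` are hypothetical stand-ins for an inference client and a sandboxed execution environment (a real sandbox would never `exec` untrusted code in-process).

```python
import contextlib
import io


def generate(prompt: str) -> str:
    """Stand-in for the inference layer; returns code to run."""
    return "print(6 * 7)"


def execute(code: str) -> str:
    """Stand-in for the execution layer; captures the code's output.

    A real execution layer runs this in an isolated environment,
    never in-process like this sketch does.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()


def agent_step(task: str, max_iters: int = 3) -> str:
    """Generate-execute-observe-iterate: each pass feeds results back in."""
    prompt = task
    result = ""
    for _ in range(max_iters):
        code = generate(prompt)                       # inference layer
        result = execute(code)                        # execution layer
        prompt = f"{task}\nLast output:\n{result}"    # observe, then iterate
    return result


print(agent_step("Compute 6 * 7").strip())  # → 42
```

Every pass through this loop pays both layers' latency, which is why the rest of this guide treats them as separate architecture decisions.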

The inference side has its own complexity. Production coding agents use hierarchical model architectures. A frontier model like Claude or DeepSeek V3 handles reasoning and code generation. Meanwhile, smaller specialized models handle fast apply, which is the process of merging AI-generated edits into existing files. Morph and Relace process edits at 10,000+ tokens per second using purpose-built 3-8B parameter models.

NVIDIA’s analysis of real-world agent frameworks (MetaGPT, Open Operator, Cradle) finds that small language models (<10B parameters) can replace roughly 40–70% of LLM calls in these systems without degrading task performance.

The execution side handles everything the model produces. When an agent generates a Python script, something has to run it. When it edits a file, something has to apply the change and confirm it compiles. And when it installs a dependency, something has to execute that command in isolation. This layer needs security isolation, low latency, and often persistent state across interactions.

Choosing how to host the inference layer is only half the architecture decision.

Serverless infrastructure for coding agent LLMs

Serverless platforms handle scaling, patching, and capacity planning automatically. For coding agent inference, the tradeoff is clear: lower operational burden in exchange for cold start latency and resource constraints.

CPU inference on AWS Lambda

AWS Lambda can run models up to about 3B parameters using CPU inference via llama.cpp. Lambda provides up to 10 GB of memory with proportional CPU allocation and a 15-minute execution timeout.

At maximum configuration, a 1.5B 4-bit quantized model generates about 30 tokens per second. That's usable for code completion but 10–100x slower than GPU inference. For example, a setup with 30,000 inference calls per month at this configuration would cost approximately $91.
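
A back-of-envelope model shows where that figure comes from. The GB-second rate below is Lambda's published x86 compute price at the time of writing, and the ~18-second average duration is an assumption consistent with ~30 tokens per second of CPU inference; treat both as inputs to adjust, not quoted facts.

```python
# Rough Lambda compute-cost model (assumed figures, not a quote).
GB_SECOND_PRICE = 0.0000166667  # Lambda x86 compute price per GB-second
MEMORY_GB = 10                  # maximum Lambda memory configuration
AVG_DURATION_S = 18             # assumption: CPU inference at ~30 tok/s
CALLS_PER_MONTH = 30_000

compute_cost = CALLS_PER_MONTH * AVG_DURATION_S * MEMORY_GB * GB_SECOND_PRICE
print(f"~${compute_cost:.0f}/month")  # → ~$90/month (request fees add ~$0.01)
```

The takeaway: at this tier, duration dominates cost, so shaving seconds off inference time matters more than trimming memory.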

Cold starts create the primary challenge. Unoptimized Lambda functions deploying quantized 1.5B models take 16.68 seconds to initialize. Lambda SnapStart restores from cached Firecracker microVM memory snapshots and cuts that to 1.39 seconds. That's a 92% reduction, but still over a second before the first token.

This tier works for autocomplete, fill-in-the-middle suggestions, and fast apply sub-agents. It doesn't work for interactive coding sessions where users expect real-time feedback.

GPU inference on Google Cloud Run

Cloud Run GPU support reached general availability in June 2025 with NVIDIA L4 GPUs. L4 instances provide 24 GB VRAM, enough for models up to about 13B parameters. An RTX PRO 6000 option with 96 GB VRAM is in preview for larger models.

Cloud Run supports scale-to-zero. When no requests arrive, you pay nothing. The catch is cold start time. Google reports that spinning up from zero takes about 19 seconds total for a 4B model, including GPU initialization, model loading, and first inference. Smaller 2B quantized models achieve around 11 seconds.

At $0.67 per hour for L4 in Tier 1 regions (plus separate CPU and memory charges), a service peaking at two instances daily costs roughly $822 per month. For coding agents that need interactive latency, you'll likely keep minimum instances warm, which reduces the scale-to-zero cost advantage.
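
The warm-instance tradeoff is easy to quantify. The sketch below uses only the L4 GPU rate cited above (CPU and memory are billed separately), and the six-active-hours figure is an arbitrary assumption for illustration.

```python
# GPU-only cost: one warm min-instance vs. scale-to-zero (assumed usage).
L4_HOURLY = 0.67          # Tier 1 on-demand L4 rate; CPU/memory billed separately
HOURS_PER_MONTH = 24 * 30

always_warm = L4_HOURLY * HOURS_PER_MONTH
print(f"1 warm L4 (GPU only): ~${always_warm:.0f}/month")   # → ~$482/month

# Scale-to-zero pays only for active hours, but every wake-up
# adds an ~11-19 s cold start before the first token.
ACTIVE_HOURS_PER_DAY = 6  # assumption for illustration
scaled = L4_HOURLY * ACTIVE_HOURS_PER_DAY * 30
print(f"{ACTIVE_HOURS_PER_DAY} active h/day: ~${scaled:.0f}/month")  # → ~$121/month
```

The gap between those two numbers is what you give back when interactive latency forces you to keep a minimum instance warm.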

Inference API providers

For teams that don't want to manage inference infrastructure at all, API providers offer the lowest per-token costs. Groq achieves 840 tokens per second for Llama 3.1 8B at $0.05 per million input tokens. Cerebras reaches 2,957 tokens per second for larger models. Together AI hosts 200+ open-source models at $0.20 to $0.90 per million tokens.

The tradeoff is control. You can't fine-tune models, you depend on the provider's availability, and you send your users' code through a third party.

Dedicated container infrastructure for coding agent LLMs

When teams need full control over inference latency, model customization, or data residency, dedicated infrastructure becomes the path forward. This is where teams with consistent high-volume workloads and platform engineering capacity end up.

Inference engines and GPU requirements

Two open-source inference engines dominate production deployments. vLLM (UC Berkeley) introduced PagedAttention, which treats the key-value cache like paged virtual memory. It achieves 14–24x higher throughput than HuggingFace Transformers with near-zero memory waste.

SGLang (from the Chatbot Arena team) emphasizes fast prefix caching (RadixAttention) and reports higher throughput than vLLM on some multi‑turn workloads with Llama 3‑class models.

GPU requirements follow predictable scaling. A 7B parameter model needs about 17 GB VRAM at FP16, or roughly 4 GB at INT4 quantization. A 13B model needs about 31 GB at FP16. In one benchmark for a coding model, AWQ quantization with Marlin kernels delivered roughly 1.5–1.6x higher tokens per second than the FP16 baseline, with minimal quality loss.
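
The scaling rule above reduces to parameter count times bytes per parameter, plus headroom. The ~20% overhead factor for KV cache and activations is an assumption chosen to match the figures cited; real overhead varies with context length and batch size.

```python
# Rough VRAM estimate: parameters x bytes per parameter, plus ~20%
# headroom for KV cache and activations (assumed factor).
def vram_gb(params_billion: float, bytes_per_param: float,
            overhead: float = 1.2) -> float:
    return params_billion * bytes_per_param * overhead


print(f"7B @ FP16: ~{vram_gb(7, 2):.0f} GB")     # → ~17 GB
print(f"7B @ INT4: ~{vram_gb(7, 0.5):.0f} GB")   # → ~4 GB
print(f"13B @ FP16: ~{vram_gb(13, 2):.0f} GB")   # → ~31 GB
```

This is why quantization changes the hardware conversation entirely: INT4 moves a 7B model from a data-center GPU requirement to something a single consumer card can hold.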

When dedicated makes financial sense

H100 cloud prices fell sharply through 2025. On‑demand rates on major clouds dropped from roughly $7–11/hour into the $3–4/hour range, and specialist providers often fell between about $1.50–3/hour by late 2025.

The breakeven depends on utilization. Dedicated infrastructure becomes cheaper above roughly 8,000 conversations per day or 500,000 tokens per minute. At 10% utilization, self-hosted cost per token rises to match premium APIs.
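
The utilization effect can be made concrete. The figures below ($3/hour for an H100, 1,000 aggregate tokens per second at full load) are illustrative assumptions, not benchmarks; the point is the shape of the curve, not the exact numbers.

```python
# Self-hosted cost per million output tokens as a function of utilization
# (assumed figures: $3/hr GPU, 1,000 tok/s aggregate at full load).
def cost_per_m_tokens(hourly_rate: float, tok_per_s: float,
                      utilization: float) -> float:
    tokens_per_hour = tok_per_s * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1e6


print(f"50% utilization: ${cost_per_m_tokens(3.0, 1000, 0.5):.2f}/M tokens")
print(f"10% utilization: ${cost_per_m_tokens(3.0, 1000, 0.1):.2f}/M tokens")
```

A fivefold drop in utilization means a fivefold rise in cost per token, which is how an idle self-hosted GPU ends up matching premium API pricing.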

Operational costs are real

Platform engineers command annual salaries of $150K–$200K+, and mid-sized deployments typically need one to three full-time engineers. Even a two-person team puts you at $300K–$400K+ annually before any infrastructure costs. GPUs commonly sit underutilized during off-peak periods, and without active management, costs spiral.

Dedicated infrastructure is the right choice when you have consistent volume, platform engineering talent, and specific requirements around model customization or data residency. For teams without those conditions, serverless or API providers avoid the operational overhead.

Comparing serverless vs. dedicated for coding agent inference

The choice between serverless and dedicated infrastructure depends on three factors: traffic volume, latency requirements, and team expertise.

| Factor | Serverless hosting | Dedicated infrastructure hosting |
| --- | --- | --- |
| Best use case | Variable traffic, small models (sub-3B on Lambda, sub-13B on Cloud Run GPU) | Consistent high volume, custom models, data residency requirements |
| Cold start | 1.39 s with SnapStart (Lambda), 11–19 s from zero (Cloud Run GPU) | None when pre-warmed |
| Cost model | Pay per invocation or per second | Fixed infrastructure plus engineering headcount |
| GPU access | Limited (Lambda: none; Cloud Run: single L4 or RTX PRO 6000) | Full control over GPU type, count, and configuration |
| Operational burden | Minimal | Platform engineering team required |
| Cost breakeven | Cheaper below roughly 1M requests per month for bursty workloads | Cheaper above roughly 8,000 conversations per day at 50%+ utilization |

Both approaches solve the inference side of the problem. Neither addresses the other half of coding agent infrastructure: where the generated code actually runs.

The missing layer: code execution for coding agents

Choosing how to host your LLM is half the architecture decision. The other half is where generated code runs.

Every coding agent follows the same loop. The LLM generates code. Something executes it. Results feed back to the LLM. The LLM iterates. That "something" is the execution layer. It needs to handle untrusted code from a model that might generate anything from a simple print statement to a recursive directory deletion.

Production coding agents like Cursor, Bolt.new, Lovable, and Devin all implement this two-layer architecture. Cursor uses Firecracker microVMs on AWS for code execution alongside H100 clusters and API providers for inference. Bolt.new runs everything in-browser using WebContainers. Lovable uses separate sandbox infrastructure at scale.

The execution layer has four requirements that differ from inference hosting.

Security isolation

Generated code runs in environments where one user's execution can't access another user's data. For multi-tenant products, this means hardware-enforced boundaries, not shared-kernel containers. The Firecracker USENIX paper documents microVMs achieving less than 125ms boot time with less than 5 MiB memory overhead while running each workload in its own guest kernel.

Low latency

Users expect real-time feedback when code runs. If the execution environment takes three seconds to spin up every time the agent needs to run a script, the experience breaks regardless of how fast inference is.

State persistence

Coding agents work on projects, not isolated scripts. The execution environment needs to maintain cloned repositories, installed dependencies, and file system state across interactions. Rebuilding this state from scratch on every invocation adds minutes of overhead.

Proximity to inference

When the agent and its execution environment run on separate infrastructure, every tool call adds network latency. An agent making five tool calls per interaction accumulates 250–500ms of overhead from network round trips alone.

Complete your coding agent stack with a perpetual sandbox

Perpetual sandbox platforms address the execution layer specifically. They don't host LLMs. They run the code that LLMs generate.

When a sandbox finishes processing a request, it transitions to standby mode instead of shutting down. The platform captures a complete snapshot of the virtual machine, including memory, filesystem, and running processes. Compute charges drop to zero. When the next request arrives, the platform restores from that snapshot.

Perpetual sandbox platforms like Blaxel resume sandboxes from standby in under 25ms with exact previous state intact. Sandboxes remain in standby indefinitely with no compute cost, unlike competitors that delete after 30 days. The platform uses microVM isolation (the same technology as AWS Lambda) rather than containers, providing hardware-enforced security boundaries for multi-tenant deployments.

For coding agents specifically, Blaxel provides more than sandboxes. Agents Hosting co-locates agent logic alongside sandboxes on the same infrastructure to minimize network latency between the agent and its execution environment.

Teams building coding agents with Blaxel pair it with whatever inference strategy fits their scale. For example, you could use Cerebras, Groq, or Together AI for inference at low volume, or run vLLM on dedicated GPUs at high volume. Blaxel’s Model Gateway lets your agent connect to inference providers like Cerebras through a unified layer that handles telemetry, access control, and token cost control. The execution layer works the same regardless of where inference happens.

Sign up free to test sandbox performance with your coding agent, or explore the docs to see how the execution layer integrates with your inference stack.

FAQs about serverless vs. dedicated containers for LLM hosting

Do I need separate infrastructure for inference and code execution?

Yes. Inference infrastructure (GPU clusters, serverless endpoints, API providers) optimizes for token generation throughput. Code execution infrastructure optimizes for security isolation, state persistence, and low-latency process management. Production coding agents like Cursor and Devin run these as separate layers.

What models work best for coding agents?

The dominant pattern uses a frontier model (Claude, DeepSeek V3, GPT-4) for reasoning and code generation paired with smaller specialized models for fast apply. Qwen 2.5 Coder spans six sizes from 0.5B to 32B parameters, with the 32B variant matching GPT-4o on code editing benchmarks. For self-hosted inference, the 7B and 14B variants offer strong performance on a single GPU.

When should I use serverless vs. dedicated for the inference layer?

Serverless works well below roughly 1 million requests per month with variable traffic. Lambda handles sub-3B models via CPU inference. Cloud Run GPU supports larger models with scale-to-zero.

Dedicated infrastructure becomes more cost-effective above 8,000 conversations per day at consistent utilization, especially after the sharp drop in H100 pricing through 2025.

How does the execution layer affect total latency?

Execution layer latency adds directly to every tool call. An agent that makes five tool calls per interaction accumulates the execution environment's startup time multiplied by five. Traditional serverless platforms add two to five seconds per cold start. Perpetual sandbox platforms like Blaxel add under 25ms per resume from standby. The difference compounds across multi-step agent workflows.
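
The compounding is simple arithmetic, using the figures cited above (3 seconds is taken as a mid-range traditional cold start):

```python
# Execution-environment startup compounding across one agent interaction.
TOOL_CALLS = 5
COLD_START_S = 3.0   # mid-range of the 2-5 s traditional serverless cold start
RESUME_S = 0.025     # perpetual sandbox resume from standby (<25 ms)

print(f"cold starts:     {TOOL_CALLS * COLD_START_S:.1f} s")   # → 15.0 s
print(f"standby resumes: {TOOL_CALLS * RESUME_S:.3f} s")       # → 0.125 s
```

Fifteen seconds of pure infrastructure overhead per interaction versus an eighth of a second: that gap, not model quality, is often what separates an agent that feels interactive from one that feels broken.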