How to set up a code execution environment for autonomous agents

Learn how to build a secure, fast, stateful code execution environment for autonomous agents. Architecture decisions, setup steps, and validation checks.


You've built an agent that works in development. It parses documents, generates code, and executes it correctly. Then you deploy to production, and the first user interaction takes four seconds to respond. Every tool call waits on a network hop between the agent and its execution environment, and those hops stack.

Production agents that generate and run untrusted code need a dedicated execution environment that handles three things at once: isolation strong enough for untrusted LLM output, resume times fast enough to keep multi-step tool calls inside a sub-second budget, and persistent state that carries context across a session.

Without all three, the gap between "works in dev" and "works in production" stays open. This guide covers the architecture decisions you need upfront, a five-step setup sequence, and the validation checks that catch failures before your users do.

Why autonomous agents need a purpose-built code execution environment

Generic containers hit three constraints that make them inadequate for agents executing untrusted LLM output. Each constraint maps to a failure mode that production teams encounter repeatedly.

Untrusted code from a model bypasses kernel boundaries in containers. A PR review agent cloning a 50,000-file repo executes test suites generated by a model, and that code hasn't been reviewed by a human. Untrusted code attempts filesystem traversal, network exfiltration, and resource exhaustion. Containers share the host kernel across all tenants, and container escape vulnerabilities are well documented in CNCF security guidance. A single exploit on a shared kernel reaches every other workload on that host.

Real-time agent loops can't tolerate multi-second cold starts. A coding assistant streaming a live preview needs to spin up an environment, run code, and return results before the user perceives a delay. Jakob Nielsen's research establishes 100 milliseconds as the ceiling for a system feeling instantaneous. Docker containers can take hundreds of milliseconds at p50 even for a minimal service. That's already five times the instantaneous-response budget, before any code runs.

Multi-step tasks break when state resets between invocations. A research agent running analysis scripts on a loaded dataset needs that dataset in memory across tool calls. If the environment resets between invocations, the agent reloads data every time, and a two-second task becomes a 30-second one. Ephemeral environments waste compute and degrade user experience by forcing context rebuilds. These three constraints shape the architecture decisions covered next.

Architecture decisions before you write any code

These are upfront choices that become expensive to reverse once deployed. Getting them wrong means re-architecting under load.

Pick an isolation model

Container-based isolation shares the host kernel across all workloads. Containers start fast, but the shared kernel means a runtime vulnerability can grant access to the host and every other tenant. The Cloud Native Computing Foundation (CNCF) security whitepaper explicitly recommends against running "disparate data-sensitive workloads" on the same OS kernel. Containers work for trusted code your team wrote. For LLM-generated code, the risk profile changes.

MicroVMs use the same isolation approach AWS Lambda relies on to run untrusted workloads on shared fleets. Firecracker, the open-source microVM technology behind Lambda, exposes only five emulated devices and treats all vCPU threads as running malicious code from the start. Hardware-enforced isolation removes a category of risk that namespace separation cannot.

Teams building PR review agents and code generation tools consistently choose microVM isolation once they reach production. The security audit that approves shared-kernel execution for LLM-generated code is the audit that hasn't been completed yet.

Decide on execution topology

Remote sandboxes place the agent in one region and the sandbox in another. Each tool call crosses a network boundary. AWS measures inter-region round-trip times between 60 and 193 milliseconds. That's pure network overhead before any code executes.

Co-located execution places the agent and sandbox on the same infrastructure. Same-availability-zone round-trip times drop to sub-millisecond with enhanced networking. The difference compounds across sequential tool calls.

An agent making five tool calls at 100 milliseconds of overhead per call adds 500 milliseconds of pure network latency. None of that time goes to processing. The same five calls co-located add only a few milliseconds. For coding assistants and live previews, this gap determines whether a turn fits a sub-second budget or pushes into multi-second territory.
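That arithmetic is simple enough to sketch (the numbers are the illustrative figures from the text, not benchmarks):

```typescript
// Per-call network overhead compounds linearly across sequential tool calls.
// None of this time goes to processing; it is pure latency the user waits on.
function turnNetworkOverheadMs(toolCalls: number, perCallOverheadMs: number): number {
  return toolCalls * perCallOverheadMs;
}

const remote = turnNetworkOverheadMs(5, 100);    // cross-region: 500 ms per turn
const coLocated = turnNetworkOverheadMs(5, 0.5); // same-AZ: ~2.5 ms per turn

console.log({ remote, coLocated });
```

Five calls at cross-region latencies burn half of a one-second turn budget before any code runs; the same calls co-located are a rounding error.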

Choose a state persistence model

State persistence falls into three tiers, each matched to a different workload pattern.

  • Ephemeral execution: A fresh environment starts per call with no state carried forward. Works for one-shot code execution where each invocation is independent. Any large dataset or dependency tree loads from scratch every time.
  • Persistent sandboxes: The filesystem and memory persist between calls via standby mode. When the agent's next tool call arrives, the environment resumes with everything intact. Fits multi-turn coding sessions where the agent reads files, edits them, runs tests, and iterates.
  • Guaranteed long-term storage: Persistent volumes hold data that must survive beyond the sandbox's lifecycle. Research tasks spanning days or weeks need volumes to retain datasets across sandbox restarts and deletions.

Persistence only pays off when resume from standby is fast enough. Otherwise, you pay for idle compute while waiting on the next call. A sandbox that takes seconds to resume forces a choice between keeping it running or accepting a cold start penalty. Fast resume from standby eliminates that trade-off entirely.
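The tier choice above can be captured as a small decision helper (the tier names are this guide's labels, not a platform API):

```typescript
// Map a workload pattern to one of the three persistence tiers described above.
type Tier = "ephemeral" | "persistent-sandbox" | "volume";

function pickTier(opts: { multiTurn: boolean; survivesDeletion: boolean }): Tier {
  if (opts.survivesDeletion) return "volume";      // data must outlive the sandbox
  if (opts.multiTurn) return "persistent-sandbox"; // state carries across tool calls
  return "ephemeral";                              // independent one-shot runs
}
```

A one-shot code runner lands on `ephemeral`, a multi-turn coding session on `persistent-sandbox`, and a weeks-long research task on `volume`.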

How to set up the code execution environment

These five steps go in order. Each builds on the one before it. By the end, you'll have a working environment the agent can call in production.

1. Define the agent's execution contract

The execution contract specifies the interface between the agent and the sandbox. Pin this down first. It determines which frameworks integrate cleanly and how the agent communicates with its environment.

The agent sends a code payload, the target language, a timeout, input data or environment variables, and any files the sandbox needs. The sandbox returns stdout, stderr, files written, an exit code, and execution duration. Every tool call follows this shape.

Here's a minimal example using TypeScript:

```typescript
// The agent's tool-call contract
const result = await sandbox.process.exec({
  name: "run-analysis",
  command: "python3 /app/analysis.py",
  waitForCompletion: true,
  timeout: 30000,
});

// What comes back
console.log(result.stdout);   // execution output
console.log(result.stderr);   // error output
console.log(result.exitCode); // 0 = success
```

When a tool call exceeds its timeout, the sandbox kills the process and returns a non-zero exit code with stderr output. The agent should check exitCode before using stdout. This prevents downstream hallucination from partial or missing output.
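A minimal guard along these lines, assuming a result object shaped like the contract above (`safeOutput` is a hypothetical helper, not an SDK method):

```typescript
// Shape of the sandbox's response, per the execution contract.
interface ExecResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

// Only hand stdout back to the model when the call actually succeeded;
// otherwise surface the failure so the agent can react instead of hallucinating.
function safeOutput(result: ExecResult): string {
  if (result.exitCode !== 0) {
    return `Execution failed (exit ${result.exitCode}): ${result.stderr}`;
  }
  return result.stdout;
}
```

Routing every tool result through a guard like this keeps partial output from a killed process out of the model's context.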

This contract maps cleanly to common agent frameworks and custom frameworks. The sandbox also exposes a Model Context Protocol (MCP) server so agents that support MCP can discover and call filesystem and process tools dynamically. Define the contract before choosing a framework. Building it around a specific framework's abstractions means rewriting the integration layer if you switch.

2. Provision the sandbox runtime

You have two options: a managed microVM platform or self-hosted Firecracker on the Kernel-based Virtual Machine (KVM) hypervisor. The managed path ships a working sandbox in minutes. Self-hosting is viable but demands dedicated infrastructure engineering, which is why most production teams take the managed route.

Start with Blaxel, the perpetual sandbox platform, through its TypeScript SDK:

```typescript
import { SandboxInstance } from "@blaxel/core";

const sandbox = await SandboxInstance.createIfNotExists({
  name: "coding-agent-sandbox",
  image: "blaxel/base-image:latest",
  memory: 4096,
  region: "us-pdx-1",
});
```

Six lines, and the platform handles kernel tuning, jailer configuration, snapshot storage, and networking. MicroVM isolation and production-ready defaults ship out of the box. Agent features reach users within a quarter instead of after a multi-quarter infrastructure build. The trade-off is vendor dependency.

Self-hosting Firecracker remains an option for teams with the engineering bandwidth to own it. Firecracker requires KVM, so it runs on bare metal or on virtual machines with nested virtualization enabled; standard VMs without nested virtualization can't expose KVM. Beyond procurement, the operational surface is substantial:

  • SMT and KSM configuration: Disable Simultaneous Multithreading (SMT) and Kernel Samepage Merging (KSM) for multi-tenant security.
  • Per-VM networking: Manage TAP (virtual network interface) device lifecycles and IP pools across every instance.
  • Seccomp validation: Validate seccomp profiles against each Firecracker release.
  • Snapshot management: Build a snapshot management system from scratch, since Firecracker provides no functionality to package or manage snapshots on the host, though it does expose snapshot creation and load primitives.

Cursor built a dedicated Rust orchestrator, Anyrun, to launch agents with process isolation on Amazon EC2 and Firecracker. Expect months of work and ongoing maintenance to reach comparable ground.

3. Configure the sandbox lifecycle

Three lifecycle parameters control cost and responsiveness. Set them explicitly rather than relying on defaults.

Startup timeout caps how long the platform waits for the sandbox to become ready. Blaxel sandboxes resume from standby in under 25 milliseconds, so keep startup timeouts short: five to ten seconds. Longer timeouts mask provisioning failures that should surface immediately.

Idle-to-standby transition window determines how quickly the sandbox suspends after the last network activity. Blaxel transitions sandboxes to standby after 15 seconds of inactivity. Competitor defaults vary, with some platforms documenting idle or session timeouts of 15 minutes, 30 minutes, or longer, which means paying for idle compute between tool calls. A short window prevents runaway billing while preserving state, since the sandbox resumes from standby quickly on the next call.
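A rough model of the billing effect, using illustrative numbers rather than published pricing:

```typescript
// Billable idle seconds per gap between tool calls, assuming the sandbox
// bills until it suspends. gapSeconds is the pause before the next call;
// idleWindowSeconds is the idle-to-standby transition window.
function idleBilledSeconds(gapSeconds: number, idleWindowSeconds: number): number {
  return Math.min(gapSeconds, idleWindowSeconds);
}

// A 5-minute pause between tool calls:
console.log(idleBilledSeconds(300, 15));   // 15-second window: 15 s billed
console.log(idleBilledSeconds(300, 1800)); // 30-minute window: 300 s billed
```

With a short window, a long pause costs seconds of idle compute; with a long default, the same pause bills for its full duration.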

Maximum standby duration controls how long the sandbox retains state. For agents that run across days or weeks, perpetual standby removes the need to rebuild state. Set TTLs at creation for sandboxes that should clean up automatically:

```typescript
const sandbox = await SandboxInstance.create({
  name: "research-sandbox",
  image: "blaxel/base-image:latest",
  memory: 4096,
  region: "us-pdx-1",
  lifecycle: {
    expirationPolicies: [
      { type: "ttl-idle", value: "60d", action: "delete" },
    ],
  },
});
```

For data that must persist beyond the sandbox's lifecycle, use volumes. Standby preserves state while the sandbox exists, but deletion erases everything. Volumes guarantee long-term retention.

4. Co-locate the agent with the sandbox

This step operationalizes the topology decision from earlier. Deploying the agent's reasoning loop on the same infrastructure as the sandbox eliminates inter-region network overhead on every tool call. For an agent making multiple remote calls per user turn, remote topology adds tens to hundreds of milliseconds per call, depending on network distance, connection reuse, and service handoffs. Co-located infrastructure cuts that overhead to sub-millisecond levels.

The implementation: host the agent as a serverless endpoint. It calls the sandbox through a local SDK rather than a public network URL. The agent binds to HOST and PORT environment variables injected by the hosting platform and connects to the sandbox's MCP server at the sandbox's base URL (https://<SANDBOX_BASE_URL>/mcp).

```typescript
// Agent hosted on Blaxel Agents Hosting
// Sandbox created on the same infrastructure
const sandbox = await SandboxInstance.createIfNotExists({
  name: "user-session-sandbox",
  image: "blaxel/base-image:latest",
  memory: 4096,
  region: "us-pdx-1",
});

// MCP server available at sandbox.metadata.url
// No public network hop: agent and sandbox share infrastructure
const mcpUrl = `${sandbox.metadata.url}/mcp`;
```

Blaxel's Agents Hosting paired with its Sandbox product is one concrete way to do this. The platform co-locates agents and sandboxes on the same infrastructure to eliminate network hops. You can build equivalent co-location on your own infrastructure using availability zone pinning and private networking.

The goal is the same regardless of approach: keep the agent and sandbox on the same network segment. For teams running coding assistants or live preview tools, co-location converts network overhead from the dominant latency contributor into a rounding error.

5. Instrument observability from day one

Set up OpenTelemetry-based tracing that follows an agent turn end-to-end. The OpenTelemetry GenAI semantic conventions define a structured span model for this. Each LLM call gets a span named chat {model}. Each sandbox invocation gets a span named execute_tool {tool_name}.

Some implementation guides depict agent traces with an "Agent Run" root span and child spans, but the semantic conventions themselves define only individual agent operation spans (such as create_agent and invoke_agent) without mandating a root hierarchy.

The two most useful early metrics are p50 and p90 resume latency from standby. These tell you whether the sandbox performs as expected under real load. Log per-tool-call latency, sandbox resume times, and lifecycle state transitions. Regressions surface before users notice.
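Those percentiles can be computed from raw resume-latency samples with a nearest-rank sketch (the sample values below are illustrative):

```typescript
// Nearest-rank percentile over a set of latency samples (in milliseconds).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const resumesMs = [12, 18, 21, 23, 24, 25, 31, 45, 90, 210];
console.log(percentile(resumesMs, 50)); // p50: 24
console.log(percentile(resumesMs, 90)); // p90: 90
```

The gap between p50 and p90 is the signal: a tight spread means consistent resumes, while a long tail points at pool exhaustion or snapshot-restore stalls.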

Manually instrument sandbox tool calls. Auto-instrumentation exists in several observability ecosystems, though coverage for specialized code execution environments varies. Each execute_tool span should capture gen_ai.tool.name and gen_ai.tool.call.id. Here's what manual span creation looks like for a sandbox tool call:

```typescript
const span = tracer.startSpan("execute_tool run_python", {
  kind: SpanKind.INTERNAL,
  attributes: {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "run_python",
    "gen_ai.tool.type": "function",
    "gen_ai.tool.call.id": toolCallId,
  },
});
```

Gate sensitive attributes like gen_ai.tool.call.arguments (the submitted code) through an OpenTelemetry Collector processor before export. Configure a transform or attributes processor in the Collector pipeline to hash or strip code content from spans before they reach your backend. These conventions carry Development stability status, so pin your instrumentation library versions and plan for schema migrations.

How to validate the setup before production

Three validation checks catch the failures that show up between a working prototype and a reliable production system. Run all three before taking traffic.

Latency under concurrent load

Measure p50 and p90 resume time at 100 or more parallel sandboxes, not just sequentially. Google's SRE team recommends deliberately overloading your system to see how it degrades; skip this and the load test happens in production instead. Ramp concurrent requests beyond pool capacity and watch the p90 transition from pool-hit latency to snapshot-restore latency to cold-boot latency. The degradation curve, not the nominal p50, is the primary validation signal. A sharp cliff indicates pool exhaustion with no graceful fallback.

A gradual slope means the system degrades predictably and can be capacity-planned against. Document the curve at launch and rerun the same test after each infrastructure change. Capacity regressions caught in staging cost hours. The same regressions found in production cost incidents.
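A minimal sketch of that ramp, assuming a hypothetical `resumeSandbox` callable standing in for your platform's resume API:

```typescript
// Ramp concurrency through the given levels and record p90 resume latency at
// each one, so the degradation curve (not a single number) is what you document.
async function rampTest(
  resumeSandbox: () => Promise<void>,
  levels: number[] = [10, 50, 100, 200]
): Promise<Map<number, number>> {
  const curve = new Map<number, number>();
  for (const n of levels) {
    const latencies = await Promise.all(
      Array.from({ length: n }, async () => {
        const start = Date.now();
        await resumeSandbox();
        return Date.now() - start;
      })
    );
    latencies.sort((a, b) => a - b);
    curve.set(n, latencies[Math.ceil(0.9 * n) - 1]); // p90 at this level
  }
  return curve;
}
```

Plotting `curve` after each infrastructure change makes capacity regressions visible as a shift in where the slope steepens.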

State recovery after extended standby

Confirm the sandbox resumes with filesystem contents, environment variables, and running processes intact after extended standby. Check that unique identifiers regenerate correctly across multiple sandboxes restored from the same base snapshot, since duplicate tokens or IDs across restored instances can create cross-tenant exposure if not handled. Resume idempotency matters too. Sending repeated resume signals to an already-running sandbox should produce no errors and no state corruption.
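One way to script the filesystem part of this check, assuming a generic `exec`-style sandbox client (the method shape is illustrative, not a specific SDK):

```typescript
// Write a unique marker before standby, then verify it survives resume.
// The "force standby, wait, resume" step between the two exec calls is
// platform-specific and elided here.
async function verifyStateRecovery(sandbox: {
  exec: (cmd: string) => Promise<{ stdout: string; exitCode: number }>;
}): Promise<boolean> {
  const token = `marker-${Date.now()}`;
  await sandbox.exec(`echo ${token} > /tmp/standby-marker`);
  // ...force standby, wait out the extended idle period, resume...
  const after = await sandbox.exec("cat /tmp/standby-marker");
  return after.exitCode === 0 && after.stdout.trim() === token;
}
```

The same pattern extends to environment variables and running processes: record a fingerprint before standby, compare it after resume.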

Failure isolation across tenants

Verify that a runaway process or out-of-memory (OOM) kill in one sandbox doesn't affect neighbors. For microVM-based platforms, the hardware boundary handles this by design. To validate, inject memory exhaustion into one sandbox and confirm that memory event counters in neighboring sandboxes remain unchanged.

Test CPU isolation separately: inject a CPU-saturating process in one sandbox and verify no measurable CPU steal in neighboring workloads. These three checks validate latency, correctness, and security. Passing all three means the environment is ready for production traffic.

Build a production-ready code execution environment for autonomous agents

Getting the execution environment right determines whether agents feel like production software or a demo that happens to be online. Latency ceilings lose users in the first session, and isolation choices decide whether a single exploit reaches customer data. Both lock in at setup, so fixing them later means re-architecting under load.

Blaxel, the perpetual sandbox platform, packages these decisions into one stack. Sandboxes resume from standby in under 25 milliseconds on microVM isolation, and Agents Hosting co-locates the agent's reasoning loop with its execution environment on the same network segment. Teams skip the infrastructure build and focus on agent logic. Book a technical walkthrough at blaxel.ai/contact, or start building with free credits at app.blaxel.ai.

Frequently asked questions

What latency is acceptable for an autonomous agent's code execution environment?

Jakob Nielsen's 100-millisecond instantaneous-response threshold applies here. Agents making five to ten tool calls per turn multiply that budget quickly. For real-time use cases like coding assistants and live previews, keep per-call resume below the instantaneous-response threshold. Background or batch workloads tolerate seconds of startup without degrading user experience, since no user is waiting on a streaming response.

Can I use Docker containers instead of a microVM-based sandbox?

Containers work for the trusted code your team wrote. LLM-generated code is itself untrusted input, and containers carry escape risk for it because they share the host kernel across all tenants. MicroVMs provide hardware-enforced isolation using the same approach as AWS Lambda. The runc container escape CVEs documented through 2025 show this is a structural risk in the shared-kernel model, which is why production platforms executing agent-generated code typically rely on hypervisor-isolated runtimes.

How do I handle agent tasks that run longer than the sandbox timeout?

Two patterns cover most cases. Use asynchronous endpoints (up to ten minutes on most managed platforms) when the user waits for a final result but not a streaming response. Use batch jobs for fan-out processing, scheduled analysis, or anything the agent triggers and checks on later. Blaxel's Batch Jobs support up to 24 hours per job and scale to thousands of parallel machines.

Does each agent session need its own sandbox, or can they share?

Multi-tenant scenarios need per-user or per-session isolation to prevent state leaks and contain security incidents. Shared runtimes create both problems. Perpetual standby makes per-session sandboxes practical because each one sits dormant without active compute charges between tool calls. The sandbox suspends when idle and resumes in under 25 milliseconds on the next call, preserving full state. Isolation no longer forces a choice between security and infrastructure spend. Blaxel Agent Drive, a distributed filesystem with concurrent read/write access, allows sharing data across multiple sandboxes in real time.

What languages should my code execution environment support?

Python and TypeScript are the most common languages used with LLM-driven code generation today. Most managed sandbox platforms support both natively. Go is common for SDK-level interaction with the infrastructure itself. For Ruby, Java, or Rust workloads, plan on custom image templates for your sandbox base. Check language coverage on your target platform before committing to a specific runtime.