NVIDIA Releases ProRL Agent — Decoupled Rollout-as-a-Service Infrastructure for Training Multi-Turn LLM Agents via Reinforcement Learning at Scale

NVIDIA researchers have released ProRL Agent, a scalable infrastructure designed to fundamentally change how reinforcement learning training works for multi-turn LLM agents. Published on March 27, 2026 (arXiv:2603.18815), the system introduces a "Rollout-as-a-Service" architecture that decouples I/O-intensive environment interactions from GPU-intensive policy training, addressing a critical bottleneck in current agent development pipelines.
The core problem ProRL Agent addresses is that existing RL training frameworks (SkyRL, VeRL-Tool, Agent Lightning, rLLM, GEM) embed rollout control directly within the training process. This tight coupling creates resource contention between I/O-bound rollouts (sandbox creation, tool sessions, async coordination) and GPU-bound training (forward/backward passes, gradient sync), reducing hardware efficiency and raising maintenance barriers.
ProRL Agent operates as a standalone HTTP service with a three-stage asynchronous pipeline: INIT (spin up sandbox containers, configure tools), RUN (drive multi-turn agent loop, collect trajectories), and EVAL (score results against ground truth for reward signals). Each stage runs on independent worker pools, allowing phases to overlap across different jobs.
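The staged pipeline can be sketched as three asyncio worker pools connected by queues, so that while one job is in EVAL another can already be in INIT. A minimal illustration (all names, pool sizes, and sleeps here are hypothetical stand-ins, not ProRL Agent's actual API):

```python
import asyncio

async def init_worker(in_q, out_q):
    while True:
        job = await in_q.get()
        await asyncio.sleep(0.01)          # stand-in for sandbox/tool setup
        await out_q.put(job)
        in_q.task_done()

async def run_worker(in_q, out_q):
    while True:
        job = await in_q.get()
        await asyncio.sleep(0.02)          # stand-in for the multi-turn agent loop
        await out_q.put(job)
        in_q.task_done()

async def eval_worker(in_q, results):
    while True:
        job = await in_q.get()
        await asyncio.sleep(0.01)          # stand-in for scoring vs. ground truth
        results.append(job)
        in_q.task_done()

async def main(n_jobs=8):
    init_q, run_q, eval_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = []
    # Independent worker pools per stage let phases overlap across jobs.
    workers = (
        [asyncio.create_task(init_worker(init_q, run_q)) for _ in range(2)]
        + [asyncio.create_task(run_worker(run_q, eval_q)) for _ in range(4)]
        + [asyncio.create_task(eval_worker(eval_q, results)) for _ in range(2)]
    )
    for job_id in range(n_jobs):
        init_q.put_nowait(job_id)
    await init_q.join()                    # all jobs past INIT
    await run_q.join()                     # ... past RUN
    await eval_q.join()                    # ... past EVAL
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results

print(sorted(asyncio.run(main())))
```

In a real deployment each stage would be an HTTP endpoint backed by its own pool, but the queue-per-stage structure is the same.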
Key technical innovations include: Singularity-based sandboxing for rootless HPC execution (critical on shared Slurm clusters), efficient bash execution via ptyprocess that cuts shell-command latency from 0.78s to 0.42s, direct IPython API connections that eliminate network overhead, and Unix Domain Sockets in place of TCP loopback for intra-container communication.
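The Unix Domain Socket optimization is a standard technique: processes in the same container talk over a filesystem socket, bypassing the TCP/IP stack used by 127.0.0.1 loopback. A minimal sketch (illustrative only, not ProRL Agent's code; the payloads are made up):

```python
import os
import socket
import tempfile
import threading

SOCK_PATH = os.path.join(tempfile.mkdtemp(), "ipc.sock")
ready = threading.Event()

def serve():
    # AF_UNIX binds to a path on disk rather than an IP:port pair.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(SOCK_PATH)
        srv.listen(1)
        ready.set()                        # signal that connect() is safe
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)         # e.g. a tool-call request
            conn.sendall(b"ack:" + data)   # e.g. the tool's result

t = threading.Thread(target=serve, daemon=True)
t.start()
ready.wait()

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
    cli.connect(SOCK_PATH)
    cli.sendall(b"run_tool")
    reply = cli.recv(1024)
t.join()
print(reply)                               # b'ack:run_tool'
```

The client code is identical to TCP apart from the address family and path, which is why swapping loopback for UDS is a low-risk latency win inside a container.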
For training optimization, the system implements min-heap load balancing across LLM inference backends to maximize prefix cache reuse, token-in/token-out communication that prevents re-tokenization drift between inference and training, and asynchronous DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) with early termination of redundant jobs.
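Min-heap load balancing with session stickiness can be sketched as follows. This is a hypothetical illustration, assuming follow-up turns of a session are routed back to their original backend to hit its KV prefix cache while new sessions go to the least-loaded backend; the paper's exact policy may differ:

```python
import heapq

class Balancer:
    """Route sessions to inference backends via a lazy-deletion min-heap."""

    def __init__(self, backends):
        self.load = {b: 0 for b in backends}          # true in-flight counts
        self.heap = [(0, b) for b in backends]        # (count, backend) entries
        heapq.heapify(self.heap)
        self.session_backend = {}                     # sticky session routing

    def acquire(self, session_id):
        if session_id in self.session_backend:
            # Sticky: later turns reuse the backend holding the cached prefix.
            b = self.session_backend[session_id]
        else:
            # Pop until the heap entry matches the backend's current load
            # (stale entries from earlier pushes are simply discarded).
            while True:
                load, b = heapq.heappop(self.heap)
                if load == self.load[b]:
                    break
            self.session_backend[session_id] = b
        self.load[b] += 1
        heapq.heappush(self.heap, (self.load[b], b))
        return b

    def release(self, session_id):
        b = self.session_backend[session_id]
        self.load[b] -= 1
        heapq.heappush(self.heap, (self.load[b], b))

lb = Balancer(["backend-0", "backend-1"])
first = lb.acquire("sess-A")   # least-loaded backend
again = lb.acquire("sess-A")   # same backend: prefix cache can be reused
other = lb.acquire("sess-B")   # routed to the other, less-loaded backend
```

Lazy deletion keeps both `acquire` and `release` at O(log n) without needing to search the heap for stale entries.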
Results on SWE-bench Verified show consistent improvements: Qwen3-4B improved from 14.8% to 21.2%, Qwen3-8B from 9.6% to 18.0%, and Qwen3-14B from 15.4% to 23.6%. Scalability tests confirmed near-linear throughput increase as compute nodes are added.