NVIDIA GreenBoost: Open-Source Linux Module Transparently Extends GPU VRAM Using System RAM and NVMe for LLM Inference

On March 18-19, 2026, NVIDIA published GreenBoost, an open-source Linux kernel module and CUDA userspace shim that transparently extends GPU VRAM by leveraging system DDR4 RAM and NVMe storage. The project quickly trended on Hacker News with 386+ points and active discussion.
THREE-TIER MEMORY ARCHITECTURE:
GreenBoost implements a three-tier memory hierarchy for GPU workloads:
- Tier 1 (Native VRAM): Allocations under 256MB stay in GPU VRAM at native bandwidth (e.g., 1 TB/s on RTX 4090)
- Tier 2 (System RAM): Overflow routes to system DDR4 RAM via PCIe 4.0, delivering approximately 32 GB/s bandwidth
- Tier 3 (NVMe Storage): Further overflow taps NVMe storage at approximately 1.8 GB/s
The system intercepts CUDA memory allocation calls transparently, so applications need no code modifications: inference software such as Ollama, vLLM, and llama.cpp can run models that exceed GPU VRAM without any changes to their codebases.
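To make the routing policy concrete, here is a minimal sketch of how a size-threshold tier selector could work. All names, thresholds, and capacities below are assumptions taken from the figures in this article, not from the actual GreenBoost source.

```python
# Hypothetical sketch of GreenBoost-style tier routing. Names and policy
# are assumed from the article's numbers, not the real implementation.

TIER_VRAM, TIER_SYSRAM, TIER_NVME = "vram", "sysram", "nvme"

VRAM_ALLOC_LIMIT = 256 * 1024**2  # allocations under 256 MB stay in VRAM

class TieredAllocator:
    def __init__(self, vram_free, sysram_free):
        self.vram_free = vram_free      # bytes of GPU VRAM still available
        self.sysram_free = sysram_free  # bytes of system RAM set aside for overflow

    def place(self, size):
        """Pick a tier for an intercepted cudaMalloc(size) call."""
        if size < VRAM_ALLOC_LIMIT and size <= self.vram_free:
            self.vram_free -= size
            return TIER_VRAM
        if size <= self.sysram_free:    # Tier 2: DDR4 over PCIe
            self.sysram_free -= size
            return TIER_SYSRAM
        return TIER_NVME                # Tier 3: NVMe-backed pages

# Example: RTX 4090 (24 GB VRAM) plus 128 GB of system RAM.
alloc = TieredAllocator(vram_free=24 * 1024**3, sysram_free=128 * 1024**3)
print(alloc.place(64 * 1024**2))    # small tensor -> vram
print(alloc.place(40 * 1024**3))    # 40 GB of weights -> sysram
print(alloc.place(200 * 1024**3))   # beyond remaining RAM -> nvme
```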
PERFORMANCE CHARACTERISTICS:
Byteiota tested GreenBoost and found:
- Tier 2 (System RAM) is roughly 10-30x slower than native VRAM for bandwidth-bound operations
- Tier 3 (NVMe) adds another order of magnitude of latency
- However, for LLM inference with its sequential token generation pattern, the practical impact is less severe than raw bandwidth numbers suggest
- The system is designed for cases where running the model at reduced speed is preferable to not running it at all
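The claim that sequential token generation softens the bandwidth penalty can be checked with back-of-envelope arithmetic: decode re-reads essentially the full weight set per token, so per-token time is roughly the sum of bytes-per-tier divided by tier bandwidth. The split and bandwidth figures below are illustrative, combining the article's tier numbers with an assumed 40 GB quantized 70B model.

```python
# Back-of-envelope decode-throughput estimate (illustrative numbers only).
# Sequential token generation streams roughly all weights once per token,
# so per-token latency ~= sum(bytes_in_tier / tier_bandwidth).

def tokens_per_sec(split_gb, bw_gbps):
    """split_gb: weight gigabytes per tier; bw_gbps: tier bandwidth in GB/s."""
    per_token_s = sum(size / bw for size, bw in zip(split_gb, bw_gbps))
    return 1.0 / per_token_s

# 40 GB of quantized 70B weights: 24 GB resident in VRAM (~1 TB/s),
# 16 GB spilled to system RAM over PCIe 4.0 (~32 GB/s), nothing on NVMe.
est = tokens_per_sec([24, 16, 0], [1000, 32, 1.8])
print(f"{est:.1f} tokens/s")   # ~1.9 tokens/s
```

Under these assumptions the PCIe-resident 16 GB dominates (0.5 s of the ~0.52 s per token), which is slow but usable for offline work, consistent with "reduced speed beats not running at all."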
PRACTICAL IMPACT:
GreenBoost enables several important use cases for AI agent infrastructure:
- Running 70B+ Parameter Models on Consumer GPUs: A single RTX 4090 with 24GB VRAM can now run models that require 40-80GB by spilling to 128GB of system RAM
- Local Agent Development: Developers building AI agents can test with full-size models locally instead of relying on cloud APIs, reducing costs and improving iteration speed
- Edge Deployment: Combined with NVIDIA NemoClaw (announced at GTC 2026), GreenBoost enables running larger OpenClaw agent models on edge hardware with limited VRAM
- Cost Reduction: Organizations can defer expensive GPU upgrades by utilizing existing system RAM and fast NVMe storage
HACKER NEWS DISCUSSION HIGHLIGHTS:
The community discussion revealed several technical insights:
- AMD already supports a similar feature through TTM (Translation Table Manager) kernel parameters
- Linux kernel 6.x includes improved Unified Memory support that complements GreenBoost
- Some users reported successfully running Llama 3 70B on a single RTX 3090 using GreenBoost with 256GB system RAM
- Concerns were raised about memory pressure on the OS when large amounts of system RAM are allocated to GPU overflow
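For comparison with the AMD point above, system-RAM backing for GPU buffers on amdgpu is typically tuned through driver module parameters rather than a userspace shim. The fragment below is an illustrative modprobe configuration; the parameter names are upstream amdgpu/ttm module options, but the sizes are arbitrary examples, and this configures AMD's GTT mechanism, not GreenBoost.

```shell
# /etc/modprobe.d/amdgpu-gtt.conf (example values, AMD GTT, not GreenBoost)
# Let amdgpu back GPU buffers with a larger slice of system RAM (size in MiB).
options amdgpu gttsize=65536
# Cap the number of pages TTM may allocate (~64 GiB at 4 KiB pages),
# addressing the same OS memory-pressure concern raised in the thread.
options ttm pages_limit=16777216
```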
GreenBoost is available on the NVIDIA Developer Forums and GitHub, with installation instructions for Ubuntu 22.04+ and kernel 5.15+. It supports CUDA 12.0+ and all RTX 30/40/50 series GPUs.
This release aligns with NVIDIA's broader GTC 2026 strategy of democratizing AI inference: making it possible to run increasingly large models on existing hardware rather than requiring continuous GPU upgrades.