# vLLM for NVIDIA DGX Spark / GB10 Blackwell (SM121)

A vLLM container for the DGX Spark, up to 2.5x faster than llama.cpp.

*Keywords: vLLM DGX Spark, NVIDIA GB10, Blackwell GPU, SM121, ARM64 LLM, CUDA 13.1, Qwen3 inference*
## Performance Results
| Model | llama.cpp | vLLM (this) | Improvement |
|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 21 tok/s | 44 tok/s | 2.1x faster |
| Qwen3-Next-80B-A3B-FP8 | 18 tok/s | 45 tok/s | 2.5x faster |
## Quick Start

### Option 1: Docker Hub (Recommended)
Pull and run directly from Docker Hub:
```bash
# Pull the image (7.3GB compressed)
docker pull hellohal2064/vllm-dgx-spark-gb10:latest

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
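Once the container is up, it serves vLLM's OpenAI-compatible API on port 8000. A minimal Python sketch of a chat request follows; the model name is a placeholder (query `/v1/models` on your running server for the actual served name):

```python
import json
import urllib.request

# Chat-completion payload for the OpenAI-compatible endpoint.
# "/models/your-model" is a placeholder; vLLM reports the served
# model name at http://localhost:8000/v1/models.
payload = {
    "model": "/models/your-model",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the container is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```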
### Option 2: HuggingFace Download
Download the pre-built image tarball from this repository:
```bash
# Download image tarball from this repo
wget https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10/resolve/main/docker/vllm-gb10-latest.tar.gz

# Load the image
gunzip -c vllm-gb10-latest.tar.gz | docker load

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  vllm-gb10:latest
```
### Option 3: Build from Source
Clone this repository and build locally:
```bash
# Clone and build
git clone https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10
cd vllm-dgx-spark-gb10
docker build -t vllm-gb10:latest .
```
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | (required) | Path to the model inside the container |
| `MAX_MODEL_LEN` | `32768` | Maximum context window in tokens (see below) |
| `GPU_MEMORY_UTIL` | `0.90` | Fraction of GPU memory to use (0.0-1.0) |
| `ATTENTION_BACKEND` | `TRITON_ATTN` | Attention implementation |
| `TENSOR_PARALLEL` | `1` | Tensor parallelism degree |
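The `entrypoint.sh` script presumably translates these variables into server options. A Python sketch of that mapping, not taken from the repo's actual script; the flag names are vLLM's standard `vllm serve` CLI options:

```python
import os

def build_serve_command(env: dict) -> list:
    """Map the container's environment variables onto `vllm serve` flags.
    A sketch of what entrypoint.sh presumably does; flag names are
    vLLM's standard CLI options, not taken from this repo."""
    model = env["MODEL_PATH"]  # required, no default
    return [
        "vllm", "serve", model,
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEMORY_UTIL", "0.90"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "1"),
        # ATTENTION_BACKEND maps to vLLM's VLLM_ATTENTION_BACKEND
        # environment variable rather than a CLI flag.
    ]

cmd = build_serve_command({"MODEL_PATH": "/models/your-model"})
```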
## Context Window Configuration

The `MAX_MODEL_LEN` environment variable sets the maximum context window in tokens. The DGX Spark's 119GB of unified memory allows very large context windows:
| Context Size | MAX_MODEL_LEN | Approx. Words | Memory Impact |
|---|---|---|---|
| 8K tokens | 8192 | ~6,000 | Minimal |
| 32K tokens | 32768 | ~24,000 | Low |
| 64K tokens | 65536 | ~50,000 | Moderate |
| 128K tokens | 131072 | ~100,000 | High |
| 256K tokens | 262144 | ~200,000 | Very High |
Example with a 256K context window:

```bash
docker run --rm --gpus all \
  -v /data/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.95 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
Trade-offs:
- A larger context window reserves more GPU memory for the KV cache
- A larger context window leaves room for fewer concurrent requests
- With 119GB on the DGX Spark, 256K+ contexts are feasible even with 80B models
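A back-of-the-envelope estimate shows why such large contexts fit in 119GB. The KV cache stores one key and one value tensor per layer per token; all layer/head numbers below are illustrative assumptions, not the actual Qwen3 configurations:

```python
def kv_cache_bytes(num_tokens: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 1) -> int:
    """Rough KV-cache size: a key and a value vector per layer per
    token. bytes_per_elem=1 corresponds to an FP8 cache."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Illustrative numbers (NOT the real Qwen3 config): 48 layers,
# 4 KV heads of dim 128, FP8 cache, a full 262144-token sequence.
gib = kv_cache_bytes(262_144, 48, 4, 128) / 2**30
# -> 12 GiB for one 256K-token sequence, comfortably within 119GB
```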
## Technical Details

### Build Environment
- Base: `nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04`
- PyTorch: `2.11.0.dev20260103+cu130` (nightly with SM121 support)
- vLLM: `0.1.dev1+g0d4044edd` (patched for GB10)
- Architecture: ARM64 (aarch64)
### Hardware Compatibility
- Primary Target: NVIDIA DGX Spark with GB10 Blackwell GPU
- CUDA Compute: SM121 (Blackwell)
- Memory: Supports unified memory architecture (119GB on DGX Spark)
### Key Optimizations
- Native SM121 CUDA kernels (not emulated)
- `TRITON_ATTN` backend (FlashInfer doesn't support SM121 yet)
- MoE expert configurations for GB10
- CPU parallelization (`OMP_NUM_THREADS=20`)
- Full CUDA graph optimization (no `--enforce-eager`)
- Chunked prefill enabled for long-context efficiency
## Repository Contents

- `Dockerfile` - Build instructions
- `docker-compose.yml` - Production deployment config
- `entrypoint.sh` - Flexible startup script
- `docker/vllm-gb10-latest.tar.gz` - Pre-built image (7.24GB compressed)
## Tested Models
| Model | Parameters | Performance | Memory | Max Context Tested |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 30B (3B active) | 44 tok/s | ~27GB | 128K |
| Qwen3-Next-80B-A3B-FP8 | 80B (8B active) | 45 tok/s | ~42GB | 256K |
## Contributing
Issues and PRs welcome! This is a community effort to bring vLLM to the DGX Spark platform.
## License
Apache 2.0