# vLLM for NVIDIA DGX Spark / GB10 Blackwell (SM121)

A vLLM container for the DGX Spark, up to 2.5x faster than llama.cpp.

*Keywords: vLLM DGX Spark, NVIDIA GB10, Blackwell GPU, SM121, ARM64 LLM, CUDA 13.1, Qwen3 inference*
## Performance Results
| Model | llama.cpp | vLLM (this) | Improvement |
|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 21 tok/s | 44 tok/s | 2.1x faster |
| Qwen3-Next-80B-A3B-FP8 | 18 tok/s | 45 tok/s | 2.5x faster |
## Quick Start

### Option 1: Docker Hub (Recommended)
Pull and run directly from Docker Hub:
```bash
# Pull the image (7.3GB compressed)
docker pull hellohal2064/vllm-dgx-spark-gb10:latest

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
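Once the container is up, it serves vLLM's OpenAI-compatible API on port 8000. A minimal Python sketch of a chat request follows; the model name is a placeholder (query `/v1/models` on your running server for the actual served name):

```python
import json
import urllib.request

# Chat-completion payload for the OpenAI-compatible endpoint.
# "/models/your-model" is a placeholder; vLLM reports the served
# model name at http://localhost:8000/v1/models.
payload = {
    "model": "/models/your-model",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the container is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```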
### Option 2: HuggingFace Download
Download the pre-built image tarball from this repository:
```bash
# Download image tarball from this repo
wget https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10/resolve/main/docker/vllm-gb10-latest.tar.gz

# Load the image
gunzip -c vllm-gb10-latest.tar.gz | docker load

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  vllm-gb10:latest
```
### Option 3: Build from Source
Clone this repository and build locally:
```bash
# Clone and build
git clone https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10
cd vllm-dgx-spark-gb10
docker build -t vllm-gb10:latest .
```
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | (required) | Path to the model inside the container |
| `MAX_MODEL_LEN` | `32768` | Maximum context window in tokens (see below) |
| `GPU_MEMORY_UTIL` | `0.90` | Fraction of GPU memory to use (0.0-1.0) |
| `ATTENTION_BACKEND` | `TRITON_ATTN` | Attention implementation |
| `TENSOR_PARALLEL` | `1` | Tensor parallelism degree |
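The `entrypoint.sh` script presumably translates these variables into server options. A Python sketch of that mapping, not taken from the repo's actual script; the flag names are vLLM's standard `vllm serve` CLI options:

```python
import os

def build_serve_command(env: dict) -> list:
    """Map the container's environment variables onto `vllm serve` flags.
    A sketch of what entrypoint.sh presumably does; flag names are
    vLLM's standard CLI options, not taken from this repo."""
    model = env["MODEL_PATH"]  # required, no default
    return [
        "vllm", "serve", model,
        "--max-model-len", env.get("MAX_MODEL_LEN", "32768"),
        "--gpu-memory-utilization", env.get("GPU_MEMORY_UTIL", "0.90"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL", "1"),
        # ATTENTION_BACKEND maps to vLLM's VLLM_ATTENTION_BACKEND
        # environment variable rather than a CLI flag.
    ]

cmd = build_serve_command({"MODEL_PATH": "/models/your-model"})
```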
## Context Window Configuration

The `MAX_MODEL_LEN` environment variable sets the maximum context window in tokens. The DGX Spark's 119GB of unified memory allows very large context windows:
| Context Size | MAX_MODEL_LEN | Approx. Words | Memory Impact |
|---|---|---|---|
| 8K tokens | 8192 | ~6,000 | Minimal |
| 32K tokens | 32768 | ~24,000 | Low |
| 64K tokens | 65536 | ~50,000 | Moderate |
| 128K tokens | 131072 | ~100,000 | High |
| 256K tokens | 262144 | ~200,000 | Very High |
Example with a 256K context window:

```bash
docker run --rm --gpus all \
  -v /data/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.95 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
Trade-offs:
- A larger context window reserves more GPU memory for the KV cache
- A larger context window leaves room for fewer concurrent requests
- With 119GB on the DGX Spark, 256K+ contexts are feasible even with 80B models
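A back-of-the-envelope estimate shows why such large contexts fit in 119GB. The KV cache stores one key and one value tensor per layer per token; all layer/head numbers below are illustrative assumptions, not the actual Qwen3 configurations:

```python
def kv_cache_bytes(num_tokens: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 1) -> int:
    """Rough KV-cache size: a key and a value vector per layer per
    token. bytes_per_elem=1 corresponds to an FP8 cache."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Illustrative numbers (NOT the real Qwen3 config): 48 layers,
# 4 KV heads of dim 128, FP8 cache, a full 262144-token sequence.
gib = kv_cache_bytes(262_144, 48, 4, 128) / 2**30
# -> 12 GiB for one 256K-token sequence, comfortably within 119GB
```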
## Technical Details

### Build Environment
- Base: `nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04`
- PyTorch: `2.11.0.dev20260103+cu130` (nightly with SM121 support)
- vLLM: `0.1.dev1+g0d4044edd` (patched for GB10)
- Architecture: ARM64 (aarch64)
### Hardware Compatibility
- Primary Target: NVIDIA DGX Spark with GB10 Blackwell GPU
- CUDA Compute: SM121 (Blackwell)
- Memory: Supports unified memory architecture (119GB on DGX Spark)
### Key Optimizations
- Native SM121 CUDA kernels (not emulated)
- `TRITON_ATTN` backend (FlashInfer doesn't support SM121 yet)
- MoE expert configurations for GB10
- CPU parallelization (`OMP_NUM_THREADS=20`)
- Full CUDA graph optimization (no `--enforce-eager`)
- Chunked prefill enabled for long-context efficiency
## Repository Contents

- `Dockerfile` - Build instructions
- `docker-compose.yml` - Production deployment config
- `entrypoint.sh` - Flexible startup script
- `docker/vllm-gb10-latest.tar.gz` - Pre-built image (7.24GB compressed)
## Tested Models
| Model | Parameters | Performance | Memory | Max Context Tested |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 30B (3B active) | 44 tok/s | ~27GB | 128K |
| Qwen3-Next-80B-A3B-FP8 | 80B (8B active) | 45 tok/s | ~42GB | 256K |
## Contributing
Issues and PRs welcome! This is a community effort to bring vLLM to the DGX Spark platform.
## License
Apache 2.0