vLLM for NVIDIA DGX Spark / GB10 Blackwell (SM121)

vLLM container for DGX Spark: up to 2.5x faster than llama.cpp

Keywords: vLLM DGX Spark, NVIDIA GB10, Blackwell GPU, SM121, ARM64 LLM, CUDA 13.1, Qwen3 inference

πŸ† Performance Results

| Model | llama.cpp | vLLM (this image) | Improvement |
|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 21 tok/s | 44 tok/s | 2.1x faster |
| Qwen3-Next-80B-A3B-FP8 | 18 tok/s | 45 tok/s | 2.5x faster |

πŸ“¦ Quick Start

Option 1: Docker Hub (Recommended)

Pull and run directly from Docker Hub:

```shell
# Pull the image (7.3GB compressed)
docker pull hellohal2064/vllm-dgx-spark-gb10:latest

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```
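Once the container is up, vLLM exposes an OpenAI-compatible HTTP API on port 8000. A quick smoke test with curl; the `model` field should match the served model path, i.e. the MODEL_PATH value above:

```shell
# Query the OpenAI-compatible chat endpoint started by the run command above.
# Prints an error message if the server isn't reachable yet.
BASE_URL="http://localhost:8000/v1"

curl -s "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/your-model",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }' || echo "server not reachable at $BASE_URL"
```

The response follows the standard OpenAI chat-completions JSON shape, so existing OpenAI client libraries can point at this base URL.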

Option 2: HuggingFace Download

Download the pre-built image tarball from this repository:

```shell
# Download image tarball from this repo
wget https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10/resolve/main/docker/vllm-gb10-latest.tar.gz

# Load the image
gunzip -c vllm-gb10-latest.tar.gz | docker load

# Run with your model
docker run --rm --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/your-model \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.90 \
  vllm-gb10:latest
```

Option 3: Build from Source

Clone this repository and build locally:

```shell
# Clone and build
git clone https://huggingface.co/Hellohal2064/vllm-dgx-spark-gb10
cd vllm-dgx-spark-gb10
docker build -t vllm-gb10:latest .
```
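A quick sanity check after the build. This sketch assumes the image's default `python3` can import vLLM and PyTorch; adjust the interpreter path if the image differs:

```shell
# Print the PyTorch/vLLM versions baked into the freshly built image and
# whether CUDA is visible (requires the NVIDIA container runtime).
IMAGE="vllm-gb10:latest"
docker run --rm --gpus all --entrypoint python3 "$IMAGE" -c \
  "import torch, vllm; print(torch.__version__, vllm.__version__, torch.cuda.is_available())" \
  || echo "sanity check failed: is the NVIDIA container runtime installed?"
```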

πŸ”§ Environment Variables

| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | (required) | Path to model inside container |
| MAX_MODEL_LEN | 32768 | Maximum context window in tokens (see below) |
| GPU_MEMORY_UTIL | 0.90 | GPU memory utilization (0.0-1.0) |
| ATTENTION_BACKEND | TRITON_ATTN | Attention implementation |
| TENSOR_PARALLEL | 1 | Tensor parallelism degree |

πŸ“ Context Window Configuration

The MAX_MODEL_LEN environment variable controls the maximum context window size. The DGX Spark's 119GB unified memory allows for very large context windows:

| Context Size | MAX_MODEL_LEN | Approx. Words | Memory Impact |
|---|---|---|---|
| 8K tokens | 8192 | ~6,000 | Minimal |
| 32K tokens | 32768 | ~24,000 | Low |
| 64K tokens | 65536 | ~50,000 | Moderate |
| 128K tokens | 131072 | ~100,000 | High |
| 256K tokens | 262144 | ~200,000 | Very High |

Example with 256K context window:

```shell
docker run --rm --gpus all \
  -v /data/models:/models \
  -p 8000:8000 \
  -e MODEL_PATH=/models/Qwen3-Next-80B-A3B-FP8 \
  -e MAX_MODEL_LEN=262144 \
  -e GPU_MEMORY_UTIL=0.95 \
  hellohal2064/vllm-dgx-spark-gb10:latest
```

Trade-offs:

  • Larger context = more GPU memory reserved for the KV cache
  • Larger context = fewer concurrent requests can be served
  • With 119GB of unified memory on the DGX Spark, you can run 256K+ context even with 80B-class models
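The KV-cache cost behind these trade-offs can be estimated per sequence as 2 (K and V) x layers x KV heads x head dim x bytes per element x tokens. A back-of-envelope sketch with illustrative parameters (not the real Qwen3 config; substitute the values from your model's config.json):

```shell
# Hypothetical model shape for illustration: 48 layers, 8 KV heads,
# head_dim 128, FP16 KV cache (2 bytes/element), 256K-token sequence.
LAYERS=48
KV_HEADS=8
HEAD_DIM=128
BYTES_PER_ELEM=2
TOKENS=262144

# 2x accounts for the separate K and V tensors
KV_BYTES=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * TOKENS))
KV_GIB=$((KV_BYTES / 1024 / 1024 / 1024))
echo "KV cache for $TOKENS tokens: ~$KV_GIB GiB per sequence"
# prints: KV cache for 262144 tokens: ~48 GiB per sequence
```

With these illustrative numbers, a single 256K-token sequence consumes roughly 48 GiB of KV cache, which is why long contexts dominate the memory budget.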

πŸ› οΈ Technical Details

Build Environment

  • Base: nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04
  • PyTorch: 2.11.0.dev20260103+cu130 (nightly with SM121 support)
  • vLLM: 0.1.dev1+g0d4044edd (patched for GB10)
  • Architecture: ARM64 (aarch64)

Hardware Compatibility

  • Primary Target: NVIDIA DGX Spark with GB10 Blackwell GPU
  • CUDA Compute: SM121 (Blackwell)
  • Memory: Supports unified memory architecture (119GB on DGX Spark)

Key Optimizations

  • Native SM121 CUDA kernels (not emulated)
  • TRITON_ATTN backend (FlashInfer doesn't support SM121 yet)
  • MoE expert configurations for GB10
  • CPU parallelization (OMP_NUM_THREADS=20)
  • Full CUDA graph optimization (no --enforce-eager)
  • Chunked prefill enabled for long context efficiency
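As a hypothetical sketch (the shipped entrypoint.sh is authoritative), the environment variables above might translate into a `vllm serve` invocation along these lines; the flag names are standard vLLM server options:

```shell
# Map container env vars onto vLLM server flags (defaults from the table above).
MODEL_PATH="${MODEL_PATH:-/models/your-model}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"
GPU_MEMORY_UTIL="${GPU_MEMORY_UTIL:-0.90}"
TENSOR_PARALLEL="${TENSOR_PARALLEL:-1}"
export VLLM_ATTENTION_BACKEND="${ATTENTION_BACKEND:-TRITON_ATTN}"

CMD="vllm serve $MODEL_PATH"
CMD="$CMD --max-model-len $MAX_MODEL_LEN"
CMD="$CMD --gpu-memory-utilization $GPU_MEMORY_UTIL"
CMD="$CMD --tensor-parallel-size $TENSOR_PARALLEL"
CMD="$CMD --enable-chunked-prefill"

echo "$CMD"  # inspect before running; note there is no --enforce-eager, so CUDA graphs stay on
```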

πŸ“ Repository Contents

  • Dockerfile - Build instructions
  • docker-compose.yml - Production deployment config
  • entrypoint.sh - Flexible startup script
  • docker/vllm-gb10-latest.tar.gz - Pre-built image (7.24GB compressed)

πŸ§ͺ Tested Models

| Model | Parameters | Performance | Memory | Max Context Tested |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B-FP8 | 30B (3B active) | 44 tok/s | ~27GB | 128K |
| Qwen3-Next-80B-A3B-FP8 | 80B (8B active) | 45 tok/s | ~42GB | 256K |

🀝 Contributing

Issues and PRs welcome! This is a community effort to bring vLLM to the DGX Spark platform.

πŸ“„ License

Apache 2.0
