NVIDIA L40S GPUs for MXFP4 quantization

#100
by lordim - opened

Are NVIDIA L40S GPUs also compatible with MXFP4 quantization? I'm trying to load gpt-oss-20b on this machine, but it seems to default to bf16.
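For reference, this is roughly how I'm loading it (a minimal sketch of the stock transformers path, nothing custom on my side):

```python
from transformers import AutoModelForCausalLM

# Load gpt-oss-20b with stock settings; on this L40S the weights come
# back as bf16 instead of staying in MXFP4.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)
print(model.dtype)  # torch.bfloat16 here, which is how I noticed the fallback
```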

Did it work with bf16?

@lordim MXFP4 is supported by the Blackwell architecture, so I don't think the L40S is compatible, at least not natively.
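A quick way to see what your card reports, if you want to check (just a sketch using torch's device query):

```python
import torch

# The L40S is Ada Lovelace, compute capability 8.9; the native MXFP4
# kernels target newer architectures, so this card falls back to a
# dequantized (bf16) path unless a compatible kernel like Marlin is used.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # 8.9 on an L40S
```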

hi, does bf16 work with nvidia l40s?

I tried gpt-oss-20b on an L40S and it works perfectly with MXFP4 for me now: https://devforth.io/insights/self-hosted-gpt-real-response-time-token-throughput-and-cost-on-l4-l40s-and-h100-for-gpt-oss-20b/

Not natively as on the H100, but latency and decoding speed are still very nice with Marlin. I would say the L40S is one of the good options for the concurrency / throughput / cost balance.
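In case it helps anyone reproduce, a minimal sketch of my setup (stock vLLM, no special flags; which kernel path gets picked depends on your vLLM version):

```python
from vllm import LLM, SamplingParams

# Minimal offline sketch: vLLM loads the MXFP4 checkpoint and, on the
# L40S, serves it through a Marlin-style kernel rather than native FP4.
llm = LLM(model="openai/gpt-oss-20b")
outputs = llm.generate(
    ["Explain MXFP4 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```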

What's the latency - can you share numbers?

Sure, you can take a look at the charts in the link above, it's visually easier to understand, but here are some example points. TTFT on a single request without concurrency is 2-3 seconds for sequences around 30k tokens, grows to ~10s at a context of ~70k, and reaches ~30s for the longest sequences close to the context window. Decoding speed is near 150 tokens/s for very short sequences and falls to ~100 tokens/s for the longest. Important: this uses the default max_num_batched_tokens in vLLM, which is 2048 tokens; you can bump it higher to improve TTFT further, but at the risk of OOM on long/concurrent sequences, or the model not starting at all.
For concurrent users, please see the charts.
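If you want to experiment with that knob, it's just an engine argument (a sketch; 8192 is an arbitrary example value, tune for your memory headroom):

```python
from vllm import LLM

# Sketch of raising the batching budget above the 2048-token default.
# Higher values can improve TTFT on long prompts but increase the OOM
# risk mentioned above.
llm = LLM(
    model="openai/gpt-oss-20b",
    max_num_batched_tokens=8192,
)
```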
