NVIDIA L40S GPUs for MXFP4 quantization

#100
by lordim - opened

Are NVIDIA L40S GPUs also compatible with MXFP4 quantization? I'm trying to load gpt-oss-20b on this machine, but it seems to default to bf16.
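For reference, this is roughly how I'm loading it (a minimal sketch of the stock transformers path, nothing custom on my side):

```python
from transformers import AutoModelForCausalLM

# Load gpt-oss-20b with stock settings; on this L40S the weights come
# back as bf16 instead of staying in MXFP4.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)
print(model.dtype)  # torch.bfloat16 here, which is how I noticed the fallback
```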

Did it work with bf16?

@lordim MXFP4 is supported by the Blackwell architecture, so I don't think the L40S is compatible, at least not natively.
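A quick way to see what your card reports, if you want to check (just a sketch using torch's device query):

```python
import torch

# The L40S is Ada Lovelace, compute capability 8.9; the native MXFP4
# kernels target newer architectures, so this card falls back to a
# dequantized (bf16) path unless a compatible kernel like Marlin is used.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # 8.9 on an L40S
```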

hi, does bf16 work with nvidia l40s?

I tried gpt-oss-20b on an L40S and it works perfectly with MXFP4 for me now: https://devforth.io/insights/self-hosted-gpt-real-response-time-token-throughput-and-cost-on-l4-l40s-and-h100-for-gpt-oss-20b/

Not natively as on the H100, but latency and decoding speed are still very nice with Marlin. I would say the L40S is one of the good options for the concurrency / throughput / cost balance.
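In case it helps anyone reproduce, a minimal sketch of my setup (stock vLLM, no special flags; which kernel path gets picked depends on your vLLM version):

```python
from vllm import LLM, SamplingParams

# Minimal offline sketch: vLLM loads the MXFP4 checkpoint and, on the
# L40S, serves it through a Marlin-style kernel rather than native FP4.
llm = LLM(model="openai/gpt-oss-20b")
outputs = llm.generate(
    ["Explain MXFP4 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```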

What's the latency - can you share numbers?

Sure, you can take a look at the charts in the link above, it's visually easier to understand, but here are some example points. TTFT on a single request without concurrency is 2-3 seconds for sequences around 30k tokens, grows to ~10s at a context of ~70k, and reaches ~30s for the longest sequences close to the context window. Decoding speed is near 150 tokens/s for very short sequences and falls to ~100 tokens/s for the longest. Important: this uses the default max_num_batched_tokens in vLLM, which is 2048 tokens; you can bump it higher to improve TTFT further, but at the risk of OOM on long/concurrent sequences, or the model not starting at all.
For concurrent users, please see the charts.
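If you want to experiment with that knob, it's just an engine argument (a sketch; 8192 is an arbitrary example value, tune for your memory headroom):

```python
from vllm import LLM

# Sketch of raising the batching budget above the 2048-token default.
# Higher values can improve TTFT on long prompts but increase the OOM
# risk mentioned above.
llm = LLM(
    model="openai/gpt-oss-20b",
    max_num_batched_tokens=8192,
)
```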
