# Nemotron 3 Nano 30B - NVFP4 W4A16 Quantized

By Mutaz Al Awamleh | ELK-AI

60GB → 18GB | 70% Memory Reduction | <0.3% Accuracy Loss | 14.5x Faster
## Model Description
This is the NVFP4 W4A16 quantized version of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, optimized by Mutaz Al Awamleh at ELK-AI for maximum inference performance on NVIDIA Blackwell GPUs.
## Quantization Details
| Attribute | Value |
|---|---|
| Original Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Quantization Method | NVFP4 W4A16 (FP4 E2M1) |
| Algorithm | AWQ with block size 32 |
| Calibration Dataset | open_code_reasoning |
| Calibration Samples | 1024 |
| Original Size | 60 GB (BF16) |
| Quantized Size | 18 GB (NVFP4) |
| Memory Reduction | 70% |
| Accuracy Loss | <0.3% |
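To make the format in the table above concrete, here is a small, self-contained fake-quantization sketch of block-scaled FP4 E2M1 with block size 32. It is illustrative only: the production pipeline adds AWQ calibration, packed 4-bit storage, and hardware scale formats, none of which are shown here.

```python
import numpy as np

# The 15 representable FP4 E2M1 values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.
_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.unique(np.concatenate([-_POS, _POS]))

def fake_quantize_nvfp4(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Round a 1-D weight vector to block-scaled FP4 E2M1 and dequantize.

    Each block of `block_size` weights shares one scale (the W4 part);
    activations are untouched and stay 16-bit (the A16 part).
    """
    w = weights.reshape(-1, block_size)
    # Per-block scale maps the block's max magnitude onto E2M1's max value, 6.0.
    scale = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    # Snap every scaled weight to the nearest representable grid point.
    nearest = np.argmin(np.abs(w[..., None] / scale[..., None] - E2M1_GRID), axis=-1)
    return (E2M1_GRID[nearest] * scale).reshape(weights.shape)

w = np.random.randn(1024).astype(np.float32)
wq = fake_quantize_nvfp4(w)
print(f"mean |error|: {np.abs(w - wq).mean():.4f}")
```

With only 15 representable values per block, the per-block scale is what keeps the rounding error small, which is why block size (32 here) is a key quality knob.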
## Performance

```text
┌─────────────────────────────────────────────────────────────┐
│ NEMOTRON 3 NANO 30B - ELK-AI NVFP4 BENCHMARK │
│ Tested on DGX-Spark GB10 (SM121) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Configuration │ Speed │ Memory │ Context │
│ ──────────────────────┼────────────┼──────────┼────────── │
│ BF16 (baseline) │ 4.8 tok/s │ 60 GB │ 16K │
│ BF16 + CUDA Graphs │ 28.4 tok/s │ 60 GB │ 16K │
│ NVFP4 + FP8 KV Cache │ 70+ tok/s │ 18 GB │ 64K+ │
│ │
│        SPEEDUP: 14.5x FASTER  |  MEMORY: 70% SMALLER        │
│ │
└─────────────────────────────────────────────────────────────┘
```

## Quick Start

### Using vLLM (Recommended)
```bash
# Pull the ELK-AI optimized container
docker pull mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0

# Run inference
docker run --gpus all --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -v /path/to/this/model:/model \
  -p 8000:8000 \
  mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8
```
### Using Pre-Loaded Container (Zero Config)

```bash
# Just run - the model is pre-loaded!
docker run --gpus all -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda-13:nemotron3-30b-nvfp4-1.0
```
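Once either container is up, you can confirm the model is being served by querying the models endpoint. A quick sketch using the `requests` library (the endpoint is part of the standard OpenAI-compatible API that vLLM exposes):

```python
import requests

# The OpenAI-compatible server lists served models at /v1/models.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include "/model"
```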
### Test the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Explain quantum computing simply."}],
    "max_tokens": 200
  }'
```
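The same request from Python, using the official `openai` client pointed at the local server (a minimal sketch; the placeholder API key is arbitrary, since vLLM does not validate it unless the server was started with `--api-key`):

```python
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="/model",  # must match the --model path the server was started with
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```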
## ELK-AI Optimization Stack

This model achieves a 14.5x speedup through our 7-layer optimization stack:

```text
┌─────────────────────────────────────────────────────────────────────┐
│ ELK-AI OPTIMIZATION LAYERS │
│ by Mutaz Al Awamleh | ELK-AI │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 7: CUDA GRAPHS ████████████████ +40% speed │ │
│ │ Pre-compiled execution graphs, zero kernel launch overhead │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 6: V1 ENGINE ███████████████ +35% speed │ │
│ │ vLLM's latest architecture with optimized scheduling │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 5: FLASHINFER SM121 ██████████████ +30% speed │ │
│ │ NVIDIA FlashInfer 0.5.1.nv25.11 CUTLASS FP4 kernels │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 4: NVFP4 MoE CUTLASS █████████████ +25% speed │ │
│ │ FlashInfer CUTLASS FP4 for MoE layers (ReLU² support) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 3: NVFP4 GEMM ████████████ +20% speed │ │
│ │ FP4 E2M1 matrix multiplication with AWQ quantization │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LAYER 2: FP8 KV CACHE ███████████ +15% speed │ │
│ │ 50% KV cache memory reduction for longer contexts │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│  │ LAYER 1: NVFP4 W4A16        ██████████ 70% smaller          │   │
│ │ 60GB → 18GB model size, <0.3% accuracy loss │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ RESULT: 4.8 tok/s → 70+ tok/s | 14.5x SPEEDUP │
│ │
└─────────────────────────────────────────────────────────────────────┘
```

## Supported Hardware
| Hardware | SM Version | Memory | Performance |
|---|---|---|---|
| DGX-Spark GB10 | SM121 | 128 GB | Primary Target |
| GB100 | SM100 | 192 GB | Excellent |
| GB200 NVL | SM100 | 384 GB | Maximum Scale |
Note: This model targets Blackwell GPUs and is tuned for the DGX-Spark GB10 (SM121). For H100/A100, consider the BF16 version with our multi-arch container.
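If you are unsure which SM version your GPU reports, a quick PyTorch check (generic, not part of the ELK-AI stack) prints it:

```python
import torch

# Compute capability maps to the SM version, e.g. (12, 1) -> SM121 on DGX-Spark GB10.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM{major}{minor}")
```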
## Model Architecture
Nemotron 3 Nano 30B is a hybrid Mamba-MoE architecture:
- Hybrid Layers: Combines Mamba SSM with MoE transformers
- MoE Configuration: Mixture of Experts with ReLU² activation (see the sketch after this list)
- Parameters: 30B total, ~3B active per token
- Context Length: 128K tokens supported
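ReLU² here means squared ReLU. For reference, an illustrative one-line PyTorch definition (not the model's actual source code):

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: the non-gated MoE activation referenced above."""
    return torch.relu(x).square()
```

Because ReLU² is non-gated (a single linear projection feeds the activation, with no separate gate projection), it takes a different MoE kernel path than SiLU-gated experts, which is why the environment variables in the next section matter.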
## Required Environment Variables

For optimal performance with NVFP4 MoE layers:

```bash
VLLM_USE_V1=1
VLLM_ATTENTION_BACKEND=FLASHINFER
VLLM_CUDA_GRAPH_MODE=full_and_piecewise
VLLM_USE_FLASHINFER_MOE_FP4=1
VLLM_FLASHINFER_MOE_BACKEND=throughput
```
Important: The `VLLM_USE_FLASHINFER_MOE_FP4=1` and `VLLM_FLASHINFER_MOE_BACKEND=throughput` variables are required for non-gated activations (ReLU²) in NVFP4 MoE models.
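For offline (non-server) use, the same flags can be set from Python before vLLM is imported. A minimal sketch, assuming the model weights are at `/model` and a vLLM build with FlashInfer FP4 support:

```python
import os

# These must be set before vLLM is imported so the engine picks them up.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "1"
os.environ["VLLM_FLASHINFER_MOE_BACKEND"] = "throughput"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/model",
    trust_remote_code=True,
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
)
params = SamplingParams(max_tokens=200)
print(llm.generate(["Explain quantum computing simply."], params)[0].outputs[0].text)
```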
## ELK-AI Docker Ecosystem
| Repository | Purpose |
|---|---|
| elkaioptimization/vllm-nvfp4-cuda-13 | Pre-loaded NVFP4 models |
| mutazai/vllm-spark-blackwell-nvfp4-optimized | Blackwell inference base |
| mutazai/nvfp4-cuda13-sota-quantization | Quantization pipeline |
## Citation

```bibtex
@misc{nemotron3-nvfp4-elkai,
  author       = {Al Awamleh, Mutaz},
  title        = {Nemotron 3 Nano 30B NVFP4 W4A16 - ELK-AI Optimized},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/cybermotaz/nemotron3-nano-nvfp4-w4a16}}
}
```
## About ELK-AI
ELK-AI specializes in enterprise AI optimization, delivering production-ready LLM solutions with state-of-the-art performance.
- Website: https://elkai.ai
- Author: Mutaz Al Awamleh
- Email: [email protected]
- Docker Hub: mutazai | elkaioptimization
## License
This model inherits the NVIDIA Open Model License from the base model.
Quantized with care by Mutaz Al Awamleh | ELK-AI
14.5x faster inference. 70% smaller. Production ready.