Nemotron 3 Nano 30B - NVFP4 W4A16 Quantized

By Mutaz Al Awamleh | ELK-AI

Tags: ELK-AI · NVFP4 · Blackwell · CUDA

60GB → 18GB | 72% Memory Reduction | <0.3% Accuracy Loss | 14.5x Faster


Model Description

This is the NVFP4 W4A16 quantized version of nvidia/Nemotron-3-Nano-30B-v1, optimized by Mutaz Al Awamleh at ELK-AI for maximum inference performance on NVIDIA Blackwell GPUs.

Quantization Details

Attribute             Value
--------------------  -----------------------------
Original Model        nvidia/Nemotron-3-Nano-30B-v1
Quantization Method   NVFP4 W4A16 (FP4 E2M1)
Algorithm             AWQ with block size 32
Calibration Dataset   open_code_reasoning
Calibration Samples   1024
Original Size         60 GB (BF16)
Quantized Size        18 GB (NVFP4)
Memory Reduction      72%
Accuracy Loss         <0.3%
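
The quantization can be reproduced in outline with NVIDIA's TensorRT Model Optimizer (modelopt), which produces checkpoints that vLLM loads with --quantization modelopt_fp4. The sketch below is illustrative, not the exact ELK-AI pipeline: it assumes the nvidia-modelopt, transformers, and datasets packages, uses modelopt's stock NVFP4_DEFAULT_CFG in place of the production W4A16 AWQ recipe (block size 32), and substitutes a generic text corpus for the open_code_reasoning calibration set.

# Illustrative NVFP4 post-training quantization sketch (not the exact ELK-AI recipe).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

MODEL_ID = "nvidia/Nemotron-3-Nano-30B-v1"
NUM_CALIB_SAMPLES = 1024   # matches the calibration sample count above
MAX_SEQ_LEN = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Placeholder calibration data; the published checkpoint was calibrated on open_code_reasoning.
calib_texts = [
    s["text"]
    for s in load_dataset("wikitext", "wikitext-2-raw-v1", split="train").select(range(NUM_CALIB_SAMPLES))
    if s["text"].strip()
]

def forward_loop(m):
    # Run calibration samples through the model so modelopt can collect scaling statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN).to(m.device)
        with torch.no_grad():
            m(**inputs)

# NVFP4_DEFAULT_CFG is modelopt's stock NVFP4 config; the production pipeline uses a
# weight-only (W4A16) AWQ recipe with block size 32 instead.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export to a Hugging Face-style checkpoint loadable with --quantization modelopt_fp4.
# (export_hf_checkpoint is modelopt's HF export utility; the exact call may vary by version.)
from modelopt.torch.export import export_hf_checkpoint
export_hf_checkpoint(model, export_dir="./nemotron3-nano-nvfp4-w4a16")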

Performance

┌─────────────────────────────────────────────────────────────┐
│         NEMOTRON 3 NANO 30B - ELK-AI NVFP4 BENCHMARK        │
│              Tested on DGX-Spark GB10 (SM121)               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Configuration         │ Speed      │ Memory   │ Context   │
│  ──────────────────────┼────────────┼──────────┼────────── │
│  BF16 (baseline)       │ 4.8 tok/s  │ 60 GB    │ 16K       │
│  BF16 + CUDA Graphs    │ 28.4 tok/s │ 60 GB    │ 16K       │
│  NVFP4 + FP8 KV Cache  │ 70+ tok/s  │ 18 GB    │ 64K+      │
│                                                             │
│  SPEEDUP: 14.5x FASTER | MEMORY: 72% SMALLER               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Start

Using vLLM (Recommended)

# Pull ELK-AI optimized container
docker pull mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0

# Run inference
docker run --gpus all --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -v /path/to/this/model:/model \
  -p 8000:8000 \
  mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8

Using Pre-Loaded Container (Zero Config)

# Just run - model is pre-loaded!
docker run --gpus all -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda-13:nemotron3-30b-nvfp4-1.0

Test the API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Explain quantum computing simply."}],
    "max_tokens": 200
  }'
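
The same endpoint is OpenAI-compatible, so the official openai Python client works as a drop-in replacement for curl. The sketch below is a minimal example (assuming pip install openai and the server started as above); streaming the response also gives a rough decode-rate estimate you can compare against the benchmark table.

# Minimal Python client sketch for the local vLLM server started above.
# Assumes: `pip install openai`; server on localhost:8000 serving the model as "/model".
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
pieces = 0
stream = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
    if delta:
        pieces += 1

elapsed = time.perf_counter() - start
# Each streamed chunk is roughly one token, so this is only a ballpark figure.
print(f"\n~{pieces / elapsed:.1f} chunks/s (rough tokens/s)")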

ELK-AI Optimization Stack

This model achieves a 14.5x speedup through our 7-layer optimization stack:

┌─────────────────────────────────────────────────────────────────────┐
│                    ELK-AI OPTIMIZATION LAYERS                       │
│                  by Mutaz Al Awamleh | ELK-AI                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 7: CUDA GRAPHS          ████████████████  +40% speed  │   │
│  │ Pre-compiled execution graphs, zero kernel launch overhead  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 6: V1 ENGINE            ███████████████   +35% speed  │   │
│  │ vLLM's latest architecture with optimized scheduling        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 5: FLASHINFER SM121     ██████████████    +30% speed  │   │
│  │ NVIDIA FlashInfer 0.5.1.nv25.11 CUTLASS FP4 kernels         │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 4: NVFP4 MoE CUTLASS    █████████████     +25% speed  │   │
│  │ FlashInfer CUTLASS FP4 for MoE layers (ReLU² support)       │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 3: NVFP4 GEMM           ████████████      +20% speed  │   │
│  │ FP4 E2M1 matrix multiplication with AWQ quantization        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 2: FP8 KV CACHE         ███████████       +15% speed  │   │
│  │ 50% KV cache memory reduction for longer contexts           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 1: NVFP4 W4A16          ██████████        72% smaller │   │
│  │ 60GB → 18GB model size, <0.3% accuracy loss                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  RESULT: 4.8 tok/s → 70+ tok/s | 14.5x SPEEDUP                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Supported Hardware

Hardware          SM Version   Memory    Performance
----------------  -----------  --------  ----------------
DGX-Spark GB10    SM121        128 GB    Primary target
GB100             SM121        192 GB    Excellent
GB200 NVL         SM121        384 GB    Maximum scale

Note: This model is optimized for Blackwell GPUs (SM121). For H100/A100, consider the BF16 version with our multi-arch container.
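
If you are unsure what your GPU reports, a quick PyTorch check of the compute capability (SM121 corresponds to capability 12.1) looks like the sketch below; the threshold is based on the table above and is not an exhaustive compatibility check.

# Quick check of whether the local GPU reports the SM121 capability these kernels target.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM{major}{minor}")

if (major, minor) < (12, 1):
    print("This checkpoint's FP4 kernels target SM121 (Blackwell); "
          "consider the BF16 multi-arch build for this GPU.")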


Model Architecture

Nemotron 3 Nano 30B is a hybrid Mamba-MoE architecture:

  • Hybrid Layers: Combines Mamba SSM with MoE transformers
  • MoE Configuration: Mixture of Experts with ReLU² activation
  • Parameters: 30B total, ~8B active per token
  • Context Length: 128K tokens supported

Required Environment Variables

For optimal performance with NVFP4 MoE layers:

VLLM_USE_V1=1
VLLM_ATTENTION_BACKEND=FLASHINFER
VLLM_CUDA_GRAPH_MODE=full_and_piecewise
VLLM_USE_FLASHINFER_MOE_FP4=1
VLLM_FLASHINFER_MOE_BACKEND=throughput

Important: The VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput variables are required for non-gated activations (ReLU²) in NVFP4 MoE models.
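
For offline (non-server) use, the same settings can be applied from Python by setting the variables before vLLM is imported, since they are read when the engine initializes. A minimal sketch, assuming the vllm build from the ELK-AI container and the quantized checkpoint downloaded to ./model (the path is a placeholder):

# Minimal offline-inference sketch using the recommended environment variables.
# The variables must be set before vllm is imported.
import os

os.environ.update({
    "VLLM_USE_V1": "1",
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",
    "VLLM_CUDA_GRAPH_MODE": "full_and_piecewise",
    "VLLM_USE_FLASHINFER_MOE_FP4": "1",          # required for non-gated (ReLU^2) NVFP4 MoE
    "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
})

from vllm import LLM, SamplingParams

# "./model" is a placeholder for wherever the quantized checkpoint lives locally.
llm = LLM(
    model="./model",
    trust_remote_code=True,
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["Explain quantum computing simply."],
    SamplingParams(max_tokens=200, temperature=0.7),
)
print(outputs[0].outputs[0].text)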


ELK-AI Docker Ecosystem

Repository                                      Purpose
----------------------------------------------  -------------------------
elkaioptimization/vllm-nvfp4-cuda-13            Pre-loaded NVFP4 models
mutazai/vllm-spark-blackwell-nvfp4-optimized    Blackwell inference base
mutazai/nvfp4-cuda13-sota-quantization          Quantization pipeline

Citation

@misc{nemotron3-nvfp4-elkai,
  author = {Al Awamleh, Mutaz},
  title = {Nemotron 3 Nano 30B NVFP4 W4A16 - ELK-AI Optimized},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mutazai/nemotron3-nano-nvfp4-w4a16}}
}

About ELK-AI

ELK-AI specializes in enterprise AI optimization, delivering production-ready LLM solutions with state-of-the-art performance.


License

This model inherits the NVIDIA Open Model License from the base model.


Quantized with care by Mutaz Al Awamleh | ELK-AI

14.5x faster inference. 72% smaller. Production ready.
