Nemotron 3 Nano 30B - NVFP4 W4A16 Quantized

By Mutaz Al Awamleh | ELK-AI

Tags: ELK-AI · NVFP4 · Blackwell · CUDA

60GB → 18GB | 72% Memory Reduction | <0.3% Accuracy Loss | 14.5x Faster


Model Description

This is the NVFP4 W4A16 quantized version of nvidia/Nemotron-3-Nano-30B-v1, optimized by Mutaz Al Awamleh at ELK-AI for maximum inference performance on NVIDIA Blackwell GPUs.

Quantization Details

Attribute             Value
--------------------  -----------------------------
Original Model        nvidia/Nemotron-3-Nano-30B-v1
Quantization Method   NVFP4 W4A16 (FP4 E2M1)
Algorithm             AWQ with block size 32
Calibration Dataset   open_code_reasoning
Calibration Samples   1024
Original Size         60 GB (BF16)
Quantized Size        18 GB (NVFP4)
Memory Reduction      72%
Accuracy Loss         <0.3%
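
The quantization can be reproduced in outline with NVIDIA's TensorRT Model Optimizer (modelopt), which produces checkpoints that vLLM loads with --quantization modelopt_fp4. The sketch below is illustrative, not the exact ELK-AI pipeline: it assumes the nvidia-modelopt, transformers, and datasets packages, uses modelopt's stock NVFP4_DEFAULT_CFG in place of the production W4A16 AWQ recipe (block size 32), and substitutes a generic text corpus for the open_code_reasoning calibration set.

# Illustrative NVFP4 post-training quantization sketch (not the exact ELK-AI recipe).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

MODEL_ID = "nvidia/Nemotron-3-Nano-30B-v1"
NUM_CALIB_SAMPLES = 1024   # matches the calibration sample count above
MAX_SEQ_LEN = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Placeholder calibration data; the published checkpoint was calibrated on open_code_reasoning.
calib_texts = [
    s["text"]
    for s in load_dataset("wikitext", "wikitext-2-raw-v1", split="train").select(range(NUM_CALIB_SAMPLES))
    if s["text"].strip()
]

def forward_loop(m):
    # Run calibration samples through the model so modelopt can collect scaling statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN).to(m.device)
        with torch.no_grad():
            m(**inputs)

# NVFP4_DEFAULT_CFG is modelopt's stock NVFP4 config; the production pipeline uses a
# weight-only (W4A16) AWQ recipe with block size 32 instead.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export to a Hugging Face-style checkpoint loadable with --quantization modelopt_fp4.
# (export_hf_checkpoint is modelopt's HF export utility; the exact call may vary by version.)
from modelopt.torch.export import export_hf_checkpoint
export_hf_checkpoint(model, export_dir="./nemotron3-nano-nvfp4-w4a16")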

Performance

┌─────────────────────────────────────────────────────────────┐
│         NEMOTRON 3 NANO 30B - ELK-AI NVFP4 BENCHMARK        │
│              Tested on DGX-Spark GB10 (SM121)               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Configuration         │ Speed      │ Memory   │ Context   │
│  ──────────────────────┼────────────┼──────────┼────────── │
│  BF16 (baseline)       │ 4.8 tok/s  │ 60 GB    │ 16K       │
│  BF16 + CUDA Graphs    │ 28.4 tok/s │ 60 GB    │ 16K       │
│  NVFP4 + FP8 KV Cache  │ 70+ tok/s  │ 18 GB    │ 64K+      │
│                                                             │
│  SPEEDUP: 14.5x FASTER | MEMORY: 72% SMALLER               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Start

Using vLLM (Recommended)

# Pull ELK-AI optimized container
docker pull mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0

# Run inference
docker run --gpus all --ipc=host \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -v /path/to/this/model:/model \
  -p 8000:8000 \
  mutazai/vllm-spark-blackwell-nvfp4-optimized:2.5.0 \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --trust-remote-code \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8

Using Pre-Loaded Container (Zero Config)

# Just run - model is pre-loaded!
docker run --gpus all -p 8000:8000 \
  elkaioptimization/vllm-nvfp4-cuda-13:nemotron3-30b-nvfp4-1.0

Test the API

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Explain quantum computing simply."}],
    "max_tokens": 200
  }'
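
The same endpoint is OpenAI-compatible, so the official openai Python client works as a drop-in replacement for curl. The sketch below is a minimal example (assuming pip install openai and the server started as above); streaming the response also gives a rough decode-rate estimate you can compare against the benchmark table.

# Minimal Python client sketch for the local vLLM server started above.
# Assumes: `pip install openai`; server on localhost:8000 serving the model as "/model".
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
pieces = 0
stream = client.chat.completions.create(
    model="/model",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
    if delta:
        pieces += 1

elapsed = time.perf_counter() - start
# Each streamed chunk is roughly one token, so this is only a ballpark figure.
print(f"\n~{pieces / elapsed:.1f} chunks/s (rough tokens/s)")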

ELK-AI Optimization Stack

This model achieves a 14.5x speedup through our 7-layer optimization stack:

┌─────────────────────────────────────────────────────────────────────┐
│                    ELK-AI OPTIMIZATION LAYERS                       │
│                  by Mutaz Al Awamleh | ELK-AI                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 7: CUDA GRAPHS          ████████████████  +40% speed  │   │
│  │ Pre-compiled execution graphs, zero kernel launch overhead  │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 6: V1 ENGINE            ███████████████   +35% speed  │   │
│  │ vLLM's latest architecture with optimized scheduling        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 5: FLASHINFER SM121     ██████████████    +30% speed  │   │
│  │ NVIDIA FlashInfer 0.5.1.nv25.11 CUTLASS FP4 kernels         │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 4: NVFP4 MoE CUTLASS    █████████████     +25% speed  │   │
│  │ FlashInfer CUTLASS FP4 for MoE layers (ReLU² support)       │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 3: NVFP4 GEMM           ████████████      +20% speed  │   │
│  │ FP4 E2M1 matrix multiplication with AWQ quantization        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 2: FP8 KV CACHE         ███████████       +15% speed  │   │
│  │ 50% KV cache memory reduction for longer contexts           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                              ▼                                      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ LAYER 1: NVFP4 W4A16          ██████████        72% smaller │   │
│  │ 60GB → 18GB model size, <0.3% accuracy loss                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  RESULT: 4.8 tok/s → 70+ tok/s | 14.5x SPEEDUP                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Supported Hardware

Hardware          SM Version   Memory    Performance
----------------  -----------  --------  ----------------
DGX-Spark GB10    SM121        128 GB    Primary target
GB100             SM121        192 GB    Excellent
GB200 NVL         SM121        384 GB    Maximum scale

Note: This model is optimized for Blackwell GPUs (SM121). For H100/A100, consider the BF16 version with our multi-arch container.
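
If you are unsure what your GPU reports, a quick PyTorch check of the compute capability (SM121 corresponds to capability 12.1) looks like the sketch below; the threshold is based on the table above and is not an exhaustive compatibility check.

# Quick check of whether the local GPU reports the SM121 capability these kernels target.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM{major}{minor}")

if (major, minor) < (12, 1):
    print("This checkpoint's FP4 kernels target SM121 (Blackwell); "
          "consider the BF16 multi-arch build for this GPU.")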


Model Architecture

Nemotron 3 Nano 30B is a hybrid Mamba-MoE architecture:

  • Hybrid Layers: Combines Mamba SSM with MoE transformers
  • MoE Configuration: Mixture of Experts with ReLU² activation
  • Parameters: 30B total, ~8B active per token
  • Context Length: 128K tokens supported

Required Environment Variables

For optimal performance with NVFP4 MoE layers:

VLLM_USE_V1=1
VLLM_ATTENTION_BACKEND=FLASHINFER
VLLM_CUDA_GRAPH_MODE=full_and_piecewise
VLLM_USE_FLASHINFER_MOE_FP4=1
VLLM_FLASHINFER_MOE_BACKEND=throughput

Important: The VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput variables are required for non-gated activations (ReLU²) in NVFP4 MoE models.
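
For offline (non-server) use, the same settings can be applied from Python by setting the variables before vLLM is imported, since they are read when the engine initializes. A minimal sketch, assuming the vllm build from the ELK-AI container and the quantized checkpoint downloaded to ./model (the path is a placeholder):

# Minimal offline-inference sketch using the recommended environment variables.
# The variables must be set before vllm is imported.
import os

os.environ.update({
    "VLLM_USE_V1": "1",
    "VLLM_ATTENTION_BACKEND": "FLASHINFER",
    "VLLM_CUDA_GRAPH_MODE": "full_and_piecewise",
    "VLLM_USE_FLASHINFER_MOE_FP4": "1",          # required for non-gated (ReLU^2) NVFP4 MoE
    "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
})

from vllm import LLM, SamplingParams

# "./model" is a placeholder for wherever the quantized checkpoint lives locally.
llm = LLM(
    model="./model",
    trust_remote_code=True,
    quantization="modelopt_fp4",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["Explain quantum computing simply."],
    SamplingParams(max_tokens=200, temperature=0.7),
)
print(outputs[0].outputs[0].text)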


ELK-AI Docker Ecosystem

Repository                                      Purpose
----------------------------------------------  -------------------------
elkaioptimization/vllm-nvfp4-cuda-13            Pre-loaded NVFP4 models
mutazai/vllm-spark-blackwell-nvfp4-optimized    Blackwell inference base
mutazai/nvfp4-cuda13-sota-quantization          Quantization pipeline

Citation

@misc{nemotron3-nvfp4-elkai,
  author = {Al Awamleh, Mutaz},
  title = {Nemotron 3 Nano 30B NVFP4 W4A16 - ELK-AI Optimized},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mutazai/nemotron3-nano-nvfp4-w4a16}}
}

About ELK-AI

ELK-AI specializes in enterprise AI optimization, delivering production-ready LLM solutions with state-of-the-art performance.


License

This model inherits the NVIDIA Open Model License from the base model.


Quantized with care by Mutaz Al Awamleh | ELK-AI

14.5x faster inference. 72% smaller. Production ready.
