<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>LLM Quantization Formats &amp; CUDA Support Reference</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            min-height: 100vh;
            padding: 2rem;
            color: #333;
        }

        .container {
            max-width: 1400px;
            margin: 0 auto;
            background: white;
            border-radius: 16px;
            box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
            overflow: hidden;
        }

        header {
            background: linear-gradient(135deg, #1e3c72 0%, #2a5298 100%);
            color: white;
            padding: 3rem 2rem;
            text-align: center;
        }

        h1 {
            font-size: 2.5rem;
            margin-bottom: 0.5rem;
            font-weight: 700;
        }

        .subtitle {
            font-size: 1.1rem;
            opacity: 0.9;
            font-weight: 300;
        }

        .content {
            padding: 2rem;
        }

        .section {
            margin-bottom: 3rem;
        }

        h2 {
            color: #1e3c72;
            font-size: 1.8rem;
            margin-bottom: 1.5rem;
            border-bottom: 3px solid #667eea;
            padding-bottom: 0.5rem;
        }

        .table-wrapper {
            overflow-x: auto;
            margin-bottom: 2rem;
            border-radius: 8px;
            box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);
        }

        table {
            width: 100%;
            border-collapse: collapse;
            font-size: 0.95rem;
        }

        thead {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }

        th {
            padding: 1rem;
            text-align: left;
            font-weight: 600;
            text-transform: uppercase;
            font-size: 0.85rem;
            letter-spacing: 0.5px;
        }

        td {
            padding: 0.9rem 1rem;
            border-bottom: 1px solid #e5e7eb;
        }

        tbody tr {
            transition: background-color 0.2s;
        }

        tbody tr:hover {
            background-color: #f3f4f6;
        }

        tbody tr:nth-child(even) {
            background-color: #f9fafb;
        }

        .highlight {
            background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%);
            font-weight: 600;
        }

        .cuda-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
            gap: 1.5rem;
            margin-top: 1rem;
        }

        .cuda-card {
            background: linear-gradient(135deg, #f3f4f6 0%, #e5e7eb 100%);
            padding: 1.5rem;
            border-radius: 8px;
            border-left: 4px solid #667eea;
        }

        .cuda-card h3 {
            color: #1e3c72;
            font-size: 1.2rem;
            margin-bottom: 0.5rem;
        }

        .cuda-card p {
            color: #6b7280;
            line-height: 1.6;
        }

        .notes-grid {
            display: grid;
            gap: 1rem;
            margin-top: 1rem;
        }

        .note-item {
            background: #f0f9ff;
            padding: 1rem;
            border-radius: 6px;
            border-left: 3px solid #3b82f6;
        }

        .note-item strong {
            color: #1e40af;
        }

        footer {
            background: #f9fafb;
            padding: 2rem;
            text-align: center;
            color: #6b7280;
            border-top: 1px solid #e5e7eb;
        }

        @media (max-width: 768px) {
            body {
                padding: 1rem;
            }

            h1 {
                font-size: 1.8rem;
            }

            .content {
                padding: 1rem;
            }

            table {
                font-size: 0.85rem;
            }

            th, td {
                padding: 0.7rem 0.5rem;
            }
        }
    </style>
</head>
<body>
    <div class="container">
        <header>
            <h1>🚀 Quantization Formats &amp; CUDA Support</h1>
            <p class="subtitle">Reference guide for LLM quantization methods and their hardware requirements</p>
        </header>

        <div class="content">
            <div class="section">
                <h2>📊 Quantization Formats</h2>
                <div class="table-wrapper">
                    <table>
                        <thead>
                            <tr>
                                <th>Format</th>
                                <th>Bits</th>
                                <th>Min CC</th>
                                <th>GPU Examples</th>
                                <th>Notes</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td><strong>FP16</strong></td>
                                <td>16</td>
                                <td>5.3+</td>
                                <td>GTX 1000, RTX 2000+</td>
                                <td>Native half precision</td>
                            </tr>
                            <tr>
                                <td><strong>BF16</strong></td>
                                <td>16</td>
                                <td>8.0+</td>
                                <td>A100, RTX 3090, 4090</td>
                                <td>Better dynamic range than FP16</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>FP8 (E4M3/E5M2)</strong></td>
                                <td>8</td>
                                <td>8.9+</td>
                                <td>H100, H200, L40S</td>
                                <td>Transformer Engine support</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>MXFP8</strong></td>
                                <td>8</td>
                                <td>8.9+</td>
                                <td>H100, H200, Blackwell</td>
                                <td>Block-size 32, E8M0 scale</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>FP6</strong></td>
                                <td>6</td>
                                <td>10.0+</td>
                                <td>GB200, B100, B200</td>
                                <td>Blackwell native support</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>MXFP6</strong></td>
                                <td>6</td>
                                <td>8.9+</td>
                                <td>H100+, Blackwell</td>
                                <td>E2M3/E3M2, block-size 32</td>
                            </tr>
                            <tr>
                                <td><strong>INT8</strong></td>
                                <td>8</td>
                                <td>6.1+</td>
                                <td>GTX 1080+, Tesla P40</td>
                                <td>Wide compatibility</td>
                            </tr>
                            <tr>
                                <td><strong>INT4</strong></td>
                                <td>4</td>
                                <td>7.5+</td>
                                <td>RTX 2080+, T4</td>
                                <td>CUTLASS kernels</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>MXFP4</strong></td>
                                <td>4</td>
                                <td>9.0+</td>
                                <td>H100, H200, GB200</td>
                                <td>E2M1, block-size 32, OpenAI</td>
                            </tr>
                            <tr class="highlight">
                                <td><strong>NVFP4</strong></td>
                                <td>4</td>
                                <td>10.0+</td>
                                <td>GB200, B100, B200</td>
                                <td>E2M1, block-size 16, dual-scale</td>
                            </tr>
                            <tr>
                                <td><strong>GPTQ</strong></td>
                                <td>2-8</td>
                                <td>7.0+</td>
                                <td>RTX 2000+, V100+</td>
                                <td>Group-wise quantization</td>
                            </tr>
                            <tr>
                                <td><strong>AWQ</strong></td>
                                <td>4</td>
                                <td>7.5+</td>
                                <td>RTX 3000+, A100+</td>
                                <td>Activation-aware</td>
                            </tr>
                            <tr>
                                <td><strong>QuIP</strong></td>
                                <td>2-4</td>
                                <td>7.0+</td>
                                <td>RTX 2000+, V100+</td>
                                <td>Incoherence processing</td>
                            </tr>
                            <tr>
                                <td><strong>QuIP#</strong></td>
                                <td>2-4</td>
                                <td>8.0+</td>
                                <td>RTX 3090+, A100+</td>
                                <td>E8P lattice codebook</td>
                            </tr>
                            <tr>
                                <td><strong>GGUF/GGML</strong></td>
                                <td>2-8</td>
                                <td>6.1+</td>
                                <td>GTX 1060+, most GPUs</td>
                                <td>CPU fallback available</td>
                            </tr>
                            <tr>
                                <td><strong>EXL2</strong></td>
                                <td>2-8</td>
                                <td>7.5+</td>
                                <td>RTX 2000+, T4+</td>
                                <td>Variable bit-width</td>
                            </tr>
                            <tr>
                                <td><strong>NF4</strong></td>
                                <td>4</td>
                                <td>7.0+</td>
                                <td>RTX 2000+, V100+</td>
                                <td>QLoRA, normal float</td>
                            </tr>
                            <tr>
                                <td><strong>GGUF-IQ</strong></td>
                                <td>1-8</td>
                                <td>6.1+</td>
                                <td>GTX 1060+</td>
                                <td>Importance matrix</td>
                            </tr>
                        </tbody>
                    </table>
                </div>
            </div>
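The MX/NV 4-bit formats in the table above share one small scale per block of values rather than one scale per tensor. The sketch below is a hypothetical illustration of the idea, not any library's kernel: it quantizes a block of 32 floats to the E2M1 (4-bit float) grid with a shared power-of-two scale, in the spirit of the MXFP4 row.

```python
# Hypothetical sketch of MXFP4-style block quantization (illustration only).
# A block of 32 values shares one power-of-two scale (E8M0-style), and each
# value is rounded to the nearest representable E2M1 magnitude.
import math

# All magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4(block):
    """Quantize a block of 32 floats; returns (shared_scale, signed_codes)."""
    assert len(block) == 32, "MXFP4 uses block size 32"
    amax = max(abs(v) for v in block)
    # Power-of-two scale so the largest value lands near E2M1's max (6.0).
    exp = math.floor(math.log2(amax / 6.0)) if amax > 0 else 0
    scale = 2.0 ** exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                    # clamp to E2M1 range
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))    # nearest grid point
        codes.append(-q if v < 0 else q)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]
```

NVFP4 (per the table) differs mainly in using blocks of 16 and a finer dual-scale scheme, which is why it needs dedicated Blackwell hardware support.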

            <div class="section">
                <h2>🎯 CUDA Compute Capabilities</h2>
                <div class="cuda-grid">
                    <div class="cuda-card">
                        <h3>6.1 - Pascal</h3>
                        <p>GTX 1000 series, Tesla P40</p>
                    </div>
                    <div class="cuda-card">
                        <h3>7.0 - Volta</h3>
                        <p>Tesla V100, Titan V</p>
                    </div>
                    <div class="cuda-card">
                        <h3>7.5 - Turing</h3>
                        <p>RTX 2000 series, T4, Quadro RTX 6000</p>
                    </div>
                    <div class="cuda-card">
                        <h3>8.0 - Ampere</h3>
                        <p>A100</p>
                    </div>
                    <div class="cuda-card">
                        <h3>8.6 - Ampere</h3>
                        <p>RTX 3000 series consumer</p>
                    </div>
                    <div class="cuda-card">
                        <h3>8.9 - Ada Lovelace</h3>
                        <p>RTX 4000 series, L40S</p>
                    </div>
                    <div class="cuda-card">
                        <h3>9.0 - Hopper</h3>
                        <p>H100, H200</p>
                    </div>
                    <div class="cuda-card">
                        <h3>10.0 - Blackwell</h3>
                        <p>GB200, B100, B200</p>
                    </div>
                </div>
            </div>
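The "Min CC" column of the formats table can be turned into a simple capability gate. The helper below is an illustrative sketch using this page's thresholds (the function name and format list are my own, not from any library); on a live machine the `(major, minor)` tuple can come from `torch.cuda.get_device_capability()`.

```python
# Illustrative mapping from compute capability to the formats listed above.
# Thresholds mirror this page's "Min CC" column; the helper is hypothetical.
MIN_CC = {
    "FP16": (5, 3), "INT8": (6, 1), "GGUF": (6, 1), "GGUF-IQ": (6, 1),
    "GPTQ": (7, 0), "NF4": (7, 0), "QuIP": (7, 0),
    "INT4": (7, 5), "AWQ": (7, 5), "EXL2": (7, 5),
    "BF16": (8, 0), "QuIP#": (8, 0),
    "FP8": (8, 9), "MXFP8": (8, 9), "MXFP6": (8, 9),
    "MXFP4": (9, 0),
    "FP6": (10, 0), "NVFP4": (10, 0),
}

def supported_formats(major, minor):
    """Formats whose minimum compute capability is met by (major, minor)."""
    return sorted(f for f, cc in MIN_CC.items() if (major, minor) >= cc)

# On a CUDA machine:
#   major, minor = torch.cuda.get_device_capability()
#   print(supported_formats(major, minor))
```

For example, `supported_formats(8, 9)` (an L40S or RTX 4090) includes FP8 and MXFP8 but not MXFP4, which per the table needs CC 9.0.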

            <div class="section">
                <h2>⚡ Performance Notes</h2>
                <div class="notes-grid">
                    <div class="note-item">
                        <strong>FP8/MXFP8:</strong> Transformer Engine support; up to 2x faster than BF16 on H100+
                    </div>
                    <div class="note-item">
                        <strong>NVFP4:</strong> Native on Blackwell; up to 2x faster than FP8, ~3.5x memory reduction vs FP16
                    </div>
                    <div class="note-item">
                        <strong>MXFP4:</strong> Requires H100+ (CC 9.0); Triton kernels; used by OpenAI's GPT-OSS weights
                    </div>
                    <div class="note-item">
                        <strong>MXFP6:</strong> Training &amp; inference on H100+; better accuracy than MXFP4
                    </div>
                    <div class="note-item">
                        <strong>QuIP#:</strong> 2-4 bit with E8P lattice codebook; ~50% of peak memory bandwidth on RTX 4090
                    </div>
                    <div class="note-item">
                        <strong>INT4/GPTQ/AWQ:</strong> ~3-4x memory reduction, typically 1.5-2x faster inference
                    </div>
                    <div class="note-item">
                        <strong>GGUF:</strong> Best choice for hybrid CPU/GPU (partial-offload) inference
                    </div>
                    <div class="note-item">
                        <strong>EXL2:</strong> High quality at low bit-widths; quantization (conversion) is slower than GPTQ
                    </div>
                </div>
            </div>
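The memory-reduction figures above follow from simple arithmetic: weight bytes ≈ parameters × bits / 8, plus per-group scale overhead for group-wise schemes. The estimator below is a back-of-the-envelope sketch under stated assumptions (one FP16 scale per group of 128 weights; activations, KV cache, and zero-points ignored), not a profiler.

```python
# Back-of-the-envelope weight memory estimator. Assumes group-wise schemes
# store one FP16 scale (2 bytes) per group of weights; ignores activations,
# KV cache, and zero-points, so results are deliberately rough.
def weight_gib(n_params, bits, group_size=None):
    """Approximate weight memory in GiB for n_params at a given bit-width."""
    bytes_weights = n_params * bits / 8
    bytes_scales = (n_params / group_size) * 2 if group_size else 0
    return (bytes_weights + bytes_scales) / 2**30

fp16 = weight_gib(7e9, 16)                      # a 7B model in FP16
int4 = weight_gib(7e9, 4, group_size=128)       # GPTQ-style 4-bit, groups of 128
print(f"FP16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB, ratio {fp16 / int4:.1f}x")
```

For a 7B-parameter model this gives roughly 13 GiB at FP16 versus about 3.4 GiB at 4-bit, a ~3.9x reduction, consistent with the ~3-4x figure quoted above once scale overhead is counted.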

        </div>

        <footer>
            <p>Last updated: November 2025 | Reference for LLM quantization formats and CUDA compute capability requirements</p>
        </footer>
    </div>
</body>
</html>