<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Quantization Formats &amp; CUDA Support Reference</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
padding: 2rem;
color: #333;
}
.container {
max-width: 1400px;
margin: 0 auto;
background: white;
border-radius: 16px;
box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
overflow: hidden;
}
header {
background: linear-gradient(135deg, #1e3c72 0%, #2a5298 100%);
color: white;
padding: 3rem 2rem;
text-align: center;
}
h1 {
font-size: 2.5rem;
margin-bottom: 0.5rem;
font-weight: 700;
}
.subtitle {
font-size: 1.1rem;
opacity: 0.9;
font-weight: 300;
}
.content {
padding: 2rem;
}
.section {
margin-bottom: 3rem;
}
h2 {
color: #1e3c72;
font-size: 1.8rem;
margin-bottom: 1.5rem;
border-bottom: 3px solid #667eea;
padding-bottom: 0.5rem;
}
.table-wrapper {
overflow-x: auto;
margin-bottom: 2rem;
border-radius: 8px;
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);
}
table {
width: 100%;
border-collapse: collapse;
font-size: 0.95rem;
}
thead {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
th {
padding: 1rem;
text-align: left;
font-weight: 600;
text-transform: uppercase;
font-size: 0.85rem;
letter-spacing: 0.5px;
}
td {
padding: 0.9rem 1rem;
border-bottom: 1px solid #e5e7eb;
}
tbody tr {
transition: background-color 0.2s;
}
tbody tr:hover {
background-color: #f3f4f6;
}
tbody tr:nth-child(even) {
background-color: #f9fafb;
}
.highlight {
background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%);
font-weight: 600;
}
.cuda-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 1.5rem;
margin-top: 1rem;
}
.cuda-card {
background: linear-gradient(135deg, #f3f4f6 0%, #e5e7eb 100%);
padding: 1.5rem;
border-radius: 8px;
border-left: 4px solid #667eea;
}
.cuda-card h3 {
color: #1e3c72;
font-size: 1.2rem;
margin-bottom: 0.5rem;
}
.cuda-card p {
color: #6b7280;
line-height: 1.6;
}
.notes-grid {
display: grid;
gap: 1rem;
margin-top: 1rem;
}
.note-item {
background: #f0f9ff;
padding: 1rem;
border-radius: 6px;
border-left: 3px solid #3b82f6;
}
.note-item strong {
color: #1e40af;
}
footer {
background: #f9fafb;
padding: 2rem;
text-align: center;
color: #6b7280;
border-top: 1px solid #e5e7eb;
}
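/* Styling for the example code blocks below */
pre.code-block {
background: #1f2937;
color: #e5e7eb;
padding: 1rem 1.25rem;
border-radius: 8px;
overflow-x: auto;
font-size: 0.85rem;
line-height: 1.6;
margin-top: 1rem;
}
.code-intro {
color: #6b7280;
line-height: 1.6;
margin-top: 1.5rem;
}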
@media (max-width: 768px) {
body {
padding: 1rem;
}
h1 {
font-size: 1.8rem;
}
.content {
padding: 1rem;
}
table {
font-size: 0.85rem;
}
th, td {
padding: 0.7rem 0.5rem;
}
}
</style>
</head>
<body>
<div class="container">
<header>
<h1>🚀 Quantization Formats &amp; CUDA Support</h1>
<p class="subtitle">Complete reference guide for LLM quantization methods and hardware requirements</p>
</header>
<div class="content">
<div class="section">
<h2>📊 Quantization Formats</h2>
<div class="table-wrapper">
<table>
<thead>
<tr>
<th>Format</th>
<th>Bits</th>
<th>Min CUDA</th>
<th>GPU Examples</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>FP16</strong></td>
<td>16</td>
<td>5.3+</td>
<td>GTX 1000, RTX 2000+</td>
<td>Native half precision</td>
</tr>
<tr>
<td><strong>BF16</strong></td>
<td>16</td>
<td>8.0+</td>
<td>A100, RTX 3090, 4090</td>
<td>Better range than FP16</td>
</tr>
<tr class="highlight">
<td><strong>FP8 (E4M3/E5M2)</strong></td>
<td>8</td>
<td>8.9+</td>
<td>H100, H200, L40S</td>
<td>Transformer Engine support</td>
</tr>
<tr class="highlight">
<td><strong>MXFP8</strong></td>
<td>8</td>
<td>8.9+</td>
<td>H100, H200, Blackwell</td>
<td>Block-size 32, E8M0 scale</td>
</tr>
<tr class="highlight">
<td><strong>FP6</strong></td>
<td>6</td>
<td>10.0+</td>
<td>GB200, B100, B200</td>
<td>Blackwell native support</td>
</tr>
<tr class="highlight">
<td><strong>MXFP6</strong></td>
<td>6</td>
<td>8.9+</td>
<td>H100+, Blackwell</td>
<td>E2M3/E3M2, block-size 32</td>
</tr>
<tr>
<td><strong>INT8</strong></td>
<td>8</td>
<td>6.1+</td>
<td>GTX 1080+, P40+</td>
<td>Wide compatibility</td>
</tr>
<tr>
<td><strong>INT4</strong></td>
<td>4</td>
<td>7.5+</td>
<td>RTX 2080+, T4</td>
<td>CUTLASS kernels</td>
</tr>
<tr class="highlight">
<td><strong>MXFP4</strong></td>
<td>4</td>
<td>9.0+</td>
<td>H100, H200, GB200</td>
<td>E2M1, block-size 32, OpenAI</td>
</tr>
<tr class="highlight">
<td><strong>NVFP4</strong></td>
<td>4</td>
<td>10.0+</td>
<td>GB200, B100, B200</td>
<td>E2M1, block-size 16, dual-scale</td>
</tr>
<tr>
<td><strong>GPTQ</strong></td>
<td>2-8</td>
<td>7.0+</td>
<td>RTX 2000+, V100+</td>
<td>Group-wise quantization</td>
</tr>
<tr>
<td><strong>AWQ</strong></td>
<td>4</td>
<td>7.5+</td>
<td>RTX 3000+, A100+</td>
<td>Activation-aware</td>
</tr>
<tr>
<td><strong>QuIP</strong></td>
<td>2-4</td>
<td>7.0+</td>
<td>RTX 2000+, V100+</td>
<td>Incoherence processing</td>
</tr>
<tr>
<td><strong>QuIP#</strong></td>
<td>2-4</td>
<td>8.0+</td>
<td>RTX 3090+, A100+</td>
<td>E8P lattice codebook</td>
</tr>
<tr>
<td><strong>GGUF/GGML</strong></td>
<td>2-8</td>
<td>6.1+</td>
<td>GTX 1060+, most GPUs</td>
<td>CPU fallback available</td>
</tr>
<tr>
<td><strong>EXL2</strong></td>
<td>2-8</td>
<td>7.5+</td>
<td>RTX 2000+, T4+</td>
<td>Variable bit-width</td>
</tr>
<tr>
<td><strong>NF4</strong></td>
<td>4</td>
<td>7.0+</td>
<td>RTX 2000+, V100+</td>
<td>QLoRA, normal float</td>
</tr>
<tr>
<td><strong>GGUF-IQ</strong></td>
<td>1-8</td>
<td>6.1+</td>
<td>GTX 1060+</td>
<td>Importance matrix</td>
</tr>
</tbody>
</table>
</div>
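<p class="code-intro">A minimal sketch (assuming PyTorch with a CUDA build) of how the minimums in the table can be checked at runtime. The <code>MIN_CC</code> dictionary is an illustrative subset transcribed from the rows above, not an official API.</p>
<pre class="code-block"><code># Hedged sketch: compare the local GPU's compute capability against
# the per-format minimums listed in the table above.
import torch

MIN_CC = {  # illustrative subset of the table rows
    "FP16": (5, 3), "BF16": (8, 0), "FP8": (8, 9), "INT8": (6, 1),
    "INT4": (7, 5), "MXFP4": (9, 0), "NVFP4": (10, 0),
}

def supported_formats(device=0):
    if not torch.cuda.is_available():
        return []
    cc = torch.cuda.get_device_capability(device)  # e.g. (8, 9) on an RTX 4090
    return [name for name, need in MIN_CC.items() if cc >= need]

print(supported_formats())
</code></pre>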
</div>
<div class="section">
<h2>🎯 CUDA Compute Capabilities</h2>
<div class="cuda-grid">
<div class="cuda-card">
<h3>6.1 - Pascal</h3>
<p>GTX 1000 series, Tesla P40 (the P100 is CC 6.0)</p>
</div>
<div class="cuda-card">
<h3>7.0 - Volta</h3>
<p>Tesla V100, Titan V</p>
</div>
<div class="cuda-card">
<h3>7.5 - Turing</h3>
<p>RTX 2000 series, T4, Quadro RTX 6000</p>
</div>
<div class="cuda-card">
<h3>8.0 - Ampere</h3>
<p>A100, A30</p>
</div>
<div class="cuda-card">
<h3>8.6 - Ampere</h3>
<p>RTX 3000 series consumer (including the RTX 3090), A40</p>
</div>
<div class="cuda-card">
<h3>8.9 - Ada Lovelace</h3>
<p>RTX 4000 series, L40S</p>
</div>
<div class="cuda-card">
<h3>9.0 - Hopper</h3>
<p>H100, H200</p>
</div>
<div class="cuda-card">
<h3>10.0 - Blackwell</h3>
<p>GB200, B100, B200</p>
</div>
</div>
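<p class="code-intro">A small, purely illustrative helper mirroring the cards above; the <code>ARCH</code> mapping is transcribed from this page, not queried from any driver API.</p>
<pre class="code-block"><code># Hedged sketch: map a (major, minor) compute capability to the
# architecture names shown in the cards above.
ARCH = {
    (6, 1): "Pascal", (7, 0): "Volta", (7, 5): "Turing",
    (8, 0): "Ampere", (8, 6): "Ampere", (8, 9): "Ada Lovelace",
    (9, 0): "Hopper", (10, 0): "Blackwell",
}

def arch_name(major, minor):
    return ARCH.get((major, minor), "unknown")

print(arch_name(8, 9))  # Ada Lovelace
</code></pre>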
</div>
<div class="section">
<h2>⚡ Performance Notes</h2>
<div class="notes-grid">
<div class="note-item">
<strong>FP8/MXFP8:</strong> Transformer Engine support; up to 2x faster than BF16 on H100-class GPUs
</div>
<div class="note-item">
<strong>NVFP4:</strong> Native on Blackwell; up to 2x faster than FP8 and ~3.5x memory reduction vs FP16
</div>
<div class="note-item">
<strong>MXFP4:</strong> Requires H100+ (CC 9.0), uses Triton kernels, OpenAI GPT-OSS format
</div>
<div class="note-item">
<strong>MXFP6:</strong> Training &amp; inference on H100+; better accuracy than MXFP4
</div>
<div class="note-item">
<strong>QuIP#:</strong> 2-4 bit with E8P lattice codebook; reaches ~50% of peak memory bandwidth on an RTX 4090
</div>
<div class="note-item">
<strong>INT4/GPTQ/AWQ:</strong> ~3-4x memory reduction and typically 1.5-2x faster inference
</div>
<div class="note-item">
<strong>GGUF:</strong> Best choice for CPU/GPU hybrid inference with partial layer offload
</div>
<div class="note-item">
<strong>EXL2:</strong> Highest quality at low bit-widths; converting a model is slower than with GPTQ
</div>
</div>
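<p class="code-intro">To make the block-scaling idea behind the MX formats concrete, here is a NumPy sketch of one block: 32 values share a single power-of-two scale (as an E8M0-style exponent would) and each element is rounded to a small FP4-like grid. The grid and scale selection are simplified illustrations, not the exact OCP MX specification.</p>
<pre class="code-block"><code># Hedged sketch: MX-style block quantization of one 32-element block.
# The E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) is illustrative.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
LEVELS = np.concatenate([-FP4_GRID[:0:-1], FP4_GRID])  # signed grid

def mx_quantize_block(block):
    """Quantize 32 floats with one shared power-of-two scale."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Power-of-two scale chosen so the block max fits on the grid.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1]))
    scaled = block / scale
    # Round every value to its nearest representable level.
    idx = np.abs(scaled[:, None] - LEVELS[None, :]).argmin(axis=1)
    return LEVELS[idx] * scale

x = np.random.randn(32).astype(np.float32)
print(np.abs(x - mx_quantize_block(x)).max())  # per-block error
</code></pre>
<p class="code-intro">NVFP4 applies the same idea with a block size of 16 and a two-level scaling scheme, per the table above.</p>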
</div>
</div>
<footer>
<p>Last updated: November 2025 | Reference for LLM quantization formats and CUDA compute capability requirements</p>
</footer>
</div>
</body>
</html>