LLM Quantization Review

This blog post provides an overview of the fundamental concepts of quantization, as well as a review of mainstream quantization methods in the context of LLMs.

October 2, 2023 · 11 min · Sherlock

Spin to Win: The Power of Rotation in LLM Quantization

An introduction to rotation techniques in LLM quantization and related optimization schemes.

December 2, 2025 · 1 min · Sherlock

MXFP4 and NVFP4

An introduction to the differences between MXFP4 and NVFP4.

August 28, 2025 · 2 min · Sherlock

W4A8KV4 Quantization Summary and Best Practices

Comprehensive summary of W4A8KV4 quantization techniques, covering KV4 and W4A8 optimization methods with practical recommendations.

August 30, 2024 · 6 min · Sherlock

Low-Bit MoE Quantization for Large Language Models

Comprehensive guide to quantizing large MoE models like DeepSeek-V3/R1, covering techniques for efficient memory usage and inference optimization.

July 25, 2024 · 3 min · Sherlock

Speculative Sampling for Faster LLM Inference

Deep dive into speculative sampling techniques for accelerating LLM inference through draft model prediction and rejection sampling.

June 20, 2024 · 3 min · Sherlock

DeepSeek-v2 In a Nutshell - Multi-Head Latent Attention

Understanding DeepSeek-v2’s MLA architecture and its solutions to KV cache memory challenges in long context LLM inference.

May 15, 2024 · 2 min · Sherlock

LayerNorm Mathematical Derivation and Implementation

Comprehensive mathematical derivation of LayerNorm forward and backward passes, including PyTorch implementation details.

April 10, 2024 · 4 min · Sherlock

From Softmax to FlashAttention

Deep dive into the mathematical foundations of flash attention, from softmax fundamentals to efficient kernel implementation.

March 20, 2024 · 8 min · Sherlock

8-bit KV Cache

This blog post introduces KV cache quantization in LLM inference.

January 24, 2024 · 8 min · Sherlock