

LLM Quantization Review
This blog post provides an overview of the fundamental concepts of quantization, as well as a review of mainstream quantization methods in the context of LLMs.
Spin to Win: The Power of Rotation in LLM Quantization
An introduction to rotation techniques in LLM quantization and related optimization schemes.
MXFP4 and NVFP4
An overview of the differences between MXFP4 and NVFP4.
W4A8KV4 Quantization Summary and Best Practices
Comprehensive summary of W4A8KV4 quantization techniques, covering KV4 and W4A8 optimization methods with practical recommendations.
Low-Bit MoE Quantization for Large Language Models
Comprehensive guide to quantizing large MoE models like DeepSeek-V3/R1, covering techniques for efficient memory usage and inference optimization.
Speculative Sampling for Faster LLM Inference
Deep dive into speculative sampling techniques for accelerating LLM inference through draft model prediction and rejection sampling.
DeepSeek-v2 In a Nutshell - Multi-Head Latent Attention
Understanding DeepSeek-v2’s MLA architecture and its solutions to KV cache memory challenges in long-context LLM inference.
LayerNorm Mathematical Derivation and Implementation
Comprehensive mathematical derivation of LayerNorm forward and backward passes, including PyTorch implementation details.
From Softmax to FlashAttention
Deep dive into the mathematical foundations of flash attention, from softmax fundamentals to efficient kernel implementation.
Fast Hadamard Transform
A comprehensive guide to the Fast Hadamard Transform, its mathematical foundations, and practical implementation with code examples.