Posts
All the articles I've posted.
W4A8KV4 Quantization Summary and Best Practices
Posted on: August 30, 2024
Comprehensive summary of W4A8KV4 quantization techniques, covering KV4 and W4A8 optimization methods with practical recommendations.
Low-Bit MoE Quantization for Large Language Models
Posted on: July 25, 2024
Comprehensive guide to quantizing large MoE models such as DeepSeek-V3/R1, covering techniques for efficient memory usage and inference optimization.
Speculative Sampling for Faster LLM Inference
Posted on: June 20, 2024
Deep dive into speculative sampling techniques for accelerating LLM inference through draft-model prediction and rejection sampling.
DeepSeek-v2 In a Nutshell - Multi-Head Latent Attention
Posted on: May 15, 2024
Understanding DeepSeek-v2's MLA architecture and how it addresses KV cache memory challenges in long-context LLM inference.
LayerNorm Mathematical Derivation and Implementation
Posted on: April 10, 2024
Comprehensive mathematical derivation of the LayerNorm forward and backward passes, including PyTorch implementation details.