

LLM Quantization Review
This blog post provides an overview of the fundamental concepts of quantization, as well as a review of mainstream quantization methods in the context of LLMs.
Spin to Win: The Power of Rotation in LLM Quantization
An introduction to rotation techniques in LLM quantization and related optimization schemes.
MXFP4 and NVFP4
An overview of the differences between MXFP4 and NVFP4.
W4A8KV4 Quantization Summary and Best Practices
Comprehensive summary of W4A8KV4 quantization techniques, covering KV4 and W4A8 optimization methods with practical recommendations.
Low-Bit MoE Quantization for Large Language Models
Comprehensive guide to quantizing large MoE models like DeepSeek-V3/R1, covering techniques for efficient memory usage and inference optimization.
Speculative Sampling for Faster LLM Inference
Deep dive into speculative sampling techniques for accelerating LLM inference through draft model prediction and rejection sampling.
DeepSeek-v2 In a Nutshell - Multi-Head Latent Attention
Understanding DeepSeek-v2’s MLA architecture and its solutions to KV cache memory challenges in long-context LLM inference.
LayerNorm Mathematical Derivation and Implementation
Comprehensive mathematical derivation of LayerNorm forward and backward passes, including PyTorch implementation details.
From Softmax to FlashAttention
Deep dive into the mathematical foundations of flash attention, from softmax fundamentals to efficient kernel implementation.
Fast Hadamard Transform
A comprehensive guide to the Fast Hadamard Transform, its mathematical foundations, and practical implementation with code examples.