DeepSeek-v2 In a Nutshell - Multi-Head Latent Attention

Understanding DeepSeek-v2’s MLA architecture and how it addresses KV cache memory challenges in long-context LLM inference.

May 15, 2024 · 2 min · Sherlock

From Softmax to FlashAttention

A deep dive into the mathematical foundations of FlashAttention, from softmax fundamentals to efficient kernel implementation.

March 20, 2024 · 8 min · Sherlock