8-bit KV Cache

This post introduces KV cache quantization for LLM inference.

January 24, 2024 · 8 min · Sherlock
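For readers skimming the archive, here is a minimal sketch of what "8-bit KV cache" usually means: cached key/value floats are stored as int8 plus a floating-point scale and dequantized before attention. It assumes symmetric, per-tensor quantization for simplicity; the post may use a different granularity (per token or per head), and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (assumed scheme, for illustration)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to scale 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # round() is round-to-nearest (ties to even); clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from int8 codes and the stored scale."""
    return [x * scale for x in q]


# Usage: quantize a (hypothetical) slice of cached keys, then dequantize it.
keys = [0.5, -1.27, 0.01, 1.0, -0.3]
codes, s = quantize_int8(keys)
restored = dequantize_int8(codes, s)
```

Each element is reconstructed to within half a quantization step (`scale / 2`), which is why outliers that inflate `max_abs` hurt accuracy and motivate finer-grained scales.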

SmoothQuant and AWQ

This blog post compares SmoothQuant and AWQ, covering how they differ and how each is implemented in code.

October 8, 2023 · 12 min · Sherlock
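As a quick reference for the SmoothQuant side of the comparison, here is a minimal sketch of its core idea: per-input-channel scales s_j = max|X_j|^α / max|W_j|^(1-α) move quantization difficulty from activations to weights, while X·W is mathematically unchanged. It uses plain Python lists; the helper names, the α = 0.5 default, and the eps clamp are assumptions for illustration, not the post's code.

```python
def channel_absmax_acts(X):
    # X: tokens x input channels; max |x| per input channel (column).
    return [max(abs(v) for v in col) for col in zip(*X)]


def channel_absmax_weights(W):
    # W: input channels x output channels; max |w| per input channel (row).
    return [max(abs(v) for v in row) for row in W]


def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); eps clamp avoids division by zero.
    return [max(a, eps) ** alpha / max(w, eps) ** (1 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]


def apply_smoothing(X, W, scales):
    # X' = X diag(s)^-1 and W' = diag(s) W, so X' W' == X W.
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Usage: channel 1 of X is an activation outlier (20x larger than channel 0).
X = [[1.0, 20.0], [-2.0, 10.0]]
W = [[0.5, -1.0], [0.1, 0.2]]
s = smooth_scales(channel_absmax_acts(X), channel_absmax_weights(W))
X_s, W_s = apply_smoothing(X, W, s)
```

With α = 0.5 the smoothed activation and weight channels end up with the same absolute maximum, sqrt(max|X_j| · max|W_j|), which flattens the outlier channel before quantization.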

Triton Tutorial - Series 2

Build a high-performance matrix multiplication kernel in Triton, optimizing it step by step until it rivals cuBLAS.

September 20, 2023 · 9 min · Sherlock

Triton Tutorial #2

Third post in the Triton tutorial series: GEMM and autotuning.

September 20, 2023 · 9 min · Sherlock

GPTQ Code Implementation

This blog post delves into the code implementation of the GPTQ quantization process, using the Llama model as a case study.

September 18, 2023 · 14 min · Sherlock

GPTQ Math Derivation

This blog post traces the development of GPTQ, starting from its roots in Optimal Brain Damage (OBD), through Optimal Brain Surgeon (OBS), and finally to Optimal Brain Compression (OBC).

September 9, 2023 · 7 min · Sherlock
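For orientation, the lineage in that post can be summarized by the standard saliency and weight-update formulas, as they are commonly stated. Here H is the Hessian of the loss with respect to the weights w (for layer-wise reconstruction in the OBC/GPTQ setting, H = 2XX^T), and quant(·) rounds a weight to the quantization grid; the notation is ours and may differ from the post's.

```latex
% OBD: diagonal Hessian approximation, loss increase from removing weight q
\delta L_q \approx \tfrac{1}{2}\, H_{qq}\, w_q^2

% OBS: full inverse Hessian; saliency of removing w_q and the compensating update
L_q = \frac{w_q^2}{2\,[H^{-1}]_{qq}}, \qquad
\delta \mathbf{w} = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1}_{:,q}

% OBQ (from OBC): quantize w_q instead of zeroing it, picking q greedily
q = \arg\min_q \frac{\big(\operatorname{quant}(w_q) - w_q\big)^2}{[H^{-1}]_{qq}}, \qquad
\delta \mathbf{w} = -\frac{w_q - \operatorname{quant}(w_q)}{[H^{-1}]_{qq}}\, H^{-1}_{:,q}
```

In both update rules the q-th component of w + δw lands exactly on its target (0 for OBS, quant(w_q) for OBQ), while the remaining weights absorb the error; in OBQ, H is restricted to the weights not yet quantized, which GPTQ then reorganizes into a fixed column order for speed.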

Triton Tutorial - Series 1

Learn to write a fused softmax kernel in Triton, with debugging and performance benchmarking techniques.

September 5, 2023 · 7 min · Sherlock

Triton Tutorial #1

Second post in the Triton tutorial series: a fused softmax kernel, plus how to debug and benchmark it.

September 5, 2023 · 7 min · Sherlock

Triton Tutorial - Series 0

An introduction to the Triton programming language, with an installation guide and a vector-addition example to get started with GPU kernel development.

September 2, 2023 · 5 min · Sherlock

Triton Tutorial #0

First post in the Triton tutorial series: an introduction to Triton, installation, and a vector-add example.

September 2, 2023 · 5 min · Sherlock