W4A8KV4 Quantization Summary and Best Practices

Summary of W4A8KV4 quantization (4-bit weights, 8-bit activations, 4-bit KV cache), covering KV4 and W4A8 optimization methods with practical recommendations.

August 30, 2024 · 6 min · Sherlock

Speculative Sampling for Faster LLM Inference

Deep dive into speculative sampling, which accelerates LLM inference through draft-model prediction and rejection sampling.

June 20, 2024 · 3 min · Sherlock

Triton Tutorial - Series 1

Learn to write a fused softmax kernel in Triton, with debugging and performance benchmarking techniques.

September 5, 2023 · 7 min · Sherlock