I discuss methods, tradeoffs, and design patterns for accelerating large language model inference, with an eye toward memory management, latency, and throughput.

Introduction

Most Large Language Models (LLMs) today are autoregressive transformer models. Transformers are far more parallelizable than the recurrence-based and convolutional models that preceded them, since attention can process every position of a sequence at once during training. At inference time, however, autoregressive decoding remains inherently sequential: each new token depends on all the tokens generated before it.
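
That sequential dependence is easy to see in code. Below is a minimal sketch of greedy autoregressive decoding, assuming a hypothetical `next_token_logits` stand-in for a real transformer forward pass; `VOCAB_SIZE`, `EOS_ID`, and the random logit table are toy assumptions, not any particular model's API.

```python
import numpy as np

VOCAB_SIZE = 32   # hypothetical toy vocabulary size
EOS_ID = 0        # hypothetical end-of-sequence token id

rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained transformer: a fixed random table
# mapping the last token id to logits over the vocabulary.
LOGIT_TABLE = rng.standard_normal((VOCAB_SIZE, VOCAB_SIZE))

def next_token_logits(prefix: list[int]) -> np.ndarray:
    """Toy substitute for a transformer forward pass over `prefix`.

    A real model would attend over the entire prefix; this toy version
    only looks at the last token.
    """
    return LOGIT_TABLE[prefix[-1] % VOCAB_SIZE]

def greedy_decode(prompt: list[int], max_new_tokens: int = 16) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each step consumes the tokens produced so far, so decoding is
        # inherently sequential: step t cannot begin until step t-1 ends.
        logits = next_token_logits(tokens)
        token = int(np.argmax(logits))
        tokens.append(token)
        if token == EOS_ID:
            break
    return tokens

if __name__ == "__main__":
    print(greedy_decode([5, 7, 11]))
```

That serialized loop is the bottleneck that decoding-acceleration techniques such as speculative sampling (resource 1 below) try to relax.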

Resources

  1. Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)