I discuss methods, tradeoffs, and design patterns for accelerating inference of large language models, with an eye toward memory management, latency, and throughput.
Introduction
Most large language models (LLMs) today are based on autoregressive transformers. These models are more parallelizable during training than their recurrent and convolutional predecessors, but their decoding is inherently sequential: each new token depends on all previously generated tokens.
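To make the sequential nature of autoregressive decoding concrete, here is a minimal sketch of a greedy generation loop. The `next_token_logits` function is a hypothetical stand-in for a real model's forward pass, not an API from any particular library; the point is that one full forward pass is needed per generated token.

```python
def next_token_logits(tokens):
    # Toy stand-in for a model forward pass (hypothetical):
    # always prefers the token (last + 1) mod vocab_size.
    vocab_size = 8
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # One forward pass per token: this loop cannot be parallelized
        # across output positions, which is why decoding latency matters.
        logits = next_token_logits(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```

With the toy model above, `generate([3], 4)` appends one token per iteration, illustrating why generating N tokens costs N sequential model evaluations.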