I discuss methods, tradeoffs, and design patterns for accelerating inference of large language models, with an eye toward memory management, latency, and throughput.
Introduction
Most large language models (LLMs) today are based on autoregressive transformers. These models are more parallelizable during training than their recurrent and convolutional predecessors, but their decoding is inherently sequential: each new token depends on all previously generated tokens.
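To make the sequential nature of autoregressive decoding concrete, here is a minimal sketch of a greedy generation loop. The `next_token_logits` function is a hypothetical stand-in for a real model's forward pass, not an API from any particular library; the point is that one full forward pass is needed per generated token.

```python
def next_token_logits(tokens):
    # Toy stand-in for a model forward pass (hypothetical):
    # always prefers the token (last + 1) mod vocab_size.
    vocab_size = 8
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # One forward pass per token: this loop cannot be parallelized
        # across output positions, which is why decoding latency matters.
        logits = next_token_logits(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```

With the toy model above, `generate([3], 4)` appends one token per iteration, illustrating why generating N tokens costs N sequential model evaluations.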