ViewTube

2,747 results

Red Hat
Lossless LLM inference acceleration with Speculators
Red Hat's Mark Kurtz and Megan Flynn examine speculative decoding, a technique that uses a smaller, faster model—the ...
29:48 · 316 views · 4 weeks ago

Tales Of Tensors
Speculative Decoding: 3× Faster LLM Inference with Zero Quality Loss
Speculative decoding is one of the most important performance optimizations in modern LLM serving—and most people still don't ...
7:40 · 47 views · 1 day ago

Jordan Boyd-Graber
Why using a dumb language model can speed up a smarter one: Speculative Decoding [Lecture]
This is a single lecture from a course. If you like the material and want more context (e.g., the lectures that came before), check ...
7:48 · 109 views · 3 weeks ago

EleutherAI
ML Performance Reading Group Session 19: Speculative Decoding
Session covering an overview of speculative decoding and several seminal papers in the space, including Medusa, Eagle 1/2/3, ...
1:36:03 · 452 views · 2 days ago

R
Speculative Decoding for Fast LLM Inference Algorithm explained in detail
This video introduces the playlist, explains the Speculative Decoding Algorithm from the paper https://arxiv.org/pdf/2211.17192 in ...
18:45 · 18 views · 5 days ago

Zaharah
The Secret to Faster LLMs: How Speculative Decoding Works
Why is generating text with LLMs so slow? It's not a compute problem, it's a memory bandwidth problem. In this video, we explore ...
7:06 · 28 views · 2 weeks ago

Doubleword
Behind the Stack, Ep. 13 - Faster Inference: Speculative Decoding for Batched Workloads
Speculative decoding is usually discussed as a way to make real-time LLM APIs feel faster. But what happens when you apply it to ...
19:54 · 28 views · 3 weeks ago

Uplatz
Convergence of Concurrency: Batching and Speculative Decoding Conflict | Uplatz
Modern LLM serving systems are built for scale, relying heavily on batching and parallelism to maximize GPU utilization. But as ...
8:08 · 0 views · 18 hours ago

…をよむひと
AI's Speed Limit: Speculative Decoding EXPLAINED!
Hello everyone! Host Ni-no is back, exploring a trending article from the archives that answers a burning question: 'Is there a ...
9:22 · 0 views · 8 days ago

Entropic
NeuRIPS 2025: Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding
This video offers a formal and structured overview of the core ideas, methodology, and contributions of the work Conformal ...
8:40 · 10 views · 12 days ago

GOSIM Foundation
【GOSIM HANGZHOU 2025】Yikai Zhu, Lukec Wang:SpecForge - Speculative Decoding Model Training Framework
Full title: Yikai Zhu, Lukec Wang: SpecForge: Open Source Framework for Training Speculative Decoding Models Speculative ...
17:51 · 3 views · 2 weeks ago

Zaharah
Speculative Decoding for Faster LLMs
Learn how Speculative Decoding speeds up inference with a draft + verify approach. @zarabux.
0:18 · 0 views · 2 days ago
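Several of the results above describe the same core mechanism: a small draft model cheaply proposes a few tokens, the large target model verifies them, and only agreeing tokens are kept, so the output is identical to running the target model alone. A minimal sketch of the greedy variant of that loop, using toy arithmetic "models" in place of real LLMs (`draft_model`, `target_model`, and `speculative_step` are illustrative names, not any library's API):

```python
def draft_model(prefix):
    # Toy stand-in for the small, fast drafter: a cheap next-token guess.
    return (sum(prefix) + 1) % 5

def target_model(prefix):
    # Toy stand-in for the large, slow target model (the ground truth).
    # It agrees with the drafter except at every third position.
    return (sum(prefix) + 1) % 5 if len(prefix) % 3 else (sum(prefix) + 2) % 5

def speculative_step(prefix, k=4):
    """Draft k tokens, verify them with the target model, and keep the
    longest agreeing prefix plus one corrected token from the target."""
    # 1. Draft phase: k cheap autoregressive guesses from the small model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verify phase: the target model checks each drafted position.
    #    (In a real system this is a single batched forward pass, which is
    #    where the speedup comes from.)
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # correction keeps output lossless
            break
    else:
        accepted.append(target_model(ctx))  # bonus token if all drafts pass
    return accepted

# The accepted tokens match a plain autoregressive run of the target model,
# which is why the technique is called "lossless":
prefix = [1, 2, 3]
out = speculative_step(prefix)
check = list(prefix)
for _ in range(len(out)):
    check.append(target_model(check))
assert check[len(prefix):] == out
```

Per step, the drafter runs k times but the expensive target model effectively runs once, so the speedup scales with how often the draft tokens are accepted.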

AI Research Roundup
DEER: Diffusion Drafting for Faster LLMs
Instead of using an autoregressive drafter in speculative decoding, it uses a discrete diffusion LLM that can draft whole token ...
3:51 · 0 views · 6 days ago

IgniteGTM
Inside LinkedIn’s AI Stack. Scaling GPUs, Agents, and Efficiency Across Every Layer
... networking, scheduling, observability, training optimization, fused kernels, distillation, RL efficiency, speculative decoding, and ...
19:30 · 0 views · 2 weeks ago

NewestAIInovations
NVIDIA TiDAR
Notably, the architecture is shown to outperform both standard diffusion models and current state-of-the-art speculative decoding ...
6:40 · 0 views · 3 weeks ago

AI Research Roundup
AdaSPEC: Selective KD for Faster LLM Spec Decoding
... Selective Knowledge Distillation for Efficient Speculative Decoders' This work tackles inefficiencies in speculative decoding, ...
3:42 · 0 views · 2 weeks ago

PaperLens
NVIDIA TiDAR: 5.9x Faster LLM Inference! Diffusion Speed, AR Quality
TiDAR outperforms speculative decoding in throughput and surpasses diffusion models like Dream and Llada in efficiency and ...
5:56 · 165 views · 4 weeks ago

임커밋
How Companies Save on LLM Serving Costs
* Collaboration inquiries: commit.im@gmail.com (Please refrain from using personal emails.) The video animation was created ...
4:29 · 10,117 views · 3 weeks ago

Paper to Pod
5x Faster LLMs? How TiDAR Merges Diffusion and AR Architectures
*Beating Speculative Decoding:* Why TiDAR offers a more efficient alternative to current acceleration methods like speculative ...
7:24 · 54 views · 2 weeks ago

AWS Events
AWS re:Invent 2025 - Sustainable and cost-efficient generative AI with agentic workflows (AIM333)
... scalable development, and optimization techniques like quantization and speculative decoding. Auto-scaling, batch processing, ...
54:06 · 239 views · 2 weeks ago