ViewTube

4,979 results

Related queries: speculative decoding, paged attention, llm kv cache, tensorrt llm, flash attention explained, multi-query attention, umar jamil, vllm, qlora, ollama, deepseek explained, 3blue1brown, llm training, agentic ai

Julien Simon
Deep Dive: Optimizing LLM inference (36:12)
Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and deliver latency ...
42,729 views · 1 year ago

AI Engineer
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou (33:39)
LLM inference is not your normal deep learning model deployment nor is it trivial when it comes to managing scale, performance ...
27,897 views · 11 months ago

IBM Technology
Faster LLMs: Accelerate Inference with Speculative Decoding (9:39)
Isaac Ke explains speculative decoding, a technique that accelerates LLM inference speeds by 2-4x without compromising output ...
17,949 views · 6 months ago
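
The speedup described in this entry comes from letting a small draft model propose several tokens that the large target model then verifies in a single forward pass. A minimal greedy sketch, assuming hypothetical `draft_model` and `target_model` callables that return one row of next-token logits per position (an illustration, not the exact method from the video):

```python
import numpy as np

def greedy_speculative_step(prompt, draft_model, target_model, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft_tokens = []
    context = list(prompt)
    for _ in range(k):
        next_tok = int(np.argmax(draft_model(context)[-1]))
        draft_tokens.append(next_tok)
        context.append(next_tok)

    # 2) Target model scores prompt + draft in ONE forward pass (expensive, done once).
    logits = target_model(list(prompt) + draft_tokens)

    # 3) Accept the longest prefix where the target agrees with the draft;
    #    replace the first mismatch with the target's own choice.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(logits[len(prompt) - 1 + i]))
        if target_choice == tok:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            break
    return accepted  # between 1 and k tokens per expensive forward pass
```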

YanAITalk
LLM inference optimization: Architecture, KV cache and Flash attention (44:06)
... training cost, so why do we focus on inference optimization? It has to come backward from the application of ...
13,788 views · 1 year ago

Red Hat
Optimize LLM inference with vLLM (6:13)
Ready to serve your large language models faster, more efficiently, and at a lower cost? Discover how vLLM, a high-throughput ...
7,627 views · 5 months ago

DataCamp
Understanding LLM Inference | NVIDIA Experts Deconstruct How AI Works (55:39)
In the last eighteen months, large language models (LLMs) have become commonplace. For many people, simply being able to ...
20,373 views · Streamed 1 year ago

Efficient NLP
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference (19:46)
... References: LLM Inference Optimization blog post: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ How to ...
56,713 views · 2 years ago
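
Of the three techniques named in that title, quantization is the easiest to show in a few lines. A generic symmetric int8 (absmax) weight-quantization sketch with NumPy, not taken from the video:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric "absmax" quantization: map the largest-magnitude weight to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case error, roughly scale / 2
```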

PyTorch
Understanding the LLM Inference Workload - Mark Moyou, NVIDIA (34:14)
Understanding how to effectively size a production grade LLM ...
21,854 views · 1 year ago

IBM Technology
AI Inference: The Secret to AI's Superpowers (10:41)
Download the AI model guide to learn more → https://ibm.biz/BdaJTb Learn more about the technology → https://ibm.biz/BdaJTp ...
101,772 views · 1 year ago

People also watched

Nadav Timor
Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica (1:00:54)
About the seminar: https://faster-llms.vercel.app Speaker: Ion Stoica (Berkeley & Anyscale & Databricks) Title: Accelerating LLM ...
6,293 views · 9 months ago

Faradawn Yang
LLM Inference Optimization #2: Tensor, Data & Expert Parallelism (TP, DP, EP, MoE) (20:18)
Part 2 of 5 in the “5 Essential LLM Optimization Techniques” series. Link to the 5 techniques roadmap: ...
1,488 views · 2 months ago

Tales Of Tensors
KV Cache: The Trick That Makes LLMs Faster (4:57)
KV Cache KV Cache Explained Large Language Model LLM Inference Optimization Transformer Model How to speed up LLMs ...
3,249 views · 3 months ago
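
The "trick" in that title is that during autoregressive decoding the keys and values of past tokens never change, so they can be computed once and reused instead of recomputed at every step. A minimal single-head sketch with NumPy and made-up shapes, purely for illustration:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single new query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
K_cache = np.zeros((0, d))   # cached keys for all past tokens
V_cache = np.zeros((0, d))   # cached values for all past tokens

for step in range(8):
    x = np.random.randn(d)                 # hidden state of the newest token only
    q, k, v = x, x, x                      # stand-ins for the Q/K/V projections
    K_cache = np.vstack([K_cache, k])      # append, instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)      # memory grows with context length, but
                                           # each step only processes one new token
```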

Graham Neubig
CMU LLM Inference (1): Introduction to Language Models and Inference (1:13:27)
This lecture (by Graham Neubig) for CMU CS 11-763, Advanced NLP (Fall 2025) covers: What is a language model? What is an ...
2,303 views · 3 months ago

MLOps.community
Efficiently Scaling and Deploying LLMs // Hanlin Tang // LLM's in Production Conference (25:14)
Abstract: Hanlin discusses the evolution of Large Language Models and the importance of efficient scaling and deployment.
12,951 views · 2 years ago

Daniel Bourke
A recipe for 50x faster local LLM inference | AI & ML Monthly (56:53)
Welcome to machine learning & AI monthly for June 2025. This is the video version of the newsletter I write every month which ...
8,255 views · 5 months ago

Databricks
How to Build LLMs on Your Company’s Data While on a Budget (40:37)
Large Language Models (LLMs) are taking AI mainstream across companies and individuals. However, public LLMs are trained ...
48,218 views · 2 years ago

Trevor Spires
Optimize LLM Latency by 10x - From Amazon AI Engineer (13:25)
In this 7-minute tutorial, discover how to ...
1,055 views · 6 months ago

The ML Tech Lead!
How to Scale LLM Applications With Continuous Batching! (6:36)
If you want to deploy an LLM endpoint, it is critical to think about how different requests are going to be handled. In typical ...
3,358 views · 1 year ago
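
Continuous (in-flight) batching, the topic of that video, means the server does not wait for an entire batch to finish before admitting new requests: after every decode step, finished sequences free their slots and queued requests take them. A toy scheduler sketch with hypothetical names, ignoring prefill and padding details:

```python
from collections import deque

def continuous_batching(requests, decode_step, max_batch_size=8):
    """requests: iterable of (request_id, prompt); decode_step(seq) -> (token, done)."""
    queue = deque(requests)
    active, finished = {}, {}
    while queue or active:
        # Admit new requests whenever a slot is free. This is the key difference
        # from static batching, which only refills once the whole batch is done.
        while queue and len(active) < max_batch_size:
            rid, prompt = queue.popleft()
            active[rid] = list(prompt)
        # One decode step for every active sequence.
        for rid in list(active):
            token, done = decode_step(active[rid])
            active[rid].append(token)
            if done:
                finished[rid] = active.pop(rid)  # free the slot immediately
    return finished
```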

Bijan Bowen
Run A Local LLM Across Multiple Computers! (vLLM Distributed Inference) (16:45)
Timestamps: 00:00 - Intro 01:24 - Technical Demo 09:48 - Results 11:02 - Intermission 11:57 - Considerations 15:48 - Conclusion ...
23,636 views · 1 year ago

Code to the Moon
Insanely Fast LLM Inference with this Stack (10:43)
A walkthrough of some of the options developers are faced with when building applications that leverage LLMs. Includes ...
9,628 views · 3 months ago

CNCF [Cloud Native Computing Foundation]
Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kub... D. Gray (37:45)
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon Europe in London from April 1 - 4, 2025.
2,007 views · 1 year ago

IBM Technology
What is vLLM? Efficient AI Inference for Large Language Models (4:58)
Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...
55,129 views · 6 months ago

Faradawn Yang
AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA (17:52)
Video 1 of 6 | Mastering LLM Techniques: Inference Optimization. In this episode we break down the two fundamental phases of ...
7,888 views · 6 months ago
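
The two phases that video refers to are prefill, which pushes the whole prompt through the model in one compute-bound pass to fill the KV cache, and decode, which then generates one token at a time and is typically memory-bandwidth-bound. A schematic sketch assuming a hypothetical `model.forward(tokens, kv_cache)` interface (illustration only):

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens):
    # Phase 1: prefill. All prompt tokens go through the model at once,
    # which is compute-bound and fills the KV cache in a single pass.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = int(np.argmax(logits[-1]))

    generated = [next_token]
    # Phase 2: decode. One token per step, reusing the KV cache; each step
    # must read all cached K/V, so it tends to be memory-bandwidth-bound.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(np.argmax(logits[-1]))
        generated.append(next_token)
    return generated
```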

NVIDIA Developer
Improving LLM Throughput via Data Center-Scale Inference Optimizations (17:24)
Speaker: Maksim Khadkevich, Sr. Software Engineering Manager, Dynamo, NVIDIA. Khadkevich discusses data center scale ...
878 views · 12 days ago

AppliedAI
How Much GPU Memory is Needed for LLM Inference? (5:28)
Discover a simple method to calculate GPU memory requirements for large language models like Llama 70B. Learn how the ...
959 views · 1 year ago
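
The usual back-of-the-envelope estimate for this is bytes-per-parameter times parameter count for the weights, plus the KV cache. A sketch with my own illustrative formulas and config values (not necessarily the exact method shown in the video):

```python
def weight_memory_gb(n_params, bytes_per_param=2):        # 2 bytes per weight = FP16/BF16
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    # 2 tensors (K and V) per layer, per token, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

# Llama-70B-style example (assumed config: 80 layers, 8 KV heads, head_dim 128):
print(weight_memory_gb(70e9))            # ~140 GB of weights in FP16
print(kv_cache_gb(80, 8, 128, 4096, 1))  # ~1.3 GB of KV cache per 4k-token sequence
```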

Tales Of Tensors
Speculative Decoding: 3× Faster LLM Inference with Zero Quality Loss (7:40)
Speculative decoding is one of the most important performance optimizations in modern LLM serving—and most people still don't ...
61 views · 2 days ago

Databricks
Accelerating LLM Inference with vLLM (35:53)
vLLM is an open-source highly performant engine for LLM inference and serving developed at UC Berkeley. vLLM has been ...
23,735 views · 1 year ago

Richard Aragon
Defeating Nondeterminism in LLM Inference Is Impossible (31:11)
Link to Document: ...
778 views · 3 months ago

AI Engineer
How fast are LLM inference engines anyway? — Charles Frye, Modal (16:07)
Open weights models and open source inference servers have made massive strides in the year since we last got together at AIE ...
1,363 views · 5 months ago