4,979 results
Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and deliver latency ...
42,729 views
1 year ago
LLM inference is not your normal deep learning model deployment, nor is it trivial when it comes to managing scale, performance ...
27,897 views
11 months ago
Isaac Ke explains speculative decoding, a technique that accelerates LLM inference speeds by 2-4x without compromising output ...
17,949 views
6 months ago
... training cost, so why do we focus on inference optimization? It has to come backward from the application of ...
13,788 views
Ready to serve your large language models faster, more efficiently, and at a lower cost? Discover how vLLM, a high-throughput ...
7,627 views
5 months ago
In the last eighteen months, large language models (LLMs) have become commonplace. For many people, simply being able to ...
20,373 views
Streamed 1 year ago
... References: LLM Inference Optimization blog post: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ How to ...
56,713 views
2 years ago
Understanding the LLM Inference Workload - Mark Moyou, NVIDIA. Understanding how to effectively size a production-grade LLM ...
21,854 views
Download the AI model guide to learn more → https://ibm.biz/BdaJTb Learn more about the technology → https://ibm.biz/BdaJTp ...
101,772 views
About the seminar: https://faster-llms.vercel.app Speaker: Ion Stoica (Berkeley & Anyscale & Databricks) Title: Accelerating LLM ...
6,293 views
9 months ago
Part 2 of 5 in the “5 Essential LLM Optimization Techniques” series. Link to the 5 techniques roadmap: ...
1,488 views
2 months ago
KV Cache Explained | Large Language Model | LLM Inference Optimization | Transformer Model | How to speed up LLMs ...
3,249 views
3 months ago
This lecture (by Graham Neubig) for CMU CS 11-763, Advanced NLP (Fall 2025) covers: What is a language model? What is an ...
2,303 views
Abstract: Hanlin discusses the evolution of Large Language Models and the importance of efficient scaling and deployment.
12,951 views
Welcome to machine learning & AI monthly for June 2025. This is the video version of the newsletter I write every month which ...
8,255 views
Large Language Models (LLMs) are taking AI mainstream across companies and individuals. However, public LLMs are trained ...
48,218 views
In this 7-minute tutorial, discover how to ...
1,055 views
If you want to deploy an LLM endpoint, it is critical to think about how different requests are going to be handled. In typical ...
3,358 views
Timestamps: 00:00 - Intro 01:24 - Technical Demo 09:48 - Results 11:02 - Intermission 11:57 - Considerations 15:48 - Conclusion ...
23,636 views
A walkthrough of some of the options developers are faced with when building applications that leverage LLMs. Includes ...
9,628 views
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon Europe in London from April 1 - 4, 2025.
2,007 views
Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...
55,129 views
Video 1 of 6 | Mastering LLM Techniques: Inference Optimization. In this episode we break down the two fundamental phases of ...
7,888 views
Speaker: Maksim Khadkevich, Sr. Software Engineering Manager, Dynamo, NVIDIA. Khadkevich discusses data center scale ...
878 views
12 days ago
Discover a simple method to calculate GPU memory requirements for large language models like Llama 70B. Learn how the ...
959 views
Speculative decoding is one of the most important performance optimizations in modern LLM serving—and most people still don't ...
61 views
2 days ago
vLLM is an open-source highly performant engine for LLM inference and serving developed at UC Berkeley. vLLM has been ...
23,735 views
Link to Document: ...
778 views
Open weights models and open source inference servers have made massive strides in the year since we last got together at AIE ...
1,363 views