ViewTube

4,979 results

Related queries: speculative decoding, paged attention, llm kv cache, tensorrt llm, flash attention explained, multi-query attention, umar jamil, vllm, qlora, ollama, deepseek explained, 3blue1brown, llm training, agentic ai

Julien Simon
Deep Dive: Optimizing LLM inference (36:12)
Open-source LLMs are great for conversational applications, but they can be difficult to scale in production and deliver latency ...
42,729 views · 1 year ago

AI Engineer
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou (33:39)
LLM inference is not your normal deep learning model deployment nor is it trivial when it comes to managing scale, performance ...
27,897 views · 11 months ago

IBM Technology
Faster LLMs: Accelerate Inference with Speculative Decoding (9:39)
Isaac Ke explains speculative decoding, a technique that accelerates LLM inference speeds by 2-4x without compromising output ...
17,949 views · 6 months ago
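
The speedup described in this entry comes from letting a small draft model propose several tokens that the large target model then verifies in a single forward pass. A minimal greedy sketch, assuming hypothetical `draft_model` and `target_model` callables that return one row of next-token logits per position (an illustration, not the exact method from the video):

```python
import numpy as np

def greedy_speculative_step(prompt, draft_model, target_model, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft_tokens = []
    context = list(prompt)
    for _ in range(k):
        next_tok = int(np.argmax(draft_model(context)[-1]))
        draft_tokens.append(next_tok)
        context.append(next_tok)

    # 2) Target model scores prompt + draft in ONE forward pass (expensive, done once).
    logits = target_model(list(prompt) + draft_tokens)

    # 3) Accept the longest prefix where the target agrees with the draft;
    #    replace the first mismatch with the target's own choice.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(logits[len(prompt) - 1 + i]))
        if target_choice == tok:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            break
    return accepted  # between 1 and k tokens per expensive forward pass
```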

YanAITalk
LLM inference optimization: Architecture, KV cache and Flash attention (44:06)
... training cost, so why do we focus on inference optimization? It has to come backward from the application of ...
13,788 views · 1 year ago

Red Hat
Optimize LLM inference with vLLM (6:13)
Ready to serve your large language models faster, more efficiently, and at a lower cost? Discover how vLLM, a high-throughput ...
7,627 views · 5 months ago

DataCamp
Understanding LLM Inference | NVIDIA Experts Deconstruct How AI Works (55:39)
In the last eighteen months, large language models (LLMs) have become commonplace. For many people, simply being able to ...
20,373 views · Streamed 1 year ago

Efficient NLP
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference (19:46)
... References: LLM Inference Optimization blog post: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ How to ...
56,713 views · 2 years ago
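
Of the three techniques named in that title, quantization is the easiest to show in a few lines. A generic symmetric int8 (absmax) weight-quantization sketch with NumPy, not taken from the video:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric "absmax" quantization: map the largest-magnitude weight to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # worst-case error, roughly scale / 2
```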

PyTorch
Understanding the LLM Inference Workload - Mark Moyou, NVIDIA (34:14)
Understanding how to effectively size a production grade LLM ...
21,854 views · 1 year ago

IBM Technology
AI Inference: The Secret to AI's Superpowers (10:41)
Download the AI model guide to learn more → https://ibm.biz/BdaJTb Learn more about the technology → https://ibm.biz/BdaJTp ...
101,772 views · 1 year ago

People also watched

Nadav Timor
Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica (1:00:54)
About the seminar: https://faster-llms.vercel.app Speaker: Ion Stoica (Berkeley & Anyscale & Databricks) Title: Accelerating LLM ...
6,293 views · 9 months ago

Faradawn Yang
LLM Inference Optimization #2: Tensor, Data & Expert Parallelism (TP, DP, EP, MoE) (20:18)
Part 2 of 5 in the “5 Essential LLM Optimization Techniques” series. Link to the 5 techniques roadmap: ...
1,488 views · 2 months ago

Tales Of Tensors
KV Cache: The Trick That Makes LLMs Faster (4:57)
KV Cache KV Cache Explained Large Language Model LLM Inference Optimization Transformer Model How to speed up LLMs ...
3,249 views · 3 months ago
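
The "trick" in that title is that during autoregressive decoding the keys and values of past tokens never change, so they can be computed once and reused instead of recomputed at every step. A minimal single-head sketch with NumPy and made-up shapes, purely for illustration:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single new query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
K_cache = np.zeros((0, d))   # cached keys for all past tokens
V_cache = np.zeros((0, d))   # cached values for all past tokens

for step in range(8):
    x = np.random.randn(d)                 # hidden state of the newest token only
    q, k, v = x, x, x                      # stand-ins for the Q/K/V projections
    K_cache = np.vstack([K_cache, k])      # append, instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)      # memory grows with context length, but
                                           # each step only processes one new token
```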

Graham Neubig
CMU LLM Inference (1): Introduction to Language Models and Inference (1:13:27)
This lecture (by Graham Neubig) for CMU CS 11-763, Advanced NLP (Fall 2025) covers: What is a language model? What is an ...
2,303 views · 3 months ago

MLOps.community
Efficiently Scaling and Deploying LLMs // Hanlin Tang // LLM's in Production Conference (25:14)
Abstract: Hanlin discusses the evolution of Large Language Models and the importance of efficient scaling and deployment.
12,951 views · 2 years ago

Daniel Bourke
A recipe for 50x faster local LLM inference | AI & ML Monthly (56:53)
Welcome to machine learning & AI monthly for June 2025. This is the video version of the newsletter I write every month which ...
8,255 views · 5 months ago

Databricks
How to Build LLMs on Your Company’s Data While on a Budget (40:37)
Large Language Models (LLMs) are taking AI mainstream across companies and individuals. However, public LLMs are trained ...
48,218 views · 2 years ago

Trevor Spires
Optimize LLM Latency by 10x - From Amazon AI Engineer (13:25)
In this 7-minute tutorial, discover how to ...
1,055 views · 6 months ago

The ML Tech Lead!
How to Scale LLM Applications With Continuous Batching! (6:36)
If you want to deploy an LLM endpoint, it is critical to think about how different requests are going to be handled. In typical ...
3,358 views · 1 year ago
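
Continuous (in-flight) batching, the topic of that video, means the server does not wait for an entire batch to finish before admitting new requests: after every decode step, finished sequences free their slots and queued requests take them. A toy scheduler sketch with hypothetical names, ignoring prefill and padding details:

```python
from collections import deque

def continuous_batching(requests, decode_step, max_batch_size=8):
    """requests: iterable of (request_id, prompt); decode_step(seq) -> (token, done)."""
    queue = deque(requests)
    active, finished = {}, {}
    while queue or active:
        # Admit new requests whenever a slot is free. This is the key difference
        # from static batching, which only refills once the whole batch is done.
        while queue and len(active) < max_batch_size:
            rid, prompt = queue.popleft()
            active[rid] = list(prompt)
        # One decode step for every active sequence.
        for rid in list(active):
            token, done = decode_step(active[rid])
            active[rid].append(token)
            if done:
                finished[rid] = active.pop(rid)  # free the slot immediately
    return finished
```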

Bijan Bowen
Run A Local LLM Across Multiple Computers! (vLLM Distributed Inference) (16:45)
Timestamps: 00:00 - Intro 01:24 - Technical Demo 09:48 - Results 11:02 - Intermission 11:57 - Considerations 15:48 - Conclusion ...
23,636 views · 1 year ago

Code to the Moon
Insanely Fast LLM Inference with this Stack (10:43)
A walkthrough of some of the options developers are faced with when building applications that leverage LLMs. Includes ...
9,628 views · 3 months ago

CNCF [Cloud Native Computing Foundation]
Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kub... D. Gray (37:45)
Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon Europe in London from April 1 - 4, 2025.
2,007 views · 1 year ago

IBM Technology
What is vLLM? Efficient AI Inference for Large Language Models (4:58)
Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...
55,129 views · 6 months ago

Faradawn Yang
AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA (17:52)
Video 1 of 6 | Mastering LLM Techniques: Inference Optimization. In this episode we break down the two fundamental phases of ...
7,888 views · 6 months ago
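
The two phases that video refers to are prefill, which pushes the whole prompt through the model in one compute-bound pass to fill the KV cache, and decode, which then generates one token at a time and is typically memory-bandwidth-bound. A schematic sketch assuming a hypothetical `model.forward(tokens, kv_cache)` interface (illustration only):

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens):
    # Phase 1: prefill. All prompt tokens go through the model at once,
    # which is compute-bound and fills the KV cache in a single pass.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = int(np.argmax(logits[-1]))

    generated = [next_token]
    # Phase 2: decode. One token per step, reusing the KV cache; each step
    # must read all cached K/V, so it tends to be memory-bandwidth-bound.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(np.argmax(logits[-1]))
        generated.append(next_token)
    return generated
```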

NVIDIA Developer
Improving LLM Throughput via Data Center-Scale Inference Optimizations (17:24)
Speaker: Maksim Khadkevich, Sr. Software Engineering Manager, Dynamo, NVIDIA. Khadkevich discusses data center scale ...
878 views · 12 days ago

AppliedAI
How Much GPU Memory is Needed for LLM Inference? (5:28)
Discover a simple method to calculate GPU memory requirements for large language models like Llama 70B. Learn how the ...
959 views · 1 year ago
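
The usual back-of-the-envelope estimate for this is bytes-per-parameter times parameter count for the weights, plus the KV cache. A sketch with my own illustrative formulas and config values (not necessarily the exact method shown in the video):

```python
def weight_memory_gb(n_params, bytes_per_param=2):        # 2 bytes per weight = FP16/BF16
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    # 2 tensors (K and V) per layer, per token, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

# Llama-70B-style example (assumed config: 80 layers, 8 KV heads, head_dim 128):
print(weight_memory_gb(70e9))            # ~140 GB of weights in FP16
print(kv_cache_gb(80, 8, 128, 4096, 1))  # ~1.3 GB of KV cache per 4k-token sequence
```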

Tales Of Tensors
Speculative Decoding: 3× Faster LLM Inference with Zero Quality Loss (7:40)
Speculative decoding is one of the most important performance optimizations in modern LLM serving—and most people still don't ...
61 views · 2 days ago

Databricks
Accelerating LLM Inference with vLLM (35:53)
vLLM is an open-source highly performant engine for LLM inference and serving developed at UC Berkeley. vLLM has been ...
23,735 views · 1 year ago

Richard Aragon
Defeating Nondeterminism in LLM Inference Is Impossible (31:11)
Link to Document: ...
778 views · 3 months ago

AI Engineer
How fast are LLM inference engines anyway? — Charles Frye, Modal (16:07)
Open weights models and open source inference servers have made massive strides in the year since we last got together at AIE ...
1,363 views · 5 months ago