ViewTube

247,841 results

Related queries

paged attention

attention mechanism

attention is all you need

group query attention

positional encoding

multi head attention

vision transformer

transformer architecture

Jia-Bin Huang
How FlashAttention Accelerates Generative AI Revolution

FlashAttention is an IO-aware algorithm for computing attention used in Transformers. It's fast, memory-efficient, and exact.

11:54 · 23,460 views · 1 year ago

Machine Learning Studio
FlashAttention: Accelerate LLM training

In this video, we cover FlashAttention. FlashAttention is an IO-aware attention algorithm that significantly accelerates the training of ...

11:27 · 7,849 views · 1 year ago

Stanford MLSys Seminars
FlashAttention - Tri Dao | Stanford MLSys #67

Episode 67 of the Stanford MLSys Seminar “Foundation Models Limited Series”! Speaker: Tri Dao. Abstract: Transformers are slow ...

58:58 · 38,221 views · Streamed 2 years ago

Stanford MedAI
MedAI #54: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Tri Dao

Title: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Speaker: Tri Dao. Abstract: Transformers are ...

47:47 · 20,213 views · 3 years ago

GPU MODE
Lecture 80: How FlashAttention 4 Works

Speaker: Charles Frye. The source code (in CuTe) for FlashAttention 4 on Blackwell GPUs has recently been released for the ...

1:15:09 · 4,690 views · 2 months ago

Unify
Flash Attention Explained

In this episode, we explore the Flash Attention algorithm with our esteemed guest speaker, Dan Fu, renowned researcher at ...

57:20 · 5,538 views · Streamed 2 years ago

Umar Jamil
Flash Attention derived and coded from first principles with Triton (Python)

In this video, I'll be deriving and coding Flash Attention from scratch. I'll be deriving every operation we do in Flash Attention using ...

7:38:18 · 70,574 views · 1 year ago

Tales Of Tensors
Flash Attention: The Fastest Attention Mechanism?

This video explains FlashAttention-1, FlashAttention-2, and FlashAttention-3 in a clear, visual, step-by-step way. We look at why ...

8:43 · 614 views · 4 weeks ago

People also watched

Pascal Poupart
CS480/680 Lecture 19: Attention and Transformer Networks

... talked about yet what is a transformer all I'm doing so far is just explaining in a general form what is the attention mechanism but ...

1:22:38 · 366,875 views · 6 years ago

learningcurve
Visualize the Transformers Multi-Head Attention in Action

We depict how a single layer Multi-Head Attention Network applies mathematical projections over Question-Answer data, ...

5:54 · 30,860 views · 4 years ago

Niels Rogge
How a Transformer works at inference vs training time

I made this video to illustrate the difference between how a Transformer is used at inference time (i.e. when generating text) vs.

49:53 · 68,536 views · 2 years ago

Halfling Wizard
Attention Is All You Need - Paper Explained

In this video, I'll try to present a comprehensive study on Ashish Vaswani and his coauthors' renowned paper, “attention is all you ...

36:44 · 128,509 views · 4 years ago

Google Cloud Tech
Attention mechanism: Overview

This video introduces you to the attention mechanism, a powerful technique that allows neural networks to focus on specific parts ...

5:34 · 221,333 views · 2 years ago

Discover AI
How to explain Q, K and V of Self Attention in Transformers (BERT)?

How to explain Q, K and V of Self Attention in Transformers (BERT)? Thought about it and present here my most general approach ...

15:06 · 16,282 views · 3 years ago

Machine Learning Studio
Variants of Multi-head attention: Multi-query (MQA) and Grouped-query attention (GQA)

Explore the intricacies of Multihead Attention variants: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA).

8:13 · 12,281 views · 2 years ago

AI Papers Academy
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

In this video we review a recent important paper from Apple, titled: "LLM in a flash: Efficient Large Language Model Inference with ...

6:28 · 4,754 views · 2 years ago

Sachin Kalsi
ELI5 FlashAttention: Fast & Efficient Transformer Training - part 2

In this exciting series, join me as I delve deep into the revolutionary FlashAttention technique - a game-changer that supercharges ...

39:17 · 3,487 views · 2 years ago

MLOps.community
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

Join the MLOps Community here: mlops.community/join // Abstract: Getting the right LLM inference stack means choosing the right ...

30:25 · 26,678 views · 2 years ago

GPU MODE
How FlashAttention 4 Works

Speaker: Charles Frye. From the Modal team: https://modal.com/blog/reverse-engineer-flash-attention-4.

1:15:09 · 3,729 views · Streamed 2 months ago

3Blue1Brown
Attention in transformers, step-by-step | Deep Learning Chapter 6

Demystifying attention, the key mechanism inside transformers and LLMs. Instead of sponsored ad reads, these lessons are ...

26:10 · 3,505,240 views · 1 year ago

Martin Is A Dad
FlashAttention V1 Deep Dive By Google Engineer | Fast and Memory-Efficient LLM Training

... #flash #maths #machinelearning 0:00 Intro 0:56 CPU and GPU Memory Hierarchy 4:29 Standard Attention 8:26 Flash Attention ...

34:38 · 1,516 views · 7 months ago

Efficient NLP
The KV Cache: Memory Usage in Transformers

Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io The KV cache is what takes up the bulk ...

8:33 · 91,164 views · 2 years ago

Data Science in your pocket
What is Flash Attention?

This video explains an advancement over the Attention mechanism used in LLMs (Attention is all you need), Flash Attention ...

6:03 · 3,048 views · 1 year ago

Julien Simon
Deep dive - Better Attention layers for Transformer models

... namely Multi-Query Attention, Group-Query Attention, Sliding Window Attention, Flash Attention v1 and v2, and Paged Attention.

40:54 · 14,637 views · 1 year ago

GPU MODE
Lecture 12: Flash Attention

Uh so I'm short selling you a bit if you wanted to have live coding of the fastest flash attention kernel uh because I was a bit foolishly ...

1:12:14 · 7,148 views · 1 year ago

GPU MODE
Lecture 36: CUTLASS and Flash Attention 3

Speaker: Jay Shah. Slides: https://github.com/cuda-mode/lectures. Correction by Jay: "It turns out I inserted the wrong image for the ...

1:49:16 · 8,864 views · 1 year ago

Aleksa Gordić - The AI Epiphany
Flash Attention 2.0 with Tri Dao (author)! | Discord server talks

Become The AI Epiphany Patreon ❤️ https://www.patreon.com/theaiepiphany Join our Discord community ...

1:00:25 · 22,637 views · 2 years ago

Martin Is A Dad
FlashAttention V2 Explained By Google Engineer | Train LLM With Better Parallelism

Slides are available at https://martinisadad.github.io/ We already know from first episode that FlashAttention results in 2~4X times ...

18:16 · 1,255 views · 6 months ago

Stephen Blum
Flash Attention Machine Learning

Flash attention aims to boost the performance of language models and transformers by creating an efficient pipeline to transform ...

25:34 · 6,914 views · 1 year ago