What does HackerNews think of flash-attention?
Fast and memory-efficient exact attention
Language: Python
I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other memory-aware attention project I know of.
I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about using OS/CPU memory better?
FlashAttention's memory usage is linear in sequence length.
https://github.com/HazyResearch/flash-attention
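To make the linear-memory claim concrete, here is a minimal NumPy sketch of the tiling / online-softmax idea (an illustration of the technique, not the repo's actual CUDA kernel): keys and values are streamed in blocks and the softmax is renormalized on the fly, so the full N x N score matrix is never materialized.

    import numpy as np

    def naive_attention(q, k, v):
        # Standard attention: materializes an (N, N) score matrix -> O(N^2) memory.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def tiled_attention(q, k, v, block=128):
        # Online-softmax tiling: only O(N * d) extra state, no (N, N) matrix.
        n, d = q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(q)
        row_max = np.full(n, -np.inf)   # running max of scores per query row
        row_sum = np.zeros(n)           # running softmax denominator per row
        for start in range(0, k.shape[0], block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = (q @ kb.T) * scale      # scores for this key/value tile only
            new_max = np.maximum(row_max, s.max(axis=-1))
            correction = np.exp(row_max - new_max)
            p = np.exp(s - new_max[:, None])
            row_sum = row_sum * correction + p.sum(axis=-1)
            out = out * correction[:, None] + p @ vb
            row_max = new_max
        return out / row_sum[:, None]

    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
    assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-6)

The running max and running denominator are what let each tile's partial results be folded into the output without ever revisiting earlier tiles.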
In 2016 Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had a practical context limit of maybe 100 words.
With new implementations like xformers[1] and FlashAttention[2], it is unclear where the length limit is for modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
[2] https://github.com/HazyResearch/flash-attention
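For a sense of how these kernels are typically used: one low-friction way to try a FlashAttention-style kernel is PyTorch 2.x's scaled_dot_product_attention, which can dispatch to a fused flash kernel on supported GPUs. A hedged sketch; the attainable sequence length depends on hardware, dtype, and head size, and the 8192 below is an arbitrary example rather than the 64K A100 figure.

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    seq_len = 8192 if device == "cuda" else 512   # keep the CPU fallback small

    # Shapes are (batch, heads, seq_len, head_dim).
    batch, heads, head_dim = 1, 16, 64
    q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # On a supported GPU the fused kernel never materializes the
    # (seq_len, seq_len) score matrix; the CPU math fallback does.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # e.g. torch.Size([1, 16, 8192, 64]) on a GPU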