What does HackerNews think of flash-attention?
Fast and memory-efficient exact attention
Language: Python
I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other memory-aware attention project I know of.
I guess Flash Attention is more about using GPU SRAM efficiently, whereas this is more about using OS/CPU memory better?
FlashAttention's memory usage is linear in sequence length.
https://github.com/HazyResearch/flash-attention
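To make the linear-memory claim concrete, here is a minimal NumPy sketch of the tiling / online-softmax idea (an illustration of the technique, not the repo's actual CUDA kernel): keys and values are streamed in blocks and the softmax is renormalized on the fly, so the full N x N score matrix is never materialized.

    import numpy as np

    def naive_attention(q, k, v):
        # Standard attention: materializes an (N, N) score matrix -> O(N^2) memory.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def tiled_attention(q, k, v, block=128):
        # Online-softmax tiling: only O(N * d) extra state, no (N, N) matrix.
        n, d = q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(q)
        row_max = np.full(n, -np.inf)   # running max of scores per query row
        row_sum = np.zeros(n)           # running softmax denominator per row
        for start in range(0, k.shape[0], block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = (q @ kb.T) * scale      # scores for this key/value tile only
            new_max = np.maximum(row_max, s.max(axis=-1))
            correction = np.exp(row_max - new_max)
            p = np.exp(s - new_max[:, None])
            row_sum = row_sum * correction + p.sum(axis=-1)
            out = out * correction[:, None] + p @ vb
            row_max = new_max
        return out / row_sum[:, None]

    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
    assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-6)

The running max and running denominator are what let each tile's partial results be folded into the output without ever revisiting earlier tiles.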
In 2016 Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had a practical context limit of maybe 100 words.
With new implementations like xformers[1] and FlashAttention[2], it is unclear where the length limit is for modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
[2] https://github.com/HazyResearch/flash-attention
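For a sense of how these kernels are typically used: one low-friction way to try a FlashAttention-style kernel is PyTorch 2.x's scaled_dot_product_attention, which can dispatch to a fused flash kernel on supported GPUs. A hedged sketch; the attainable sequence length depends on hardware, dtype, and head size, and the 8192 below is an arbitrary example rather than the 64K A100 figure.

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    seq_len = 8192 if device == "cuda" else 512   # keep the CPU fallback small

    # Shapes are (batch, heads, seq_len, head_dim).
    batch, heads, head_dim = 1, 16, 64
    q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # On a supported GPU the fused kernel never materializes the
    # (seq_len, seq_len) score matrix; the CPU math fallback does.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # e.g. torch.Size([1, 16, 8192, 64]) on a GPU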