What does HackerNews think of flash-attention?

Fast and memory-efficient exact attention

Language: Python

I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other memory-aware attention project I'm aware of.

I guess Flash Attention is more about making better use of GPU SRAM, whereas this is more about using OS/CPU memory better?
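
To make the SRAM point concrete, here is a rough NumPy sketch of the tiling idea FlashAttention is built on: attention is computed one key/value block at a time with a running (online) softmax, so the full seq_len x seq_len score matrix is never materialized and each block's working set can stay in fast on-chip memory. The block size, single-head shapes, and absence of masking are illustrative simplifications, not the actual kernel.

```python
# Minimal sketch of tiled attention with an online softmax (the core idea
# behind FlashAttention). Scores are computed one key/value block at a time
# and folded into running accumulators, so the (seq_len x seq_len) attention
# matrix is never built in full.
import numpy as np

def tiled_attention(q, k, v, block_size=128):
    """q, k, v: (seq_len, head_dim) arrays for a single attention head."""
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)

    out = np.zeros_like(q)                   # running weighted sum of values
    row_max = np.full(seq_len, -np.inf)      # running max score per query row
    row_sum = np.zeros(seq_len)              # running softmax denominator

    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]     # key block ("kept on chip")
        vb = v[start:start + block_size]     # value block
        scores = (q @ kb.T) * scale          # (seq_len, block) partial scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale the previous accumulators to the new max, then add this block.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]
```

The output matches a naive softmax(QK^T / sqrt(d))V computation; the point is only that peak memory scales with the block size rather than with the sequence length.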

In 2016, Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had a practical limit of maybe 100 words.

With new implementations like xformers[1] and FlashAttention[2], it is unclear where the length limit is for modern Transformer models.

Flash Attention can currently scale up to 64,000 tokens on an A100; a rough usage sketch follows the links below.

[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...

[2] https://github.com/HazyResearch/flash-attention
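
For context on what running attention at long sequence lengths looks like in practice, here is a hedged sketch that calls a fused attention kernel through PyTorch 2.x's torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention-style backend on supported GPUs and dtypes. The shapes and sequence length are made up for illustration, and whether the fused path is actually selected depends on your hardware and PyTorch build.

```python
# Sketch of invoking a fused, memory-efficient attention kernel via PyTorch.
# These kernels avoid materializing the (seq_len x seq_len) score matrix,
# which is what makes sequence lengths in the tens of thousands feasible.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
seq_len = 8192 if device == "cuda" else 512   # keep the CPU fallback small
batch, heads, head_dim = 1, 16, 64

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On suitable GPUs and dtypes this dispatches to a FlashAttention-style kernel;
# otherwise PyTorch falls back to the standard math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```

The stock PyTorch entry point is used here for portability; the linked HazyResearch repo ships its own Python interface as well, and xformers exposes a similar memory-efficient attention op.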