Efficient AI · Stanford University
FlashAttention: The Attention Speedup That Came From Reading GPU Memory
FlashAttention keeps attention exact but makes it IO-aware: it tiles the computation to cut reads and writes to slow GPU memory, making long-sequence Transformers faster and cheaper.
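
To make "exact but tiled" concrete, here is a minimal pure-Python sketch of the arithmetic behind the idea. It is an illustration under stated assumptions, not the actual kernel. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical. The sketch tiles only over keys and values and runs on the CPU, whereas a real implementation would run on the GPU and hold each tile in fast on-chip memory. What it does show is how a running maximum and a running sum let the softmax be computed one tile at a time, without ever holding the full row of scores.

```python
import math


def naive_attention(q, k, v):
    """Reference version: computes every score in a row before the softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
                    for j in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but keys/values are visited one tile at a time.

    Hypothetical sketch: only one tile of scores exists at any moment.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for q_row in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * d_v         # unnormalized output, relative to running_max
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q_row, k_row))
                      for k_row in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first tile this is exp(-inf) == 0.0, which is harmless.
            correction = math.exp(running_max - new_max)
            probs = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(probs)
            acc = [a * correction + sum(p * v_row[j] for p, v_row in zip(probs, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Calling both functions on the same small `q`, `k`, `v` lists should give outputs that agree up to floating-point rounding for any `block_size`, which is the sense in which the tiled computation stays exact rather than approximate.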