There's a custom backprop engine which reduces actual FLOPs, and all kernels are written in OpenAI's Triton language to reduce data movement.
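To give a rough idea of what a hand-derived backward pass buys you, here is a minimal pure-Python sketch. It is not the project's actual code and not a Triton kernel, and the function names are made up for illustration. It uses softmax cross-entropy as the example: the forward pass saves the log-sum-exp, and the backward pass uses the closed-form gradient `softmax(logits) - one_hot(target)` in a single pass instead of chaining through every intermediate op.

```python
import math

# Hypothetical illustration only: names and structure are assumptions,
# not the engine described above.

def cross_entropy_forward(logits, target):
    """Return (loss, logsumexp) for one row of logits.

    The log-sum-exp is returned so the backward pass can reuse it
    instead of recomputing or storing the full softmax.
    """
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target], lse

def cross_entropy_backward(logits, target, lse):
    """Hand-derived gradient: softmax(logits) - one_hot(target)."""
    return [
        math.exp(x - lse) - (1.0 if i == target else 0.0)
        for i, x in enumerate(logits)
    ]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the manual gradient."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = (cross_entropy_forward(up, target)[0]
                - cross_entropy_forward(down, target)[0])
        grads.append(diff / (2 * eps))
    return grads
```

In a fused GPU kernel the same idea means the intermediate softmax never has to be written out to memory and read back, which is where the data-movement savings would come from.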
There's also a 2x faster inference-only notebook that runs on a free Colab! https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854...