We hand-derived the backpropagation steps, carefully chose the bracketing of chained matrix multiplications, wrote all kernels in OpenAI’s Triton language, and applied lots of maths and coding trickery!
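To give a feel for why bracketing matters, here is a minimal sketch that counts the scalar multiplications for a three-matrix chain under both groupings. The shapes (a thin middle dimension `r` between two wide ones) and the helper names `matmul_cost` and `chain_costs` are assumptions chosen for illustration, not taken from the Unsloth code.

```python
def matmul_cost(m, k, n):
    """Scalar multiplications for an (m x k) @ (k x n) product."""
    return m * k * n


def chain_costs(n, d, r, k):
    """Cost of X @ A @ B under both bracketings.

    Hypothetical shapes: X is n x d, A is d x r, B is r x k.
    """
    left_first = matmul_cost(n, d, r) + matmul_cost(n, r, k)   # (X @ A) @ B
    right_first = matmul_cost(d, r, k) + matmul_cost(n, d, k)  # X @ (A @ B)
    return left_first, right_first


# Illustrative sizes only: a thin middle dimension makes the gap large.
left, right = chain_costs(n=4096, d=4096, r=16, k=4096)
print(f"(X @ A) @ B: {left:,} multiplications")
print(f"X @ (A @ B): {right:,} multiplications")
print(f"ratio: {right / left:.1f}x")
```

Both groupings give the same result mathematically, but with these sizes multiplying through the thin dimension first avoids ever forming the large `d x k` product, which is where the savings come from.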
Our open-source version finetunes Llama 2x faster and uses 50% less memory. Try it at https://github.com/unslothai/unsloth. Any feedback would be appreciated! Discord: https://discord.gg/nsS4V5Z6ge