for the CPU backend, I used only Eigen for linear algebra, nothing else.
for the GPU backend, I implemented my own custom matrix library in CUDA C++. The CUDA kernels aren’t optimized with shared memory, tiling, or fused ops (so there’s some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.
that said, I've taken care to ensure coalesced memory access, and it gives pretty solid performance, around 0.4 ms per epoch on MNIST (batch size = 1000) using an RTX 3060.
This project is a big step up from my previous one. It's cleaner, well-documented, and more modular.
I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.
would love to hear your thoughts, suggestions, or feedback
ive attached the link to my GitHub Repo