Did it crash?
Has the loss saturated?
How’s the accuracy doing?
Should I cut my losses and start training with another set of hyperparams, or let this experiment converge more?
Am I wasting GPU hours due to a dataloader bottleneck?
Given that the size of datasets and model params has only gotten bigger, the time to figure out a nice converging architecture has gotten longer.
Plus I am easily paying $5-$12 per hour for GPUs, so if there’s something fishy going on with my training sesh, I want to be notified about it as quickly as possible, even if I am at the gym, on the train or on a date.
I also want to stop and restart experiments remotely with a few hyperparams changed, so that I waste no time.
What I want to know is how many others on here have felt a similar need. Idk, maybe it's just my workaholic arse?
So I built a dead simple app that lets you monitor your experiments from your iPhone. I am calling it Supamodel. Here is how it works:
1. Install the app from the App Store: https://apps.apple.com/us/app/supamodel/id6517357968
2. Sign in with Google or email/password
3. Tap the menu button in the top left, and copy the API key
4. Use it to log in to supamodel in code:
    import supamodel
    supamodel.login("API_KEY")
Now you can log any metric using supamodel.log and it will show up in the app. It automatically logs GPU usage too.
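As a rough sketch of how that could slot into a training loop: the loop below takes the logging function as a parameter, so it runs as-is with a stand-in logger and no API key. The toy loss curve, the `train` helper and the dict-style call to `supamodel.log` in the final comment are all assumptions for illustration; check the package README for the real signature.

```python
import math
from typing import Callable, Dict, List

# Any callable that accepts a dict of metric name -> value.
MetricLogger = Callable[[Dict[str, float]], None]


def train(log_metrics: MetricLogger, steps: int = 5) -> float:
    """Toy loop standing in for real training; reports metrics every step."""
    loss = 0.0
    for step in range(1, steps + 1):
        loss = math.exp(-0.5 * step)   # fake, steadily decreasing loss
        accuracy = 1.0 - loss          # fake accuracy derived from the loss
        log_metrics({"step": float(step), "loss": loss, "accuracy": accuracy})
    return loss


# Stand-in logger: collect metrics in a list instead of sending them anywhere.
history: List[Dict[str, float]] = []
final_loss = train(history.append)

# With the package installed and supamodel.login(...) already called, the
# stand-in could be swapped for something like:
#     train(lambda metrics: supamodel.log(metrics))
# ASSUMPTION: supamodel.log accepts a dict of metrics; this is not confirmed.
```

Passing the logger in keeps the training code independent of any one monitoring tool, which also makes it easy to dry-run locally.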
You can also set alerts for the metrics you log so that if they fall below or exceed a certain number you get a push notification.
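To make the alert rule concrete, here is a minimal sketch of a "below or above a threshold" check. This only illustrates the idea; it is not how the app implements alerts, and the `breaches` helper and its parameters are hypothetical names.

```python
import math
from typing import Optional


def breaches(value: float,
             lower: Optional[float] = None,
             upper: Optional[float] = None) -> bool:
    """Return True when value falls below `lower` or exceeds `upper`."""
    # NaN compares False against everything, so a diverged loss would slip
    # past both checks below; treat it as a breach explicitly.
    if math.isnan(value):
        return True
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


# Example: alert if accuracy drops under 0.7 or loss climbs over 2.5.
accuracy_alert = breaches(0.62, lower=0.7)
loss_alert = breaches(3.0, upper=2.5)
```

The explicit NaN check matters for training runs, since a loss that blows up to NaN is exactly the kind of thing you would want a notification for.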
Give it a try, and if you have any questions or feedback please let me know in the comments.
Link to the Python package: https://github.com/sarangzambare/supamodel-py
Thanks!