Is training DL models in the cloud too expensive?


9 years ago

Does anyone here have experience training models with Google's Cloud ML? We're currently training a model based on Food-2000 that takes about 5 days using a single K80 on a local machine. I'd like to estimate what it would cost to do this faster using Google Cloud ML.

My estimates use the pricing located here: https://cloud.google.com/ml-engine/pricing#machine_types_for_custom_cluster_configurations

Cost = (ML training units * cost per unit / 60) * job duration in minutes
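For anyone who wants to plug in their own numbers, here's a minimal sketch of that formula in Python. The function and argument names are my own for illustration, not anything from the Cloud ML docs, and it assumes the "cost per unit" is an hourly price (hence the division by 60).

```python
def training_cost(training_units, price_per_unit_hour, duration_minutes):
    """Cost = (ML training units * cost per unit / 60) * job duration in minutes.

    Assumes price_per_unit_hour is quoted per hour, so dividing by 60
    converts it to a per-minute rate.
    """
    per_minute_rate = training_units * price_per_unit_hour / 60
    return per_minute_rate * duration_minutes


if __name__ == "__main__":
    # Made-up inputs just to exercise the formula: 1 unit at $1.00/hour for 90 minutes.
    print(f"${training_cost(1, 1.00, 90):.2f}")
```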

The "ML training units" value is 3 for a standard_gpu and 12 for a complex_model_m_gpu. I'm assuming a standard_gpu is equivalent to a single GPU on the K80 (which has two GPUs). So my assumption is that a complex_model_m_gpu costs 4x as much because it's equivalent to four of those GPUs, i.e. 2 x K80s.

The "cost per unit" in the US is $0.49 per hour. And since I'd be training on 2 x K80s in the cloud instead of one, my training time should be closer to 2.5 days (assuming it scales roughly linearly with the number of GPUs), which is 60 hours.

Cost = 12 * $0.49 * 60 = $353. Given that a K80 costs $4,000 on Amazon, it would take about 22.7 training runs to match the price of 2 x K80s ($8,000). But we ran multiple experiments to fine-tune our model to this point, so we've likely gone way past 22.7 training runs in total. Maybe running in the cloud is too expensive?
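Here's a rough sketch of the break-even arithmetic so the inputs are easy to change. It uses the $0.49 per-hour US rate quoted above, 12 training units, a 60-hour job, and $4,000 per K80. The helper names are hypothetical, and it only compares GPU purchase price against per-job cloud cost, nothing else.

```python
def cloud_job_cost(training_units, price_per_unit_hour, hours):
    """Cost of one cloud training run, with the job duration given in hours."""
    return training_units * price_per_unit_hour * hours


def break_even_runs(hardware_cost, cost_per_run):
    """Number of cloud runs whose total cost equals buying the hardware."""
    return hardware_cost / cost_per_run


# Inputs taken from the post; change these to test other assumptions.
UNITS = 12               # complex_model_m_gpu
PRICE_PER_UNIT = 0.49    # USD per training unit per hour (US rate quoted above)
JOB_HOURS = 60           # 2.5 days, assuming 2 x K80s halve the 5-day run
K80_PRICE = 4000         # USD per K80 card

run_cost = cloud_job_cost(UNITS, PRICE_PER_UNIT, JOB_HOURS)
runs = break_even_runs(2 * K80_PRICE, run_cost)

if __name__ == "__main__":
    print(f"Cost per run: ${run_cost:.2f}")
    print(f"Runs to match 2 x K80s: {runs:.1f}")
```

Swapping in a different hourly rate or job length changes the break-even point proportionally, so it's worth rerunning with whatever the pricing page currently lists.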
