Instead of running your neural networks either fully on cloud or fully on edge, why not split it between the two? Saves the cost of transporting all the weights to the edge, and saves cloud gpu inference costs, since they only have to infer on a subset of the weights.Essentially, building a load balancer which acts at the level of individual neural network weights. It must be completely asynchronous and does not affect user experience at all.