However, there has been little research on stable, automatic scaling of serverless LLM services in multi-GPU clusters. Scheduling services that free LLM developers from the work of keeping their services stable and scalable remain an open challenge.
To address this issue, I built ENOVA, a deployment, monitoring, and automatic scaling service designed specifically for serverless LLM services. ENOVA breaks down the execution process of an LLM service and, based on that analysis, provides a configuration recommendation module for automatic deployment on any GPU cluster and a performance monitoring module for automatic scaling. On top of these two modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. Experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for widespread deployment in large-scale online systems.
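To make "performance monitoring for automatic scaling" concrete, here is a minimal Python sketch of a generic load-based scaling rule. It is not ENOVA's actual algorithm or API: the `Metrics` class, the `desired_replicas` function, and the default thresholds are all hypothetical names and values chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot of monitored service metrics (illustrative only)."""
    requests_per_second: float   # observed request arrival rate
    per_replica_capacity: float  # requests/s one replica sustains (assumed measured)


def desired_replicas(
    m: Metrics,
    min_replicas: int = 1,
    max_replicas: int = 8,
    target_utilization: float = 0.7,
) -> int:
    """Return a replica count that keeps each replica near the target utilization.

    A generic threshold-style rule, not ENOVA's method.
    """
    if m.per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    # Replicas needed so that each one runs at about target_utilization of capacity.
    needed = math.ceil(
        m.requests_per_second / (m.per_replica_capacity * target_utilization)
    )
    # Clamp to the allowed range so the service never scales to zero or past the cap.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    print(desired_replicas(Metrics(requests_per_second=10, per_replica_capacity=5)))
```

A real system would feed this kind of rule from live monitoring data and would also need smoothing or cooldown periods to avoid scaling up and down too frequently; the sketch leaves those out on purpose.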
ENOVA is open source and available to anyone interested in this problem. Let me know what you think of the solution.