Show HN: I found a way to increase the performance of GPU cluster scheduling

2 points

2 years ago

I started this project about a year ago, when I began building new applications with GPT and GPU clouds. I found that these applications typically run on GPU clouds in industrial environments, where an LLM request may cost ten times as much as a traditional query. Previous research has therefore focused on optimizing LLM inference to increase throughput and reduce the cost of LLM services. The proposed methods fall into five general categories: computation, weight or activation quantization, memory management, kernel operation optimization, and batch request processing.

However, there has been little research on stable, automatic scaling of serverless LLM services in multi-GPU clusters. It is still hard to provide scheduling services that free LLM developers from spending time on keeping their LLM services stable and scalable.

To address this issue, I built ENOVA, a deployment, monitoring, and automatic scaling service designed specifically for serverless LLM services. ENOVA breaks down the execution process of an LLM service and, based on that analysis, provides a configuration recommendation module for automatic deployment on any GPU cluster and a performance monitoring module for automatic scaling. On top of these two modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. Experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for widespread deployment in large-scale online systems.
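
To make the monitoring-driven scaling idea concrete, here is a minimal sketch of a load-based replica recommendation. This is not ENOVA's actual code: the function name, its parameters, and the utilization-target rule are all assumptions chosen for illustration, and a real system would likely use richer signals than request rate alone.

```python
import math


def recommend_replicas(observed_rps: float,
                       capacity_per_replica: float,
                       target_utilization: float = 0.7,
                       min_replicas: int = 1,
                       max_replicas: int = 8) -> int:
    """Hypothetical helper: pick a replica count for an LLM service.

    observed_rps          request rate reported by a monitoring module (assumed)
    capacity_per_replica  requests/second one replica can sustain (assumed)
    target_utilization    fraction of capacity each replica should run at
    """
    if observed_rps < 0:
        raise ValueError("observed_rps must be non-negative")
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Each replica is budgeted only a fraction of its capacity, leaving headroom.
    usable_capacity = capacity_per_replica * target_utilization
    needed = math.ceil(observed_rps / usable_capacity)

    # Clamp to the allowed range so the service never scales to zero
    # or beyond the GPUs assumed to be available.
    return max(min_replicas, min(max_replicas, needed))
```

With these assumed defaults, a service seeing 30 requests/second on replicas that each handle 10 would be sized at 5 replicas, and the result is always clamped between `min_replicas` and `max_replicas`.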

The project is open source for anyone who is interested in solving this issue. Let me know what you think of the solution.
