Show HN: RepoGPT – question answering over repos using contextual chunking

3 years ago

Hello! RepoGPT is a tool I created for question answering over code repositories using contextual chunking.

The backstory: I was trying out LangChain's demo for question answering over code repositories, and it would often produce incorrect responses because the code chunks that files were split into lacked context. For example, a particularly long method might be split across multiple chunks. If answering the question requires the LLM to understand that method, then all of those chunks need to be returned by the vector store, which will not necessarily happen if the query is not semantically similar enough to each chunk. And even if the correct chunks are given to the LLM, it might not understand how they fit together.

RepoGPT tries to fix this by adding contextual information to each chunk: the file it came from, the line numbers it spans, and most importantly the classes and methods that the chunk may be part of. That last piece is extracted with Python's ast module for Python code and with py-tree-sitter for other file types (languages other than Python have not been tested much, though).

After playing around with it, it seems to perform pretty well, particularly with GPT-4. It's really cool to see how the LLM is able to understand entire methods that are split across multiple chunks based on the added contextual information. I really think this allows LLMs to answer a wider array of questions across repos.

Please play around with it; instructions are in the README. I've tried adding local LLMs as well, but those are very experimental right now.