Show HN: Open-source code search with OpenAI's function calling

19 points

3 years ago

We're excited to share a tool we've been working on called gpt-code-search. It allows you to search any codebase using natural language locally on your machine. We leverage OpenAI's GPT-4 and function calling to retrieve, search, and answer queries about your code.

All you need to do is to install the package with `pip install gpt-code-search`, set up your `OPENAI_API_KEY` as an environment variable, and start asking questions with `gpt-code-search query <your question>`.

E.g. You can ask questions like "How do I use the analytics module?" or "Document all the API routes related to authentication."

This is still early and hacked together in the past week, but we wanted to get it out there and get feedback.

We utilize OpenAI's function calling to let GPT-4 call certain predefined functions in our library. You do not need to implement any of these functions yourself. These functions are designed to interact with your codebase and return enough context for the LLM to perform code searches without pre-indexing it or uploading your repo to a third party other than OpenAI. So, you only need to run the tool from the directory you want to search.

The functions currently available for the LLM to call are:

`search_codebase` - searches the codebase using a TF-IDF vectorizer

`get_file_tree` - provides the file tree of the codebase

`get_file_contents` - provides the contents of a file

These functions are implemented in `gpt-code-search` and are triggered by chat completions. The LLM is prompted to utilize the search_codebase and get_file_tree function as needed to find the necessary context to answer your query and then loops as needed to collect more context with the get_file_contents until the LLM responds.

A couple of limitations of this approach, GPT cannot load context across multiple files in a single prompt since we are passing in the contents of a single file in each function call. So, GPT repeatedly calls the get_file_contents function to load context from multiple files. This increases the latency and cost of the tool.

Another thing we realized as we were building is that the level of search and retrieval is limited by the context window, which refers to the scope of the search conducted by the tool, meaning that we can only search five levels deep in the file system and can only pass in the contents of one file at a time. So it would be best to run the tool from the package/directory closest to the code you want to search.

We plan to add support for local vector embeddings to improve search and retrieval. Combining the vector embeddings with function calling should result in much faster and higher quality results.

Also, support for other models, chat interactions in the command line, and generating code is already on our backlog!

Please check out gpt-code-search and let me know your thoughts, feedback, or suggestions.

7 comments