This experience led me to develop KnowLang (https://github.com/kimgb415/know-lang), an open-source tool aimed at helping engineers gain a comprehensive understanding of enterprise-level codebases.
The current version of KnowLang follows a straightforward technical approach:
1. Parse the code
2. Summarize the code
3. Index both code & summary into a vector database
4. Provide a RAG (Retrieval-Augmented Generation) chatbot interface
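As a rough illustration of those four steps, here is a minimal, self-contained sketch. It is not KnowLang's actual implementation: the function names (`parse_chunks`, `summarize`, `embed`, `build_index`, `retrieve`) are hypothetical, the summary is a placeholder where an LLM call would go, and a toy bag-of-words vector stands in for a real embedding model and vector database.

```python
import ast
import math
import re
from collections import Counter

SOURCE = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def read_config(path):
    """Load a configuration file from disk."""
    with open(path) as f:
        return f.read()
'''

def parse_chunks(source):
    """Step 1: split a module into one chunk per top-level function."""
    tree = ast.parse(source)
    return [
        {
            "name": node.name,
            "code": ast.get_source_segment(source, node),
            "doc": ast.get_docstring(node) or "",
        }
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

def summarize(chunk):
    """Step 2: placeholder summary; a real system would call an LLM here."""
    return f"Function {chunk['name']}: {chunk['doc']}"

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(source):
    """Step 3: index code and summary together (an in-memory list, not a vector DB)."""
    index = []
    for chunk in parse_chunks(source):
        summary = summarize(chunk)
        vector = embed(chunk["code"] + " " + summary)
        index.append({**chunk, "summary": summary, "vector": vector})
    return index

def retrieve(index, question, k=1):
    """Step 4, retrieval half: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:k]

index = build_index(SOURCE)
top = retrieve(index, "how is the configuration file loaded?")
print(top[0]["name"])
```

In a full RAG setup, the retrieved code and summaries would then be placed into the chatbot's prompt as context for the LLM's answer.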
While this may not seem groundbreaking compared to existing RAG systems, I plan to focus on the following enhancements:
1. Developing code-specific RAG systems with code-focused reranking and embeddings (e.g., voyage-code-3)
2. Implementing an automatic LLM fine-tuning system to reduce reliance on RAG alone
3. Enabling inter-repository knowledge awareness through multiple layers of RAG
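To make the first item more concrete, here is a toy sketch of what "code-focused" reranking could mean: candidates from a first retrieval pass are rescored with a tokenizer that understands snake_case and camelCase identifiers. This is only an assumption-laden illustration. The `code_tokens` and `rerank` helpers are hypothetical, and a simple token-overlap score stands in for a real reranking model.

```python
import re

def code_tokens(text):
    """Lowercase tokens, splitting snake_case and camelCase identifiers into words."""
    tokens = set()
    for word in re.findall(r"[A-Za-z]+", text):
        # "parseConfigFile" -> parse, Config, File; "HTTPServer" -> HTTP, Server
        for part in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word):
            tokens.add(part.lower())
    return tokens

def rerank(question, candidates, k=1):
    """Rescore candidates by Jaccard overlap of identifier-aware tokens."""
    q = code_tokens(question)

    def score(candidate):
        c = code_tokens(candidate)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:k]

# Hypothetical output of a first-stage vector search.
candidates = [
    "def render_page(template): ...",
    "def parseConfigFile(path): ...",
    "def save_user(user): ...",
]
best = rerank("where do we parse the config file", candidates)
print(best[0])
```

A plain whitespace tokenizer would see `parseConfigFile(path):` as a single opaque token and miss the match, which is the kind of gap code-specific embeddings and rerankers aim to close.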
I've published a post on the Hugging Face blog detailing KnowLang's architecture and performance: https://huggingface.co/blog/gabykim/knowlang-first-demo
You can try out KnowLang via:
- Trying the demo Hugging Face Space: https://huggingface.co/spaces/gabykim/KnowLang_Transformers_...
- Installing the package: `pip install knowlang`
- Exploring the GitHub repository: https://github.com/kimgb415/know-lang
I would love to hear your thoughts, feedback, and suggestions. Feel free to ask any questions – I'll be around to discuss and learn from the community.
Looking forward to your insights!
Gaby