I recently found myself knee-deep in a fascinating challenge that I think you'll find intriguing. I'm a backend developer with a penchant for Node.js and SQL, and I've been wrestling with a problem that's both technically intricate and surprisingly common in the world of open-source development.
The issue at hand? Duplicate GitHub issues. They're like weeds in a garden – you turn your back for a second, and suddenly they're everywhere, choking out the valuable discussions and dragging down productivity.
So, I set out to build a bot. Not just any bot, mind you, but a Probot GitHub app that could intelligently detect and flag duplicates before they multiplied. It was a problem that scratched my own itch, and I figured if it worked for me, it might just work for others too.
The idea was simple enough: whenever a new issue is opened, the bot would compare it against existing issues using OpenAI's embeddings to measure textual similarity. But the simplicity was deceptive. I needed to store and query embeddings efficiently, which led me down the rabbit hole of Supabase and its vector extension. I had to learn the nuances of vector similarity, and let me tell you, it's a topic that can get as thorny as the problem it's trying to solve.
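To make "textual similarity" concrete, here is a minimal sketch of cosine similarity between two embedding vectors. It is an illustration, not the bot's actual code: the function name `cosineSimilarity` is hypothetical, and a vector extension would normally compute this inside the database rather than in application code.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; values near 1 mean the texts point in
// nearly the same direction in embedding space.
// NOTE: the name `cosineSimilarity` is illustrative, not from the bot's source.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // A zero vector has no direction, so treat it as "not similar".
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two issues whose embeddings score close to 1 are candidates for being duplicates. Where to draw the line is a tuning decision that depends on the repository.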
As I iterated on the solution, I discovered nuances I hadn't anticipated: the importance of metadata such as issue labels, repo ID, and issue ID, for example, and the fact that the bot needed to handle multiple repos, each with its own set of issues. The list goes on, so I focused on the simplest possible solution that could still be effective. I wanted to build something that would be useful to the community, not just a technical exercise.
I'm proud of the bot I built. It detects duplicates with a high degree of accuracy, and it's open source, so anyone can use it. If you already have existing issues, the sync script loads them into the database so you can start detecting duplicates right away.
For those who love the gritty details: I leveraged TypeScript for its type safety, wrote SQL functions to handle vector operations like cosine similarity, and used OpenAI's API both to generate embeddings and to write the final comment in the issue thread. I also used GitHub's REST API to fetch issues and comments.
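The multi-repo point can be sketched as follows: every stored embedding carries its repo ID, and a new issue is only compared against issues from the same repo. This is an in-memory illustration under stated assumptions, not the bot's actual implementation. The types `StoredIssue` and `DuplicateCandidate`, the function `findLikelyDuplicates`, and the `0.85` threshold are all hypothetical, and in a real setup the filtering and ranking would more likely happen in a database query.

```typescript
// Hypothetical shape of a stored issue: metadata plus its embedding.
interface StoredIssue {
  repoId: number;
  issueNumber: number;
  title: string;
  embedding: number[];
}

// Hypothetical shape of a match returned to the caller.
interface DuplicateCandidate {
  issueNumber: number;
  title: string;
  score: number;
}

// Cosine similarity of two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare a new issue only against issues from the same repo, keep those
// at or above the threshold, and return the best matches first.
// The default threshold of 0.85 is an illustrative assumption.
function findLikelyDuplicates(
  newIssue: StoredIssue,
  stored: StoredIssue[],
  threshold: number = 0.85,
  limit: number = 5
): DuplicateCandidate[] {
  return stored
    .filter(
      (s) =>
        s.repoId === newIssue.repoId &&
        s.issueNumber !== newIssue.issueNumber
    )
    .map((s) => ({
      issueNumber: s.issueNumber,
      title: s.title,
      score: cosine(newIssue.embedding, s.embedding),
    }))
    .filter((c) => c.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

Scoping by repo ID keeps an issue in one project from being flagged as a duplicate of a similarly worded issue in an unrelated project.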
It's not perfect – no bot ever is – but it's a start. I'm curious to hear what you think, and I'd love to hear about your own experiences with duplicate issues. What challenges have you faced? What solutions have you tried? What would you like to see in a bot like this?