Show HN: Automatic chaptering – From raw transcripts to structured documents

5 points

2 years ago

Hey HN,

Over the last few weeks I have been working on automatically extracting a table of contents from a raw (audio or video) transcript, aka a 'chaptering' task.

That turned out to be more difficult than I initially thought, especially because I needed to keep the timestamp data and had to deal with long transcripts: LLMs tend to 'forget' parts of the input when it is too long.

I was also surprised that I could not find any open-source solution for this in standard libraries like LangChain or LlamaIndex, despite the wide range of possible use cases (text summarization, referencing, information retrieval in RAGs, ...).

I therefore ended up putting together a workflow that turns out to work pretty well. It relies on LLMs for the different language processing subtasks (text formatting, paragraph structuring, chapter segmentation and title generation), and on TF-IDF statistics to add timestamps back after the paragraph structuring.
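To give a rough idea of the timestamp step, here is a minimal sketch of the general idea, not the actual code from the repo. It assumes the raw transcript comes as (start_seconds, text) segments and that the LLM returns the paragraphs in transcript order. The function names, the `opening_words` parameter and the sample data are all made up for illustration. Each paragraph's opening words are compared to every segment with TF-IDF cosine similarity, and the paragraph takes the start time of the best match.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF so that no term gets a zero or negative weight.
        vec = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def restore_timestamps(segments, paragraphs, opening_words=12):
    """Attach a start time to each paragraph.

    segments: non-empty list of (start_seconds, text) from the raw transcript.
    paragraphs: LLM-structured paragraphs, assumed to be in transcript order.
    """
    openings = [" ".join(p.split()[:opening_words]) for p in paragraphs]
    vectors = tfidf_vectors([text for _, text in segments] + openings)
    seg_vecs = vectors[: len(segments)]
    par_vecs = vectors[len(segments):]

    result = []
    cursor = 0  # never search before the previous match, order is preserved
    for paragraph, vec in zip(paragraphs, par_vecs):
        best = max(
            range(cursor, len(segments)),
            key=lambda i: cosine(vec, seg_vecs[i]),
        )
        result.append((segments[best][0], paragraph))
        cursor = best
    return result


# Made-up sample data, only to show the expected input shapes.
segments = [
    (0.0, "hey everyone welcome back to the channel"),
    (4.2, "today we are going to talk about sourdough bread"),
    (9.8, "first you need to feed your starter with flour and water"),
    (15.1, "then let the dough rise overnight in the fridge"),
]
paragraphs = [
    "Hey everyone, welcome back to the channel. Today we are going to "
    "talk about sourdough bread.",
    "First, you need to feed your starter with flour and water. Then let "
    "the dough rise overnight in the fridge.",
]
for start, paragraph in restore_timestamps(segments, paragraphs):
    print(f"{start:6.1f}s  {paragraph[:50]}")
```

Matching only on the opening words keeps a long paragraph from being pulled towards a segment in its middle, and the forward-only cursor relies on the assumption that the LLM does not reorder the text.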

I just open-sourced the code and made a summary of the methodology here: https://medium.com/@ya-lb/automate-video-chaptering-with-llm...

If you know of any other tool or open-source library that does similar automatic chaptering of text, please share, I'd be glad to hear about it!