But after searching for it, it turned out I was wrong: ffmpeg can't do this out of the box, nor can it read its cuts from an EDL file [1].
So I set out to write the simplest solution I could: build a list of select filters, apply them to both video and audio, and keep only the non-silent parts.
The 'silencedetect' audio filter detects silences and logs a timestamp for the start and end of each one. One can pipe that output to a Python script that builds an ffmpeg command with the matching select filters.
The resulting ffmpeg command can be run to re-encode the video without the silent parts.
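To make the idea concrete, here is a minimal sketch of what such a script could look like. It is not the exact script I wrote; the function names (parse_silences, keep_expression, build_command) are made up for illustration, and it assumes the log contains the usual "silence_start: X" / "silence_end: Y" lines that silencedetect prints.

```python
import re

# Patterns for the timestamps that silencedetect logs (assumed log format).
SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return (start, end) pairs; end is None if a silence runs to the end."""
    silences = []
    start = None
    for line in log_text.splitlines():
        m = SILENCE_START.search(line)
        if m:
            # The reported start can be slightly negative; clamp it to 0.
            start = max(0.0, float(m.group(1)))
            continue
        m = SILENCE_END.search(line)
        if m and start is not None:
            silences.append((start, float(m.group(1))))
            start = None
    if start is not None:
        silences.append((start, None))
    return silences


def keep_expression(silences):
    """Build a select expression that is non-zero outside the silences."""
    terms = []
    cursor = 0.0  # start of the current non-silent stretch
    for start, end in silences:
        if start > cursor:
            terms.append(f"between(t,{cursor:.3f},{start:.3f})")
        if end is None:
            break  # silent until the end of the file: nothing left to keep
        cursor = end
    else:
        # No open-ended silence: keep everything after the last one.
        terms.append(f"gte(t,{cursor:.3f})")
    return "+".join(terms) or "0"


def build_command(src, dst, expr):
    """Return an ffmpeg argument list that keeps only the selected parts."""
    return [
        "ffmpeg", "-i", src,
        # setpts/asetpts regenerate timestamps so the kept parts are contiguous.
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        dst,
    ]
```

The log could come from something like "ffmpeg -i in.mp4 -af silencedetect=noise=-30dB:d=0.5 -f null - 2> silence.log" (the thresholds are arbitrary examples, tune them for your audio). Feed the file contents to parse_silences, then hand the result of build_command to subprocess.run.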
Hope this helps.
- - -
Notes:
1/ I did find something somewhat similar to what I ended up doing [2], but it's complicated, and does the video encoding inside of Python. Granted, moviepy uses ffmpeg under the hood, but it's still better if ffmpeg does all the encoding directly.
2/ This solution doesn't produce a myriad of intermediate files to recombine afterwards; the output video is encoded exactly once. 'silencedetect' is very fast, so the whole process should be pretty fast.
[0] https://news.ycombinator.com/item?id=31767830
[1] https://en.wikipedia.org/wiki/Edit_decision_list
[2] https://donaldfeury.xyz/remove-the-silent-parts-of-a-video-u...