currently, I'm having a performance issue with my little side project `tonehub`[1]. It's a small Web API including a background indexer task for audio files in pretty early state.
Introduction: Sometimes I move an audio file to another directory, because the metadata changed. This results in losing all of the files non-metadata history (playback count, current position, playlists, etc.).
To overcome this, I implemented hashing via xxhash only for the audio-part of the file skipping the metadata part. If a file is indexed, but its location is not found in the database, it hashes the file, looks it up and if a unique match is present, it updates only the location keeping the history and releations.
Now to my problem: It's too slow. I have many audio book files in m4b format, most of the time bigger than 200MB and hashing a file like this takes pretty long, long enough that indexing a whole library feels to slow in my opinion.
So I thought about following alternatives to improve that:
- Hashing only a fixed length part of of the file (e.g. 5MB around the midpoint position, because of intros and outros are often the same)
- Hashing a percentage size part (e.g. 5% of the audio data size)
- Combine one of these "partial" hashes with a size check (e.g. hash=0815471 + size=8340485bytes, because hash and size collision may be less likely)?
It feels like that won't work. So I ask HN: Would one of these alternatives be enough to avoid collisions?
If so, what would be a "sufficient" part of the file and which alternative is the best?
Thank you[1] https://github.com/sandreas/tonehub