Ask HN: Does hashing only part of a file make sense as a unique checksum?

4 years ago

Hey HN,

I'm currently having a performance issue with my little side project `tonehub`[1]. It's a small Web API with a background indexer task for audio files, still at a pretty early stage.

Introduction: Sometimes I move an audio file to another directory because its metadata changed. This results in losing all of the file's non-metadata history (playback count, current position, playlists, etc.).

To overcome this, I implemented hashing via xxhash of only the audio part of the file, skipping the metadata part. If a file is indexed but its location is not found in the database, the indexer hashes the file and looks the hash up; if there is exactly one match, it updates only the location, keeping the history and relations.
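To make the current approach concrete, here is a minimal Python sketch of it, not the actual tonehub code. It assumes the byte offset and length of the audio data are already known from the container parser, uses `hashlib.blake2b` as a stand-in for xxhash so it runs without extra packages, and uses a plain list of dicts as a stand-in for the database. All function and field names are hypothetical.

```python
import hashlib


def hash_audio_part(path, audio_offset, audio_length, chunk_size=1 << 20):
    """Hash only the audio payload, skipping metadata before and after it.

    audio_offset and audio_length are assumed to come from a container parser.
    blake2b is a stand-in for xxhash here.
    """
    h = hashlib.blake2b(digest_size=8)
    with open(path, "rb") as f:
        f.seek(audio_offset)
        remaining = audio_length
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:  # file is shorter than expected
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()


def relocate_if_known(records, path, audio_hash):
    """Update the location of the single record matching audio_hash.

    records is a list of dicts (hypothetical stand-in for the database).
    Returns the updated record, or None if there is no unique match.
    """
    matches = [r for r in records if r["audio_hash"] == audio_hash]
    if len(matches) != 1:
        return None
    matches[0]["location"] = path  # history fields stay untouched
    return matches[0]
```

The cost is dominated by reading the whole audio payload from disk, which is what makes this slow for large files.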

Now to my problem: it's too slow. I have many audiobook files in m4b format, most of them bigger than 200MB, and hashing a file like this takes long enough that indexing a whole library feels too slow to me.

So I thought about the following alternatives to improve that:

  - Hashing only a fixed-length part of the file (e.g. 5MB around the midpoint, because intros and outros are often the same)
  
  - Hashing a part sized as a percentage of the audio data (e.g. 5% of the audio data size)
  
  - Combining one of these "partial" hashes with a size check (e.g. hash=0815471 + size=8340485 bytes, because a collision of both hash and size may be less likely)
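As a rough illustration of the first and third alternatives combined, the following Python sketch hashes a fixed window around the midpoint of the audio data and pairs the digest with the audio length. It assumes the audio offset and length are known, uses `hashlib.blake2b` as a stand-in for xxhash, and the function name and parameters are hypothetical.

```python
import hashlib


def partial_fingerprint(path, audio_offset, audio_length, window=5 * 1024 * 1024):
    """Return (audio_length, digest) for a window centered on the audio midpoint.

    Only `window` bytes are read, so the cost no longer grows with file size.
    blake2b is a stand-in for xxhash here.
    """
    window = min(window, audio_length)  # short files: hash all audio data
    start = audio_offset + (audio_length - window) // 2
    h = hashlib.blake2b(digest_size=8)
    with open(path, "rb") as f:
        f.seek(start)
        h.update(f.read(window))
    return audio_length, h.hexdigest()
```

Note that the size in this key is the length of the audio data, not the file size, since the file size changes whenever the metadata is edited. Two files with the same audio length that differ only outside the sampled window would still produce the same key, so this sketch treats a match as a candidate rather than as proof of identity.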

It feels like that won't work. So I ask HN:

  Would one of these alternatives be enough to avoid collisions? 

  If so, what would be a "sufficient" part of the file and which alternative is the best?

Thank you

[1] https://github.com/sandreas/tonehub
