Monday, January 4, 2010

An email about identifying duplicate MP3s (etc)

How do I find duplicate music files?

Here's an idea for finding duplicate music files at the album level, that is, when the MP3s are organized from the start by album.

Order each album's tracks by run time -- how many minutes and seconds they run. Using the run time accounts for the fact that there can be several versions of a song, in case you want to keep the versions (like single vs. album). It would also work across formats, like MP3 vs. WMA vs. FLAC.

Compute the ratios (within some accuracy) of one to the next, and turn that into a hash that serves as a unique key for the album: #1 is 1.3 times as long as #2, which is 1.13 times as long as #3, which is 1.4 times as long as #4, etc. Using the ratios might help with errors in estimating the run time; the times would be off systematically, we hope, and an error that scales every track by the same factor cancels out of a ratio. It might be best to compute the hash from just the 3 or so longest tracks, to avoid accumulating too much error. Experimentation would show that, maybe.
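Here's a minimal Python sketch of that ratio-hash idea. The function name `album_signature`, the `top_n` and `decimals` parameters, and the choice of SHA-1 are all my own assumptions for illustration, not anything tested on a real collection.

```python
import hashlib

def album_signature(durations_sec, top_n=None, decimals=1):
    """Hash the ratios between consecutive track lengths, longest first.

    durations_sec: positive run times in seconds, in any order.
    top_n: if given, use only the top_n longest tracks (hypothetical knob).
    decimals: rounding applied to each ratio, i.e. the "accuracy".
    """
    ordered = sorted(durations_sec, reverse=True)
    if top_n is not None:
        ordered = ordered[:top_n]
    # Ratio of each track to the next shorter one; always >= 1.0.
    ratios = [round(a / b, decimals) for a, b in zip(ordered, ordered[1:])]
    text = ",".join(f"{r:.{decimals}f}" for r in ratios)
    return hashlib.sha1(text.encode("ascii")).hexdigest()

# Same album, with every run time over-estimated by 2%: same signature.
original = album_signature([312, 240, 200, 150])
rescaled = album_signature([t * 1.02 for t in [312, 240, 200, 150]])
print(original == rescaled)
```

One weakness of rounding: a ratio that sits right on a rounding boundary can land on either side, so two copies of the same album could still get different hashes. Comparing the ratios with a tolerance instead of hashing them might turn out to be more robust.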

Given the hash, compare it to the hashes of other albums that have the same number of songs. If you find a match, keep the bigger set (by file size), hoping for higher quality. Or always keep FLAC, if you like FLAC.
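The matching step could look something like this sketch. The dictionary layout (`name`, `durations`, `total_bytes`) and the function names are hypothetical, and a plain tuple of rounded ratios stands in for the hash to keep it short.

```python
from collections import defaultdict

def ratio_key(durations_sec, decimals=1):
    """Rounded ratios between consecutive run times, longest first."""
    ordered = sorted(durations_sec, reverse=True)
    return tuple(round(a / b, decimals) for a, b in zip(ordered, ordered[1:]))

def pick_keepers(albums):
    """Group albums by (track count, ratio key); keep the largest of each group.

    albums: list of dicts with 'name', 'durations' (seconds), 'total_bytes'.
    Returns (names to keep, names that look like duplicates).
    """
    groups = defaultdict(list)
    for album in albums:
        key = (len(album["durations"]), ratio_key(album["durations"]))
        groups[key].append(album)
    keepers, duplicates = [], []
    for group in groups.values():
        best = max(group, key=lambda a: a["total_bytes"])
        keepers.append(best["name"])
        duplicates.extend(a["name"] for a in group if a is not best)
    return keepers, duplicates

albums = [
    {"name": "Album (MP3)", "durations": [312, 240, 200, 150], "total_bytes": 40_000_000},
    {"name": "Album (FLAC)", "durations": [240, 312, 150, 200], "total_bytes": 250_000_000},
    {"name": "Other album", "durations": [400, 200, 100, 50], "total_bytes": 30_000_000},
]
keepers, duplicates = pick_keepers(albums)
print(keepers, duplicates)
```

To always prefer FLAC, the `max` key could be swapped for one that ranks by format first and size second.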

Can you determine the MP3 run time in C# code? I dunno. I saw a couple of comments where people were struggling.

So figure it out by playing them! Write a hugely multi-threaded app that simultaneously plays as many MP3s (or FLACs, etc.) as your computer will manage. Use some tag-writing code to stash the run time in the ID3 tag, so you only have to play each file once. Do the math: if you get (pick a number) 100 players going at once -- can you get through your collection before the heat death of the universe?
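To actually do that math, here's a back-of-the-envelope sketch in Python. The collection size (10,000 tracks) and average length (4 minutes) are made-up numbers, and it assumes every player runs in real time with no overhead.

```python
def hours_to_play_all(track_count, avg_track_sec, players):
    """Wall-clock hours to play every track once, `players` at a time."""
    total_sec = track_count * avg_track_sec
    return total_sec / players / 3600

# Hypothetical collection: 10,000 tracks averaging 4 minutes each.
print(hours_to_play_all(10_000, 240, 100))  # 100 simultaneous players
print(hours_to_play_all(10_000, 240, 1))    # one player, for comparison
```

Under those assumptions, 100 players get through the collection in under 7 hours, versus roughly 28 days for a single player. So the universe is safe.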

This might be helpful. This was impressive, but probably not so helpful, except conceptually maybe. Here's a way to play 'em. This enlists Windows Media Player to get the length.
