Ruby Script: Catalog
A couple of years ago, I bought a 500 GB hard drive for file backups. I had started ripping MP3s from my CD collection and storing all of my digital photographs, and I didn’t want to lose any of them. Plus, I had years’ worth of code files in various languages scattered hither and yon. Finally, the motherboard on my old Sony Vaio laptop burned to a crisp, orphaning its 60 GB drive and all of its data. So, with all these files everywhere, including some in temporary backup space on my work computer, I bought the 500 gigger to store them all.
When I stored them, I essentially dumped them in a tangled mass of folders. I have a folder named sonydisk for files taken from the Vaio’s orphaned hard drive. I have a folder named hpdisk for files taken from an older HP laptop. I have a folder named CRoot, for C:\, for files taken from yet another system. (The two laptops were running Linux.)
So, it is very possible that there are duplicate files on my backup drive. Plus, there may be different versions of the same file floating around as well. So, how do I clean up this mess?
I had written a Ruby script called fsynch.rb that walks through a folder tree on two drives and copies any missing or outdated files and folders to the destination drive. I used this script periodically to copy music files from both “My Music” (originally ripped via RealPlayer) and “iTunes Music” to E:\Music. I used the core of this script to build a script called catalog.rb, which walks through a folder tree and looks for duplicate filenames.
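The original catalog.rb isn't reproduced here, so what follows is only a minimal sketch of the idea, assuming a duplicate means "the same base filename in more than one folder". The method names catalog and duplicates are made up for this illustration, and it leans on Ruby's standard Find module to do the tree walk.

```ruby
require 'find'

# Hypothetical sketch, not the original catalog.rb.
# Walk the tree under root and map each base filename to the folders holding it.
def catalog(root)
  folders_by_name = Hash.new { |hash, name| hash[name] = [] }
  Find.find(root) do |path|
    next unless File.file?(path)
    folders_by_name[File.basename(path)] << File.dirname(path)
  end
  folders_by_name
end

# Keep only the filenames that turn up in more than one folder.
def duplicates(folders_by_name)
  folders_by_name.select { |_name, folders| folders.size > 1 }
end

# Print each duplicate filename followed by its folders,
# but only when a root folder is given on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  duplicates(catalog(ARGV[0])).sort.each do |name, folders|
    puts name
    folders.sort.each { |folder| puts "  #{folder}" }
  end
end
```

Saved under a name of your choosing and run as ruby catalog_sketch.rb E:/Music, it would list every repeated filename with the folders it lives in. It compares names exactly, so Song.mp3 and song.mp3 count as different files.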
The results were impressive. It scanned through over 65,000 file listings in over 4,000 folders in 6 seconds. When I ran it against E:/Music alone, it found 3,722 files in 452 folders. It found numerous duplicate file entries. When I listed the duplicate folders for each of these entries, I noted studio and live album versions, and remixes, but I also found some ripping and folder naming errors. For example, I found duplicate MP3s in E:/Music/Out Of Time and in E:/Music/R.E.M./Out Of Time. I also found some “Untitled -
So, now that I can find the duplicate filenames, I can make some decisions and prune some dead, redundant, erroneous folder trees on my backup drive.
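A matching filename doesn't say whether two files are true copies or different versions, so one way to inform the pruning is to compare contents. This is a sketch of that idea and not part of catalog.rb: group_by_content is a made-up name, and it assumes a SHA-256 digest from Ruby's standard Digest library is a good enough test for "same bytes".

```ruby
require 'digest'

# Hypothetical helper, not part of catalog.rb.
# Takes the paths that share one filename and groups them by a SHA-256
# digest of their contents. Each returned group holds byte-identical files.
def group_by_content(paths)
  paths.group_by { |path| Digest::SHA256.file(path).hexdigest }.values
end
```

If every path for a filename lands in a single group, the extras are plain copies. More than one group means different versions are floating around and deserve a closer look before anything gets deleted.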