Monday, February 15, 2010

Losing My Memories

Back at the beginning of January a horrible thing happened. It was something that a lot of people fear in this day and age, but which few really believe will happen to them. It happened to me though, and I had to find a way to recover from it. Yes, I lost all of my digital photographs.

The complete details of how it happened are not terribly germane to this post, but the short version is this: while I was moving to a new computer, during that brief period when a lot of this data existed only on my backup disk, a Windows installer decided it would like to reformat that backup disk for me.

Recovering data from a reformatted drive can be tricky. Without the original filesystem information you need some special tools to even find old files, let alone reassemble them into something recognizable. But, with a bit of work I managed to get all of my images back, and this is the story of how I did that.

The whole recovery story started off with a stroke of luck. I happened to mention the demise of all of my photographs to a friend of mine, and he just happened to know of an incredibly useful tool for recovering my data. He pointed me toward TestDisk by Christophe Grenier. TestDisk is rather badly named I think, because testing is the least of what it can do. One of the key features that made my life far, far easier is its ability to do file type recognition when recovering files.

When the filesystem information from a disk is lost, even if you're able to recover the files themselves, you can't always recover the file names; often that information is gone forever. That means recovered files typically wind up with some sort of coded name (usually just a number generated by the recovery program). If you're recovering a very large disk, you can wind up with literally millions of files with completely nondescript names, and it would be completely impractical to sort through an entire disk's worth of them looking for the pictures.

Fortunately, TestDisk's ability to recognize file types based on the data in the file, rather than the file name, meant that I could tell it to recover only the JPEG images from the disk. This way I wound up with a set of files where I definitely knew the type of each and every file. And it just so happens that all of the digital cameras I've owned save their images as JPEG.
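The trick behind this kind of recognition is simple: many file formats begin with a fixed byte signature, so a recovery tool can identify a file's type from its contents alone (in the TestDisk package this carving work is actually done by its companion tool, PhotoRec). As a rough sketch of the idea in perl, and nothing like the real implementation: JPEG data always starts with the SOI marker FF D8 followed by FF.

```perl
#!/usr/bin/perl
# Sketch of signature-based type detection: identify a JPEG from its
# leading bytes rather than its (possibly lost) file name.
# An illustration of the idea only, not TestDisk's actual code.

use strict;
use warnings;

# JPEG data begins with the SOI marker FF D8, followed by FF
sub looks_like_jpeg {
    my( $data ) = @_;
    return substr($data, 0, 3) eq "\xFF\xD8\xFF";
}

# check a file on disk by reading just its first three bytes
sub file_looks_like_jpeg {
    my( $path ) = @_;
    open( my $fh, '<:raw', $path ) or return 0;
    my( $got ) = read( $fh, my $header, 3 );
    close( $fh );
    return (defined $got && $got == 3 && looks_like_jpeg($header)) ? 1 : 0;
}
```

A real carver does much more than this (it also has to find where each file starts in the raw data and decide where it ends), but the signature check is the heart of the type recognition.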

I knew I was still going to have a problem, though. This was my backup disk, which held not only my Aperture database but also all of my Time Machine data (Time Machine being the MacOS backup tool). Searching for every JPEG on the disk would therefore recover not just the pictures in my Aperture database, but also my entire web browser cache, plus every other little JPEG stored on the disk as part of various applications. When the recovery ran, I ended up with a bunch of folders holding a little under 35,000 pictures. Now what?

Well, the first thing I did was try to eliminate any duplicate images. Even though that would be a fairly simple script to write, I always google for these sorts of tools before writing them myself. Usually someone else has already written and posted the thing I need, and often it's better than what I would have written on the first try. This was just such a case: I found a great little perl script that would search for and remove all the duplicate images.
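The script I found did the job, so I won't reproduce it here, but the core technique is worth sketching: take a digest of each file's contents and treat any two files with the same digest as duplicates. A minimal version using the standard Digest::MD5 module might look like this (the source directory is just my recovery folder; adjust to taste):

```perl
#!/usr/bin/perl
# Minimal duplicate-file finder: group files by an MD5 digest of their
# contents; any file whose digest has been seen before is a duplicate.
# A sketch of the general technique, not the script mentioned above.

use strict;
use warnings;

use Digest::MD5;
use File::Find;

my( %seen );    # digest => first path seen with that digest

sub check_file {
    my( $file ) = $File::Find::name;
    return unless -f $file;

    open( my $fh, '<:raw', $file ) or return;
    my( $digest ) = Digest::MD5->new->addfile($fh)->hexdigest;
    close( $fh );

    if( exists $seen{$digest} ) {
        print "duplicate: $file (same as $seen{$digest})\n";
        # unlink $file;    # uncomment to actually delete duplicates
    } else {
        $seen{$digest} = $file;
    }
}

my( $source_d ) = '/Users/matt/Desktop/Recovery/jpg/';
find( \&check_file, $source_d ) if -d $source_d;
```

Hashing reads every byte of every file once, which is slow on tens of thousands of images but still vastly faster than comparing each pair of files directly.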

That got me down to a little over 20,000 images. Still a lot, but far fewer than I had before.

The next step was to try to separate the original files downloaded from my cameras from all of the other random images. For that, I did write my own script. It scanned through all of the images to extract the original date/time from the Exif data, reorganizing the images into directories by the day each picture was taken. If an image had no original date in its Exif data, or no Exif data at all, I assumed the file was not a photograph (or not one of my photographs) and put it off in a separate directory to sort through manually later.

Here's the script I used:

#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use Date::Parse;
use File::Find;
use Image::ExifTool qw(:Public);
use POSIX qw(strftime);

my( $source_d      ) = '/Users/matt/Desktop/Recovery/jpg/';
my( $base_dest_d   ) = '/Users/matt/Desktop/Recovery/jpg-sorted/';

my( $dir_date_format  ) = '%F';
my( $file_date_format ) = '%Y%m%d-%H%M%S';

my( $nodate_i, $nodate_d ) = (0, 0);

if( ! -d $base_dest_d )           { mkdir $base_dest_d; }
if( ! -d $base_dest_d.'NoDate/' ) { mkdir $base_dest_d.'NoDate/'; }

sub wanted {
    my( $source_file ) = $File::Find::name;
    my( $source_date, $dest_d, $target_f );
    
    unless( -f $source_file ) { return; }
    unless( $source_file =~ /\.jpg$/i ) { return; }  # match .jpg or .JPG
    
    
    my( $info ) = ImageInfo($source_file);
    if( $info->{DateTimeOriginal} ) {
        $source_date = str2time($info->{DateTimeOriginal});

        $dest_d = $base_dest_d .
            strftime($dir_date_format, localtime($source_date));

        $target_f = strftime($file_date_format, localtime($source_date));

        # in addition to naming the file by date, give the image a two-digit
        # index that advances if there is more than one image with the same
        # date+time; keep the .jpg extension on the new name
        my( $target_i ) = '00';
        $target_i++ while -f $dest_d.'/'.$target_f.'-'.$target_i.'.jpg';

        $target_f = $target_f.'-'.$target_i.'.jpg';

    } else {
        # images with no date/time get put into subdirs, 100 images per
        # directory to keep the directory from getting too large
        $nodate_i++;
        if( $nodate_i > 100 ) { 
            $nodate_i = 1;
            $nodate_d++;
        }
        while( length($nodate_d)<3 ) { $nodate_d = '0'.$nodate_d; }
        $dest_d = $base_dest_d . 'NoDate/' . $nodate_d;

        $target_f = $_;

    }

    if( ! -d $dest_d ) {
        mkdir $dest_d or die "failed to create dest dir $dest_d: $!";
    }

    my( $final_file ) = sprintf( "%s/%s", $dest_d, $target_f );
    printf "%s: %s: %s\n",
        $_, $info->{DateTimeOriginal} || 'NoDate', $final_file;

    link( $_, $final_file ) or die "failed to link files $_:$final_file: $!";
}

find(\&wanted, $source_d );

This has left me with about 7,300 images sorted into directories by the date each picture was taken, and about 13,600 in directories of images with no known shoot date. This is far more manageable! I'll probably still wind up doing a bunch of manual sorting of the images that are left, but the task is now much more approachable than it was at the beginning. It's also possible I'll find some other useful piece of Exif data to sort them by.
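For instance, Image::ExifTool can usually report which camera produced a file via the Exif Model tag, even when the date is missing. Something like the following (hypothetical, and untested against my actual archive) could bucket the remaining images by camera, which would at least separate my own cameras' output from stray cached web images:

```perl
#!/usr/bin/perl
# Sketch: bucket images by camera model from the Exif data instead of by
# date. The directory-naming scheme here is an assumption, not part of
# the sorting script above.

use strict;
use warnings;

# turn a raw Model tag value into a safe directory name
sub sanitize_model {
    my( $model ) = @_;
    return 'UnknownCamera' unless defined $model && length $model;
    $model =~ s/[^\w-]+/_/g;    # replace anything filesystem-unfriendly
    return $model;
}

# look up the Model tag with the same module the sorting script uses
sub model_dir_for {
    my( $file ) = @_;
    require Image::ExifTool;
    my( $info ) = Image::ExifTool::ImageInfo( $file, 'Model' );
    return sanitize_model( $info->{Model} );
}
```

Dropping `model_dir_for` in place of the date-based destination logic in the script above would be a small change, since the ImageInfo call is already there.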
