
Reliably uploading large archives to Amazon S3 and Glacier (long)



<p>It's best to have several layers of backup for important images, ranging from an online copy, to a nearby offline copy, to offsite copies (perhaps on DVDs), to copies in the cloud. The idea is to protect against increasingly rare but geographically wide disasters with increasingly delayed, but reliable, retrieval. For example, it may take hours to retrieve the offsite backups, or days to download the cloud backups, but a disaster (such as a building explosion, as just occurred in New York) that would necessitate going to them is extremely rare, so the time isn't a problem.</p>

<p>I've been using S3 for several years as my backup of last resort, but recently decided to move those archives to Glacier, which is slower and potentially costlier to retrieve, at one-third the storage cost of S3. This has prompted me to rethink how I structure and upload the archives.</p>

<p>I'm concerned only with older images on which I no longer work and for which retrieval time is irrelevant, since the chances of needing to retrieve are extremely small. (As I said, I have other backups immediately accessible.)</p>

<p>To make the discussion concrete, suppose you have 10,000 images from 2005. You could just use a backup utility to copy them directly to Glacier. I haven't used any of these, but I see from searching the internet that several are available. The problem with doing it that way is that you need to ensure that all 10,000 have been copied perfectly, and you want to do that independently of the backup utility (which could have bugs), as a double-check. Amazon keeps an MD5 checksum for each image, but comparing 10,000 checksums with the original files and re-uploading the bad ones is a huge amount of work, and I know of no utility that automates the job.</p>
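<p>(For anyone who wants to see what automating that per-file check might look like: below is a minimal, hedged Python sketch using boto3. It assumes the images were uploaded as ordinary single-part objects, so each ETag is just the file's MD5; the bucket name, prefix, and local folder are hypothetical placeholders.)</p>

<pre>
# Hedged sketch: compare local MD5s against S3 ETags, assuming single-part
# uploads (multipart ETags are computed differently -- see later in the post).
# Bucket, prefix, and local folder are hypothetical placeholders.
import hashlib
import pathlib
import boto3

s3 = boto3.client("s3")
bucket = "my-backup-bucket"                        # hypothetical
prefix = "images/2005/"                            # hypothetical
local_root = pathlib.Path("/Volumes/Photos/2005")  # hypothetical

def local_md5(path, chunk_size=8 * 1024 * 1024):
    """MD5 of a local file, read in chunks to keep memory use low."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# List every object under the prefix and flag any whose ETag doesn't match
# the MD5 of the corresponding local file.
mismatches = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip folder placeholder objects
        etag = obj["ETag"].strip('"')
        local_path = local_root / obj["Key"][len(prefix):]
        if not local_path.exists() or local_md5(local_path) != etag:
            mismatches.append(obj["Key"])

print("files to re-upload:", mismatches)
</pre>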

<p>So, what I do is archive each year's images into a ZIP file. You can use whatever utility you want for that; I use the one that comes with Mac OS X. This gives me files ranging from about 10GB to 30GB.</p>
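<p>(If you'd rather script the archiving step, here's a minimal sketch using Python's standard library instead of the built-in Mac OS X archiver; the paths are hypothetical.)</p>

<pre>
# Hedged alternative to the built-in archiver: build one year's ZIP with the
# standard library. Paths are hypothetical placeholders.
import shutil

# Creates /Volumes/Backups/2005.zip from everything under /Volumes/Photos/2005.
# (shutil.make_archive uses zipfile with ZIP64 enabled, so 10GB-30GB archives work.)
shutil.make_archive("/Volumes/Backups/2005", "zip", root_dir="/Volumes/Photos/2005")
</pre>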

<p>Again, since this is the backup of last resort, you have to check the ZIP file before you upload it, and that check should be independent of the utility that did the ZIPping. What I do is unzip the archive and then compare each extracted file with the original using a folder-compare utility (lots are available). (The verification built into ZIP utilities is inadequate, as it doesn't compare the extracted data with the original data.)</p>
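<p>(Here's a rough, hedged sketch of what that independent check looks like when scripted rather than done with a folder-compare utility: extract the archive to a scratch folder and compare every original byte-for-byte with its extracted copy. Paths are hypothetical.)</p>

<pre>
# Hedged sketch of an independent ZIP check: extract the archive and compare
# every extracted file byte-for-byte with the original. Paths are hypothetical.
import filecmp
import pathlib
import zipfile

archive = "/Volumes/Backups/2005.zip"                  # hypothetical
originals = pathlib.Path("/Volumes/Photos/2005")       # hypothetical
scratch = pathlib.Path("/Volumes/Scratch/2005_check")  # hypothetical

with zipfile.ZipFile(archive) as zf:
    zf.extractall(scratch)

bad = []
for original in originals.rglob("*"):
    if original.is_file():
        extracted = scratch / original.relative_to(originals)
        # shallow=False forces a byte-for-byte comparison, not just a stat check.
        if not extracted.exists() or not filecmp.cmp(original, extracted, shallow=False):
            bad.append(str(original))

print("mismatched or missing files:", bad)
</pre>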

<p>Now, given that the ZIP file is known to be good, you need to upload it. Few uploading utilities work directly with Glacier, so, instead, I set my S3 bucket to transition files automatically to Glacier (it's a lifecycle option you set on the Amazon S3 management site), and then I upload to S3. I set the transition time period to zero, so that there's no charge for the time the file sits on S3 prior to going to Glacier.</p>
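<p>(The same lifecycle rule can also be set programmatically; below is a hedged boto3 sketch of a rule that transitions objects to Glacier after zero days, with a hypothetical bucket name and prefix.)</p>

<pre>
# Hedged boto3 equivalent of the console setting: transition objects under a
# prefix to Glacier immediately (0 days). Bucket and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",                      # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": "archives/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
</pre>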

<p>No matter how fast your internet connection, the upload is going to take hours, even days. Nearly all upload utilities I looked at did the upload as a single operation, including the ones provided by Amazon. If there's an error, or the utility or computer crashes, you have to start over. I thought that Amazon's command-line utility would be the most reliable, but I never once got an upload to complete without the utility exiting with an error after a few hours. The well-known Mac app Transmit wouldn't even deal with the large files. A couple of free or cheap uploading apps just crashed right away.</p>

<p>One utility I found that works is BucketExplorer, which seems to cost $50. (I also wrote my own utility that runs as a Chrome App, but I'm not yet ready to make it available.) BucketExplorer divides big uploads into 32MB parts (probably adjustable), and then stores the upload queue on disk, along with the status (successful or pending). For a file I'm uploading now, there are about 740 parts. You can kill the app at any time, and, when restarted, it will pick up when it left off. Last night, while watching a movie on Netflix, I paused the upload to free up bandwidth, and restarted it when the movie was over. Very convenient!</p>
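<p>(The resumable idea itself is simple enough to sketch. Below is a rough, hedged Python illustration of the same approach: 32MB parts sent through S3's multipart API, with progress saved to a small JSON file so the job can be killed and resumed. This is not BucketExplorer's code; the bucket, key, and paths are hypothetical.)</p>

<pre>
# Hedged sketch of a resumable multipart upload: 32 MB parts, progress
# persisted to a JSON file after each part so the job survives a crash.
# Bucket, key, and paths are hypothetical placeholders.
import json
import os
import boto3

PART_SIZE = 32 * 1024 * 1024  # 32 MB parts, like BucketExplorer's default

def resumable_upload(path, bucket, key, state_file):
    s3 = boto3.client("s3")

    # Load the persisted state, or start a new multipart upload.
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)
    else:
        resp = s3.create_multipart_upload(Bucket=bucket, Key=key)
        state = {"upload_id": resp["UploadId"], "parts": {}}

    size = os.path.getsize(path)
    total_parts = (size + PART_SIZE - 1) // PART_SIZE

    with open(path, "rb") as f:
        for part_number in range(1, total_parts + 1):
            if str(part_number) in state["parts"]:
                continue  # this part finished on a previous run
            f.seek((part_number - 1) * PART_SIZE)
            data = f.read(PART_SIZE)
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=state["upload_id"],
                PartNumber=part_number, Body=data,
            )
            state["parts"][str(part_number)] = resp["ETag"]
            with open(state_file, "w") as sf:
                json.dump(state, sf)  # persist progress after every part

    parts = [{"PartNumber": int(n), "ETag": e}
             for n, e in sorted(state["parts"].items(), key=lambda kv: int(kv[0]))]
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=state["upload_id"],
        MultipartUpload={"Parts": parts},
    )

# Hypothetical usage:
# resumable_upload("/Volumes/Backups/2005.zip", "my-backup-bucket",
#                  "archives/2005.zip", "2005.upload-state.json")
</pre>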

<p>Again, keeping with the policy of independent double-checks, you don't want to rely only on BucketExplorer to verify the upload. What I do is get the ETag as reported by Amazon after the file has been uploaded, which for a file uploaded in one piece is simply an MD5 sum of the file's contents. For a multipart upload, however, the ETag is calculated differently, and you can't use an MD5 utility (such as the one that comes with Mac OS X) directly on the file. For multipart uploads, you have to take the MD5 sum of each part, and then take the MD5 sum of the collection of those sums. I know of no utility to do this, but if you Google "md5 sum of S3 multipart" you'll find some posts on stackoverflow.com that explain how to make the calculation. This is troublesome, I know, but it's worth it for long-term storage of an archive.</p>
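<p>(For reference, here is a short, hedged Python version of the calculation those Stack Overflow posts describe: MD5 each part, then MD5 the concatenation of those binary digests, and append a dash and the part count. The part size has to match whatever the upload utility used; 32MB is assumed here to match BucketExplorer, and the path is hypothetical.)</p>

<pre>
# Hedged sketch of the multipart-ETag calculation: MD5 each part, then MD5 the
# concatenated binary digests, and append "-" and the part count. The part
# size must match what the uploader used (32 MB here). Path is hypothetical.
import hashlib

def multipart_etag(path, part_size=32 * 1024 * 1024):
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 1:
        # A file uploaded in a single PUT gets a plain MD5 ETag, no suffix.
        return digests[0].hex()
    combined = hashlib.md5(b"".join(digests))
    return "{}-{}".format(combined.hexdigest(), len(digests))

# Compare the result with the ETag Amazon reports for the uploaded object
# (strip the surrounding quotation marks from the ETag first).
print(multipart_etag("/Volumes/Backups/2005.zip"))  # hypothetical path
</pre>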

<p>I hope you've found the above helpful. I'd appreciate hearing about your own methods for uploading files to S3 and Glacier reliably, and also whether you've come up with any better methods for independently checking ZIP files and uploads than the ones I've adopted.</p>


<blockquote>

<p>No matter how fast your internet connection, the upload is going to take hours, even days. Nearly all upload utilities I looked at did the upload as a single operation, including the ones provided by Amazon. If there's an error, or the utility or computer crashes, you have to start over. I thought that Amazon's command-line utility would be the most reliable, but I never once got an upload to complete without the utility exiting with an error after a few hours. The well-known Mac app Transmit wouldn't even deal with the large files. A couple of free or cheap uploading apps just crashed right away.</p>

</blockquote>

<p>I found the above the most helpful. It confirmed for me that I should just continue backing up the way I've been doing it for the past 7 years, which has been to clone my entire HD twice, to two external hard drives, maybe once or twice a year.</p>

<p>It's been working great so far. Thanks for the time and effort on relaying your own experience on this subject.</p>



<p>Thank you Marc for that insightful and thorough post. I started in 1997 with burning CDs and still continue with optical disc burning today. My second medium is a clone of hard drives, as well as running SyncBack to a NAS. My third medium is the cloud, where I use two services: CrashPlan and (about to start with) MS OneDrive.<br>

After reading the <a href="https://www.eff.org/encrypt-the-web-report">Electronic Frontier Foundation’s Encrypt the Web Report</a>, I'm hesitant to use Amazon.</p>



<blockquote>

<p>@Tim: No cloud for you? ;-)</p>

</blockquote>

<p>On a 3 Mbps AT&T U-verse $30/month subscription, it'll take more than a day to upload a clone of my hard drive. And since you said internet speed doesn't matter, I'm not going to upgrade my internet package. AT&T keeps raising the price with nothing extra to show for it, and then I have to call their retention dept. to get it back down to where it was. Not a pleasant experience.</p>

<p>YES! NO CLOUD FOR ME!</p>


>>> YES! NO CLOUD FOR ME!

 

Me neither. I'm fine with, and much prefer, employing and taking responsibility for my own offsite backup system. With 100K+ images over the years, that works great for me.

 

Also... Though I'm confident Amazon will not go under anytime soon, a couple of online data backup companies have gone out of business in the past, leaving their customers in the lurch.

www.citysnaps.net

Here's my personal backup strategy...

 

I have my primary set on a RAID 5 array.

 

Every night I use a utility to copy from the primary array to another RAID 5 array.

 

I use a cloud storage service to copy the photos to the cloud.

 

Once a year I copy all the photos to a new set of hard drives and store them at one of my kids' houses, about 8 miles away.

 

In the event of a disaster, the once-a-year copy at my kid's house eliminates the issue of having to do a mass download of everything.


<p>Thanks, Marc. I'm researching free and cost effective cloud storage as an emergency backup. I've recently noticed a few DVDs I burned 10 years ago have gone bad. Burglary and theft are serious concerns in my area. I have few options for offsite storage of my own hard drives. And my budget is squeaky tight. So the free storage options offered by Amazon (for Prime subscribers) and Google are potentially useful to me, if only for last ditch backups of edited JPEGs.</p>

<p>The main limitation I've found with Amazon is lack of syncing. However, it does appear to prevent inadvertent or deliberate uploads of redundant files (not sure if that's based on filenames or other data). For now I'm using it only for scans of my older film photos and darkroom prints, and for raw and edited files from a couple of documentary projects. It would be too cumbersome for the roughly 2-3 TB worth of photos I've accumulated over the past decade. On the plus side, Amazon does accept all of my raw files: Nikon NEFs, Ricoh DNGs, and Fuji RAFs.</p>

<p>I've also been trying the Google+ backup, which is unlimited for files less than 2048 pixels in dimension. That's pretty close to the maximum size of my old Nikon D2H 4MP files anyway. However, I'm having difficulty getting it to sync with my desktop PC. I might try switching from the wifi doodad back to a hardwired connection to see if there's some timeout error.</p>

<p>Regarding encryption, I'm not concerned. These photos are intended to be seen, eventually. There's nothing that should interest the gummint. I might be concerned if I was doing photojournalism or documentary photography or reporting in sensitive areas or conflict zones, but I'm not. My primary concern is redundant backups.</p>


<p>@Lex: I, too, don't encrypt the files. I'm afraid that 10 or 20 years from now the key will be lost or the app that encrypted them won't be available. (There's nothing of commercial value, and nothing personal that I'm concerned about. Lots of nudity, true, but the models are infants.)</p>

<blockquote>

<p>Thanks, Marc. I'm researching free and cost effective cloud storage as an emergency backup.</p>

</blockquote>

<p>You're probably aware, Lex, but for a $9/month subscription to MS Office 365, you get unlimited storage with any file type and file size on MS OneDrive. I'm having good luck with SyncBack Free to sync local folders with OneDrive.</p>

<blockquote>

<p>But, inasmuch as this is my third or fourth level of backup, there's no loss other than the inconvenience of parking the data someplace else.</p>

</blockquote>

<p>Precisely. Time and internet speed are irrelevant when this is your third or fourth tier of backup and it just trickles away unnoticed in the background. It took 14 months for my computer to back up to CrashPlan. But oddly, once it was completed, I finally felt safe.</p>


<p>Marc,<br>
Re: "For multipart uploads, you have to take the MD5 sum of each part, and then take the MD5 sum of the collection of those sums."</p>

<p>That method will give you an MD5 hash of the collection of individual hash values, and it will not be the same hash as an MD5 of the original binary.</p>

<p>You would have to have a hash utility that could sequentially feed the "seed" state left by hashing the previous part (taking each partial binary file in order) into the hash operation for the current part, in order to generate the hash of the full intact binary.</p>
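<p>(A hedged Python illustration of that point: feeding the parts, in order, into one running MD5 object yields the hash of the intact binary, with the running state acting as the "seed" carried into each part. That number is different from the MD5-of-MD5s that S3 reports as a multipart ETag. The path and part size are hypothetical.)</p>

<pre>
# Hedged illustration: the MD5 of the intact binary comes from feeding the
# parts, in order, into one running hash; the running state is the "seed"
# carried into each subsequent part. Path and part size are hypothetical.
import hashlib

running = hashlib.md5()
part_size = 32 * 1024 * 1024
with open("/Volumes/Backups/2005.zip", "rb") as f:  # hypothetical path
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        running.update(chunk)  # carries the hash state forward part by part

print(running.hexdigest())  # MD5 of the whole file
</pre>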

<p>I note from some blogs that if you move the S3 object within your cloud, then S3 will re-calculate the MD5 ETag. That may be your easiest solution.</p>


I'm personally a big fan of using S3 for arbitrary hosting of files and images. I figure if you're paying usage pricing you're very unlikely to have someone decide your service tier is unprofitable and cancel it later. I'd bet you could use Amazon's command line tools with some shell scripting to automate this process pretty easily. With any luck (and not using named pipes) you could probably even use Cygwin to get it working under Windows. Maybe I'll look at it over the summer.


<p>I've been backing up my photos to Glacier for 13 months now. Total charges for the first year were $68.50. Once I had it configured (Glacier and S3 aren't exactly intuitive), Arq worked on its own to upload my photos. I highly recommend Arq. The app is routinely updated and the developer is responsive to queries.</p>

<p>Using Uverse I initially uploaded about 440GB of images. I averaged .8GB (that's point 8) of data uploaded per hour, taking about 24 days.</p>

<p>I'm also experimenting with Backblaze to upload everything, not just my photos. I snagged a year of their service for next to nothing, so it was worth trying. But it will take almost a year just to upload all my data.</p>

<p>Since I have my images backed up to multiple HD's, one of which I keep in my office, I view Glacier as a backup of last resort.</p>

<p>With Google's Nearline and Amazon's Cloud Photo Storage in the mix, the cloud storage market for photographers became more interesting and confusing.</p>



<blockquote>

<p>With Google's Nearline and Amazon's Cloud Photo Storage in the mix, the cloud storage market for photographers became more interesting and confusing.</p>

</blockquote>

<p>Wonder what the future holds for this form of data archiving and whether upload speed and security will improve, considering the increase in traffic as more folks transfer ever more big data over our current overloaded internet infrastructure (we're not even fixing our 50-year-old bridges), with the future of net neutrality in question.</p>

<p>That doesn't even address what happens to this archived-to-the-cloud data: whether it's going to be around, or lost either by accident or by some user-agreement fine print that says the company hosting the cloud account isn't responsible for the data in the case of a buyout by another company. Or the monthly subscription price increases unreasonably, the company debits the credit card on auto pay, and the user is suddenly up to their eyeballs in debt.</p>

<p>And if you think I'm being unrealistic, you haven't been watching your cable and phone bills very closely when they're set up with auto pay to a credit card. Someone elderly and infirm having their bills paid this way will see an increase of as much as 30% unless they have someone call the company to protest and get the bill back down, like I have to do with AT&T and Time Warner Cable. My electric company doesn't treat me this way.</p>

<p>Makes me wonder how the cloud account providers will treat similarly infirm clientele. Or will they force them to go into debt up to their eyeballs when they get well and find they're blocked from accessing their data unless they pay up?</p>


<blockquote>

<p>Wonder what the future holds for this form of data archiving and whether upload speed and security will improve... </p>

</blockquote>

<p>It's obviously about "charging what the market will bear," but I still can't believe the high prices and limited data that the USA puts up with. Even in Canada, my cell phone is quicker, with more data, than what I've been hearing about for average connections in the USA. Anyways, just around the corner in a few years is LTE5, and the speeds (and radiation) are phenomenal. Back on the ground, I sure hope Google Fibre will change the internet for North America pretty soon.<br>
<a href="http://techcrunch.com/2015/04/04/the-cloud-could-be-your-best-security-bet/#.q119rf:4iNV">This TechCrunch article, The Cloud Could Be Your Best Security Bet</a>, brings up some great points.</p>


<p>Just an end cap "I told you so" on this topic.</p>

<p>Yesterday, as usual, my AT&T 3 Mbps U-verse Pro standalone package bill went from $31 to $47, and I had to call their rep to get it back down. The rep said there are no longer any discounts for the 3 Mbps Pro package (regular price $47), but they have the 6 Mbps Elite package (regular price now down to...you guessed it...$47) with a $23.50 discount, which brings it back down to $31 including the equipment fees and taxes.</p>

<p>I tried out the 6 Mbps speed and found no difference surfing the web and playing HD videos. Of course, the Ookla-driven speed test on AT&T's site shows I'm getting 7 Mbps download but 1.5 Mbps upload, still not fast enough for uploading gigabytes of data to a cloud account.</p>

<p>There ought to be a law against this kind of price manipulation for services that can't be readily seen or proven.</p>


  • 2 weeks later...
<p>For what it's worth ... I use Code 42's CrashPlan. For $4.95/month I get unlimited space, and their backup utility is configurable to the nth degree, so it time-slices as you want. I've also set it to never delete files, even when I delete them from my local drive. I've used it for years now and did a test once, a bit-level compare (using Beyond Compare) on a RAW file. No issues. It putters away in the background and I rest assured things are backed up. I keep a few local backups as well, and I run the NAS (RAID 0) via wifi to use CrashPlan, so I have a mirror of what they have. I live in an earthquake-prone area. Seems so cheap it's not even worth discussing.</p>

<blockquote>

<p>Seems so cheap it's not even worth discussing.</p>

</blockquote>

<p>Yep. And cloud is also the safest form of storage. Just let it trickle away in the background while continuing your other backup regimes and you'll eventually get there with a cloud copy.</p>

 


<p>@Marc: I use a program called Beyond Compare; it is produced by <a href="http://www.scootersoftware.com">www.scootersoftware.com</a>.<br>
I use it primarily for figuring out file and folder content differences, and I use it to mirror drives. It originated in the programming world, so you can quickly show the differences between lines of code. It works like a charm. It also does binary-level analysis on any type of file, and you can compare files side by side.<br>
So what I did was upload a RAW (NEF) file to CrashPlan, download it again using their restore process (simply click the file in the interface and it plops it from their server to your desktop), and then compare the original with the restored copy. No differences. Good enough for me.</p>

