Crashplan

[See Crashplan Part Two for the follow-up]

Three weeks ago I started to write a blog post about Crashplan. This is not how I expected it to turn out.

This is likely to be quite long, so I'll put the conclusions at the front, and then the information I've used to draw those conclusions follows.

If you're a Crashplan user (quite possibly because I've recommended it to you in the past) you need to be aware that:
  1. Previous versions of Crashplan have silently corrupted data that has been backed up.
  2. The team at Crashplan are aware of this. More recent versions of the software do not have this problem.
  3. However, more recent versions of the software do not fix, acknowledge, or in any way indicate that some of the files in the backup are corrupt.
  4. Crashplan support appear to be wholly unconcerned with this, in a manner that means I no longer have faith in the product or their support. I leave you to determine the course of action that's right for you.
With that out of the way, some background, and the events that led me to the four points above.

I've been an enthusiastic user of the Crashplan backup software for something like two and a half years. I forget how I found it -- probably some blog post or mailing list -- but it seemed to me to be a great example of software that just works. It was flexible enough to handle my backup needs, and easy enough to use that I recommended it to family, friends, and work colleagues. I'm a paying customer, and have purchased Crashplan licenses to give to other people as gifts to encourage them to back up their important data safely.

So for more than two years my main computer at home has been backed up using Crashplan, initially to a locally attached USB drive, and latterly also to a colleague who I convinced to run Crashplan for his backup needs.

One of Crashplan's more useful features is that the software will auto-update, prompting you when a new version is released. So during this period I've very closely tracked whatever the most recent version of Crashplan is.

A couple of weeks ago I purchased a new PC, and the plan was once I'd gone through the somewhat tedious business of reinstalling my software, restoring all my data, and so forth I was going to decommission the old one. To that end, once the new PC was up and running one of the first things I did was install Crashplan on the new PC, make sure the old PC was 100% backed up to the USB drive, and then plug the USB drive in to the new PC.

When you do this, Crashplan can "attach" to the backup. Even though the files in the backup weren't from the new PC I just had to enter the password for the backup so it could decrypt them and restore them to the new PC. I thought this would be the simplest (and probably fastest) way of migrating my data from the old to the new PC.

I let Crashplan chug along doing the restore, which took several hours because of the volume of data. And then, at the end of the process, I saw a warning that 140 files had failed the "integrity check" during the restore, and couldn't be properly restored. All of them were digital photos.

Now this is a bit odd. One of the things that the Crashplan team champion on the website is the following claim:
Once your files are backed up, CrashPlan continuously checks that your files are 100% healthy and ready to restore when you need them. If it finds any problems, CrashPlan fixes them.
Source: http://www.crashplan.com/consumer/features.html
For me, this is a big benefit.  One of the things you should do when backing up data is periodically try and restore it, to ensure that the backup is actually working. The fact that Crashplan tries to do this in the background was an important part of choosing the software.

Now I knew the backup was complete -- I'd verified it before I unplugged it from the old PC, so this is one of those things that should just never happen.

I sent an e-mail to the Crashplan support address. This generated ticket #20145 in their queue, and my message went like this:
Hi,
I'm migrating to a replacement PC. I decided to migrate my data across by plugging the external hard drive that the original PC backs up to using Crashplan+, and then restoring from that archive on the new PC, running version 3.8.2010. 
29,742 files restored correctly. 140 failed, listing in the History tab as - Integrity check failed for  
First, it would be very useful if I could cut/paste the contents of the History tab. It would make it much easier to figure out which files I'll need to copy over by hand. 
Second, and much more importantly, I'm very concerned by this. From http://b1.crashplan.com/consumer/features.html:

Once your files are backed up, CrashPlan continuously checks that your files are 100% healthy and ready to restore when you need them. If it finds any problems, CrashPlan fixes them. 
This does not appear to have happened. How do I find out what went wrong in this instance, and how do I fix it?
About 5h30m later (which is, by the way, fine, we're in very different time zones, so that sort of response time is not only perfectly acceptable it's probably above and beyond what I would normally expect) I get a reply from Renee at Crashplan, asking if I can send logs from the destination computer, and instructions on how to do that. I do so, and over the course of a few days (a short vacation intervened) I send logs from the source computer (i.e., the one that's been doing all the backups over the last few years) as well.

A day and a half after I send the necessary logs I get a reply from Bret at Crashplan. He says:
Unfortunately these logs don't point to a clear source of this error. A copy of the restored file was preserved with a modified name; it may be useful for you to review this modified file and let us know if the file that was restored appears to be correct or is non-functional. For example, the following file: 
C:/Documents and Settings/Nik Clayton/My Documents/My Pictures/2006/2006 07 14 All Things Gothic/IMG_2301.JPG 
was restored to the following location: 
C:\Users\nik\Documents\My Documents\My Pictures\2006\2006 07 14 All Things Gothic\restore.failed-checksum.IMG_2301.JPG 
Can you attempt to open this file and verify that it is a well-formed JPEG file?
I do some digging, and reply about five hours later, with:

It's not a valid file.  Windows Photo Viewer refuses to open it.

The restore.failed-checksum.* files have suspicious file sizes:

09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2296.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2297.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2298.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2299.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2301.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2302.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2303.JPG
09/09/2008  19:42           917,504 restore.failed-checksum.IMG_2305.JPG
09/09/2008  19:42           917,504 restore.failed-checksum.IMG_2306.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2307.JPG
09/09/2008  19:42           917,504 restore.failed-checksum.IMG_2308.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2309.JPG
09/09/2008  19:42           786,432 restore.failed-checksum.IMG_2310.JPG
09/09/2008  19:42           655,360 restore.failed-checksum.IMG_2311.JPG

They're all exact multiples of 1,024, and far too small.  Compare and contrast with the same files that I restored by direct sync from the source PC to the target PC:

09/09/2008  19:42         2,999,745 IMG_2296.JPG
09/09/2008  19:42         3,029,664 IMG_2297.JPG
09/09/2008  19:42         3,102,390 IMG_2298.JPG
09/09/2008  19:42         2,923,048 IMG_2299.JPG
09/09/2008  19:42         2,939,522 IMG_2301.JPG
09/09/2008  19:42         3,077,000 IMG_2302.JPG
09/09/2008  19:42         2,707,091 IMG_2303.JPG
09/09/2008  19:42         3,478,028 IMG_2305.JPG
09/09/2008  19:42         3,509,851 IMG_2306.JPG
09/09/2008  19:42         2,627,625 IMG_2307.JPG
09/09/2008  19:42         3,169,280 IMG_2308.JPG
09/09/2008  19:42         2,859,546 IMG_2309.JPG
09/09/2008  19:42         2,924,675 IMG_2310.JPG
09/09/2008  19:42         2,518,022 IMG_2311.JPG
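
The pattern in those truncated sizes is easy to check. The following is a minimal Python sketch that uses only the sizes from the first listing above; the variable names and the choice of 128 KiB as a candidate block size are illustrative assumptions, and nothing here says what that block size means inside Crashplan. It only confirms the arithmetic.

```python
# Sizes (in bytes) of the restore.failed-checksum.* files listed above.
failed_sizes = [
    786_432, 786_432, 786_432, 786_432, 786_432, 786_432, 786_432,
    917_504, 917_504, 786_432, 917_504, 786_432, 786_432, 655_360,
]

KIB = 1024
BLOCK = 128 * KIB  # 131,072 bytes; an assumed candidate, picked because it divides every size

# Check the claim from the email: every size is an exact multiple of 1,024.
all_multiples_of_1024 = all(size % KIB == 0 for size in failed_sizes)

# Check the stronger pattern: every size is a whole number of 128 KiB blocks.
all_multiples_of_128k = all(size % BLOCK == 0 for size in failed_sizes)
blocks_per_file = sorted({size // BLOCK for size in failed_sizes})

print(all_multiples_of_1024, all_multiples_of_128k, blocks_per_file)
# prints: True True [5, 6, 7]
```

So each failed file is 5, 6, or 7 blocks of 128 KiB, where the genuine photos are all between roughly 2.5 and 3.5 million bytes.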
It goes quiet for two days, and then Matt Genelin takes over the ticket, saying:
Let me step in here. Thank you for the log files. After checking with several engineers on our staff, our best causation of the 140 files missing / corrupt is as follows: 
The 140 files stored on your external hard drive are inaccessible because they were stored with an older version of the CrashPlan Application that has a known issue with incorrectly checksum-ing stored files in a backup archive. We have corrected this issue in the last 12 months, and the current version of the CrashPlan Client Application backs up files with the correct checksum information. 
Moving forward here, the best recommendation we can make is: 
1. Restore your complete archive from your other backup destination (I believe this is [redacted]).
(verify that your restore is successful.) Then proceed to step 2: 
2. Shutdown the CrashPlan Backup engine on [redacted] like this:
http://support.crashplan.com/doku.php/recipe/stop_and_start_engine 
3. Erase, delete or replace the backup archive that is stored on your external drive named "Folder: External 320G". Simply perform a file copy from [redacted] to your external drive. 
Please note that since your external drive was created on 12/12/2008 and your archive on [redacted] was created on 2/7/2009, you will loose any file version information that was made between December 2008 and Feb. 2009. 
4. Restart (start) the backup on [redacted], again:
http://support.crashplan.com/doku.php/recipe/stop_and_start_engine
If this seems like an unreasonable fix to this issue, please let me know.
"[redacted]" was the name of the remote destination I also back up to -- since it's a colleague's name I've removed it from the above.

I should mention at this point that none of my data has been irretrievably lost. My original PC is still here, and with some faffing around I can retrieve the missing files from it (or download them from the [redacted] offsite backup). But that is purely by luck. If all my backups had the same problem, which is not an unreasonable assumption, this data (140 digital photos) would have been lost forever.

I wasn't sure that I'd quite understood Matt correctly. In particular, with the reference to an older version of Crashplan I thought that perhaps he'd misunderstood, and assumed that the backup I was restoring from was only created with an older version of Crashplan. So we had the following exchange. First, me:

While it is the case that I first started backing up to "External 320G" using an older version of Crashplan, the Crashplan version has been regularly (auto)updated since then.  The specific set of steps I carried out to do the restore was:

1.  Power up PC #1 (runs XP SP3, Crashplan+, and is the machine that "External 320G" has been plugged in to for the last few years).

2.  Verify (through the Crashplan UI) that Crashplan thinks that the backup of PC #1 to "External 320G" is complete.  This is using the latest version of Crashplan (3.8.2010) because it auto updated earlier in the month.

3.  Power down PC #1, power off the external drive, power up PC #2 (Windows 7), plug the external drive in to PC #2 and power it up.  Install the latest version of Crashplan from crashplan.com, import the backup from the external drive, and attempt the restore.

That then generated the checksum errors for 140 files upon restore.

Are you saying that backups that were started with the older version of Crashplan may have this problem, and that simply using the newer version is not sufficient to correct the issue -- the corrupt backups need to be wiped, and the backup started afresh?

I've just reviewed the release notes going back to 12.10.2008, and don't see this mentioned.
Matt's reply:
Correct. The Backups being the backup archive on your External Drive. I am recommending: 
1. Verifying the [redacted] Backup.
2. Wiping the External Drive.
3. Coping the [redacted] backup archive over to the External Drive. 
Seem reasonable?
At this point I'm still not quite convinced that I have this right. In particular, he's not correcting my assertion that this is a problem they've known about, and fixed with no notice in the release notes, and no mechanism to fix existing-but-broken backups. After all, this is a company that sells backup software (and sells an optional service whereby they'll host external backups for you). They wouldn't be that cavalier about the integrity of their customers' data, would they?

So I replied:

Well, I don't need to do that, because I've moved the data from the old machine by other means -- restoring from the backup was (supposed) to be the simplest way to do this.

However, I want to make sure that I understand you correctly.  Are you saying that the following sequence of events:

1. Install Crashplan in 2008.
2. Tell Crashplan to backup to an external drive.
3. Let Crashplan autoupdate throughout 2008, 2009, and 2010, and continue to backup to the external drive throughout this period.

is sufficient to cause this corruption?  This was not an external backup that I created once using an old version of Crashplan, and then put away -- the external drive has been attached to this PC almost continuously, and Crashplan (from the earlier 2008 version to the most recent March 2010 version) has been backing up to it on pretty much a daily basis.

I must ask why Crashplan doesn't warn about this -- big red flashing letters saying "Warning: You created this backup with a version of Crashplan that had checksum errors.  You must delete this backup and start afresh".

Better still, why don't newer versions of Crashplan detect this and correct it automatically?  http://b4.crashplan.com/consumer/features.html is quite explicit:

Once your files are backed up, CrashPlan continuously checks that your files are 100% healthy and ready to restore when you need them. If it finds any problems, CrashPlan fixes them.

This does not appear to have happened here.

I'm very concerned that based on what I've been told so far it seems as though an older version of Crashplan corrupted my backup, you released a fixed version without noting the fix in the release notes, but the fixed version does not correct prior instances of the problem.

Right now I do not have a warm fuzzy feeling about continuing to trust Crashplan with my data.
Matt's reply:
Yes, that is correct. This is what I am stating here. 
I am also explaining that an older version of CrashPlan has a known issue -- that has been corrected in our newer versions of the CrashPlan Client. This known issue appears to have passed our nightly archive maint. check: 
And only appears when you attempt to restore files. 
Normally our website is correct; once you back a file up, there is no need to worry about your files. In your case, it appears from your archive that some of your files were backed up with a version of the CrashPlan Client with a known issue, and the newer versions CrashPlan Client's nightly archive maint. did not detect the problem in your archive. The problem here surfaced when you went to restore your external drive's archive, that is 99.996% fine, but 0.004% corrupted. 
I am suggesting a course of action that brings you back to 100% fine, and throws away the archive that is 0.004% corrupted. 


I can understand. Your feelings on CrashPlan are a conclusion you will need to come to on your own. 
Let's keep in mind the facts here: 
* Only one of our multiple-destination archives is having issues here.
* The one archive that has issues restored 29,742 correctly and failed to restore 140 files. That's a failure rate of 0.004%. 
I agree -- this is not perfect. Perfection would be 100% data recovery. This is why CrashPlan Allows you to backup to multiple destinations. You should be able to achieve perfection of recovery by using your second archive; on your [redacted] computer.
A couple of points here. Matt's skipped over my "Why doesn't Crashplan warn about this, and/or fix the problem automatically?" question. He also seems to think that you can quantify the effectiveness of a backup solution by dividing the number of files that failed to restore by the total number of files, as though that were a useful metric. That takes no account of the relative importance of the files (these were photos, and irreplaceable), nor of the absolute volume of data lost.
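
For what it's worth, the quoted percentage can be checked from the two numbers in the ticket. This is a quick Python sketch with illustrative variable names, assuming the total is the 29,742 files that restored correctly plus the 140 that failed:

```python
restored_ok = 29_742   # files restored correctly (from the ticket)
failed = 140           # files that failed the integrity check (from the ticket)

# Assumption: the archive total is simply the sum of the two figures above.
total = restored_ok + failed
failure_rate_pct = 100 * failed / total

print(f"{failure_rate_pct:.2f}% of {total} files failed to restore")
# prints: 0.47% of 29882 files failed to restore
```

On those numbers the failure rate is roughly 0.47%, not 0.004%, so the figure in the reply appears to be out by about two orders of magnitude.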

He also assumes that I can restore the files from the [redacted] site. While that may be possible (and I haven't tried, I haven't needed to) that backup was created by taking a copy of my local backup archive and giving it to my colleague, so it's entirely possible that that archive has the same problem as my local one.

And finally, the Crashplan site is quite explicit, "100% healthy and ready to restore". There's no equivocating around some-number-of-9s availability. They claim 100%.

My final message to Matt asked:
1.  Will the next release of Crashplan detect this problem and fix it.
If not, when will it be fixed? 
2.  Why wasn't this problem called out in any of the release notes for versions released after the problem was detected?

3.  Will you inform existing customers of this problem, and the need to wipe and restart existing backups if they're older than date ?
All, I think, reasonable questions, which are ducked in Matt's final reply.
It has been a pleasure working with you. It's clear that the technical recommendation I have made for you will correct the issue at hand here, and that the quoting of text back and fourth is leading our conversation in a circle. I want to bring you to a place that moves you forward, and the best way to do this is to end our conversation now. 
I believe I have answered your questions repeatedly, and your questions are deviating away from solving your technical problem. By closing this conversation, I am hoping that you will take my recommendation in good faith, and apply it to your unique situation to move your backups with CrashPlan forward.
Looking back through this discussion, those three questions are not answered.
  • There's no commitment that future versions of Crashplan will detect and fix this problem.
  • There's no answer as to why Crashplan weren't honest about this problem in the release notes of the software once they detected and fixed it.
  • And there's nothing to suggest that they'll inform existing customers of the problem.
So, if you're using Crashplan you should definitely make sure that your backup is 100% readable by the current client. And if it isn't, you'll need to wipe it and start the backup from scratch again.
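
One way to check this, independent of anything Crashplan reports, is to restore the backup to a scratch folder and compare it byte for byte with the originals. The sketch below is a minimal Python example under that assumption: it is not a Crashplan feature, the function names `sha256_of` and `compare_trees` are hypothetical, and it only works while you still have the original files to compare against.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original: Path, restored: Path) -> dict[str, list[str]]:
    """Compare every file under `original` with its counterpart under `restored`.

    Returns relative paths (POSIX style) that are missing from the restored
    tree, and paths whose contents differ from the original.
    """
    report: dict[str, list[str]] = {"missing": [], "mismatched": []}
    for source in sorted(p for p in original.rglob("*") if p.is_file()):
        relative = source.relative_to(original)
        candidate = restored / relative
        if not candidate.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(source) != sha256_of(candidate):
            report["mismatched"].append(relative.as_posix())
    return report
```

Calling `compare_trees(Path("originals"), Path("restored"))` (both folder names are placeholders) returns the two lists. Going by the file names in Bret's email, a file that fails the integrity check is restored with a `restore.failed-checksum.` prefix, so here it would most likely show up under "missing" rather than "mismatched".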

You might also want to start thinking about trusting your data to a different organisation; and in particular one that values honesty when it notices and fixes a mistake that leads to data loss.

Has anyone got any recommendations?

17 comments:

  1. Thanks for posting your experience... for a company who normally seems to be quite up-front about their shortcomings, the fact that they haven't publicized this is really surprising. It would seem to be in their best interest to do so. Much better that they tell everyone now vs. only telling people when they're in the nightmare scenario of wanting to restore their backup and finding out only then that it is corrupt.

    Very disappointing.

  2. I recently found Crashplan and I thought wow, this is perfect. After seeing your situation and Crashplan's less than forthcoming replies, I've pretty much lost faith in utilizing their product. Especially since you mentioned pictures, which, when it comes to my daughter, each one is priceless. If they had just said "yes, it was broke and this is how we fixed it, and this is how we made sure it doesn't happen again", it would be fine by me.

    Thanks

  3. Hey!! Thanks again for the tour last Friday, it was so fun!
    Believe it or not I read this whole thing and it was interesting...totally baffled about why this guy could not answer your question! It was completely relevant to keeping a crash plan customer. very weird.

  4. A reference to this has been posted in the CrashPlan user support forum.
    https://crashplan.zendesk.com/entries/140286-silently-corrupted-data

  5. Hi Nick, I sent you a private ticket with explanation & clarification on your post. Have you seen it? Sent it over a week ago on March 30th. Ticket #20145

  6. Point 1- we did not silently corrupt your data, we failed to heal around it. See our forum post for more information.
    Point 2- Isn't valid as point 1 isn't. We have made improvements to our healing process.
    Point 3- If you saying there are forms of corruption in an archive we cannot heal around, thats true. If you say we can't detect it, that's false. We do.
    Point 4- We absolutely care. I agree the tone of one of our engineers was not as concerned as it should have been. This engineer has been enlightened.

    There are a lot of technical details I want to share that dig deeper, I'll post in our forums later this week. My goal is to better explain what happened, what we do, and why we're so proud of our product.

    I haven't searched far and wide, but this is the only negative post I personally know of.

  7. After reading Matthew Dornquast's replies in the posts here: https://crashplan.zendesk.com/entries/140286-silently-corrupted-data

    I feel compelled to comment and say that although this blog is thorough and very informative it looks very much likely that the mis-communication you experienced with the crash plan support seems to be an unfortunate one.

    However,It seems to me, like any written conversation, this can be taken in numerous ways and the support you received may have been naive(and junior as mentioned in the replies stated above) and couldn't see that although you had it all backed up elsewhere and that it didn't theoretically matter. He couldn't see that you were concerned for future problems you may come across.

    I work with computers and quite often have to deal with numerous problems and placing calls and emails to many a support desk. I always find it difficult with emails and with some of the responses I get its fairly obvious to me that they do not understand what I am asking. Sometimes it takes a phone call to clear things up and even then it can be hard work as some support people can be very single minded and over complicate a situation. this does not mean they are not good at their job. and in most cases quite the opposite in fact.

    I'm sure there are numerous solutions to the problems you incurred via your email conversations. Perhaps the support contact should have passed you onto someone more senior(perhaps someone more senior should have been dealing with this all along, rather than (as what seems to have happened to you) passing you down the line.) maybe you could have requested a 2nd opinion or a new support contact.

    Maybe crashplan should consider setting up a Phone support system. (I know this would incur costs which would not be beneficial to the service. but if this could not be done then perhaps a support member could call a user to explain things clearer in a situation like this?)

    Life is full of ifs buts and maybes. Your blog is sharing an opinion of an unfortunate situation that seems to have occurred. I would like to think that you would read/respond to Matthew and update your blog accordingly.

    Although you maybe one of the first people to experience this situation and feel the need to share what you feel to be the first crack in a pretty solid looking wall so far it only takes a few cracks before the wall starts to crumble.

    Crash Plan seem to be dedicated to what they do and for a very reasonable price. It would be unfortunate for everyone involved if this spirals into something bigger than it seems to be. I'm sure you wouldn't want to be responsible for scaring people away from a service that encourages backing up your files after all this is something that everyone should do as I've heard and seen many people devastated by losing all their data, their last remaining memories/pictures of a loved one etc

    To Summarize
    I do understand your point and I do see that mistakes were made in the way this was dealt with. nevertheless I wouldn't want this blog to compromise peoples opinions on backup solutions as a whole.

    A backup is ESSENTIAL in any computer users life and the more people we can get doing this the better this will be for EVERYONE.


    Sorry for posting such a long comment and apologies for any misspelt words I was so passionate about this I have been writing this response for a while now and its 4:20am here.

    I look forward to a response.

  8. Nik, there has been a previous case of this type of CrashPlan restore failure - see this different support thread: https://crashplan.zendesk.com/entries/31616-failed-checksum-corrupt-restore

  9. I was excited to see unlimited data family plan for a reasonable price but then I noticed I would need to shell out EXTRA to get + features (backup continuously and restore from web). One would assume you should be able to restore anytime from the web. Why would I need to pay extra for this service?

  10. While I appreciate the detail in the post of this experience, overall it's a fairly useless representation of CrashPlan. I always find amazing how people make out that a backup plan of any kind is infallible. 'Got crucial files that you can't live without? 'Got only one fallback if things go awry? You'll find no shoulder of mine to cry on.

  11. Matt's failure rate calculation contains a fault in itself: 140 failed files out of 29.742 isn't a failure rate of 0.004%.

    The percentage need to be shifted by two positions. The failure is not the disappointing 0.004% but an even more disappointing 0.47%.

  12. I was going to write a long note aimed at why this blog should be softened and taken with a grain of salt but the anonymous 3 anon's above me seems to have already done that.

    As a Unix System Administrator for over 15 years I can honestly say that I've seen far worse from far more expensive software (EMC NetWorker and Veritas NetBackup -> I'm talking to you). If you want 100% reliable maybe EMC Avamar is for you... sure you can only use 80% of the storage, you can't backup to it for 8 hours a day, and you must have 2 of them for replication or have extra parity storage before they'll sell it to you... and it costs more than your house.

    I'm impressed the CrashPlan does seem to care about this thread, your engineer gave you viable solutions and to the credit of CrashPlan it atleast identified the problematic files versus enterprise software that just assumes the data is ok and restores bad data.

    Echoing what the other anon said... backups are better than no backups at all - which if your drive had been dead you'd be extolling how CrashPlan flawlessly recovered all but .004% of your files! And if your house had been burgled you'd have been screaming for joy that only .004% of your photos were lost since you had backed up to your colleague as well. That guy sitting there with a tape drive or a one button backup solution would still be crying.

    I back up my wife's laptop with Mozy... honestly because EMC has too much to lose by misplacing or misusing my data.

    CrashPlan appears to have an excellent feature set -- for the love of god don't drop Solaris -- and a pretty deft technical team based on the e-mail exchange, and a concerned management. In my book the pros far outweigh the cons in my book.

    Lastly your point about them redflagging this "bug" is truly interested by this blog... news travels fast. Some bugs are worth gambling that they are discovered, resolved, and never get out of the lab. Two weeks ago my storage array went down for 60 seconds cost us 12 hours of work recovering... no ones throwing EMC storage out of the environment, even though it went down on a known bug that a later version would have fixed...

  13. With the recent product update the + features are built in now (with online hosting). But to answer your question from the time period, I never found it onerous to know that I merely had to download the client, log in with my account details and then do the restore. I've missed the web interface for restore, and now that it's included, I'm still not likely to use it, but nice to know it's an option.

  14. I feel really sorry for those 3 poor "support engineers" who were truthful and confirmed the software problems.
    And I find it despicable to indirectly blame them for anything.

    After reading all this, it is clear that crashplan tried to do some damage control, but they opted to do it in a very dishonest and unprofessional way.
    Erasing posts for PRIVACY reasons?Gimme a break!
    Blaming the author of having bad data, while in fact their software corrupted the backup?
    Miscommunication my A$$.

    Reading more about the author, I have no doubt that he did everything correctly and that he is not to blame at all.

    On contrary, reading what Matthew Dornquast wrote gives a clear picture of how desperate he tries to restate the lies and ignoring the facts.

    The easiest to finally prove everything is to reinstall that version and try it out with different scenarios.

    Please upload all the versions you have, I'll be really happy to try them out.
    We could upload them to torrent, that way crashplan cannot control the spread of truth again.

  15. All the excuses and explanations are irrelevant.

    If they promised 100% then that's what they should deliver. If they can only deliver 99.53%, then that's what they should promise.

    Never inflate your customers' expectations beyond what you can deliver.

  16. Crashplan is streets ahead of Mozy. Don't trust your data with Mozy - the problems described here are minor compared to being bitten by the Mozy.

  17. crashplan way better than carbonite also. interesting to read this, thanks for sharing and i am going to doublecheck my backups too!!
