How I Almost Recovered Data from a Failed Drive

As mentioned in a previous posting, we recently suffered a huge data loss when our NAS drive failed. Though a lot of the data could be reconstructed, we decided to try to recover as much as we could from the failed drive, whether through a drive recovery service or through open source tools. Because drive recovery services offered by third-party vendors are so expensive (usually >$500), I opted to try it myself first.

Upon taking the drive out of the original manufacturer housing, I attempted to connect it via SATA cables to various machines with varying degrees of luck. On one of the machines that was able to detect the drive, we ran a multitude of drive tests, all indicating failure. SeaTools for DOS, a popular drive testing program, can actually repair bad sectors. However, after several passes through the drive, SeaTools reported that there were no more repairs it could do.

Some operating systems detected the drive but couldn’t read any data from it, including the partition table. Some Windows machines could see the drive, but the manufacturer had partitioned it into 4 RAID partitions formatted as EXT3, a popular Linux file system for NAS drives, so Windows couldn’t even mount it. The only operating system that consistently detected the drive was Ubuntu 16.04, but it reported bad sectors and also refused to mount the drive.

It became clear relatively quickly that I would have to use a tool designed to read data from failed hardware. Since Ubuntu 16.04 was the only OS that consistently read the drive, I decided to use the open source tools available in its package repositories. Because we would eventually be running a tool that looks through the corrupted file system, we didn’t want to keep reading from and writing to the failing drive, potentially causing more damage. The first program I installed was GNU ddrescue, which Ubuntu packages as gddrescue. It copies the data from a failed drive to an image file or a new drive so that later recovery work can be done on the copy, limiting the risk of further damaging the failed HDD. Opening up a Terminal window in Ubuntu, I ran

sudo apt-get install gddrescue

to install the utility. After checking a few of the command line options, I ran the program using

sudo ddrescue -r 2 /dev/sda2 /media/dan/image /media/dan/logfile

This tells ddrescue to copy the partition /dev/sda2 to the image file /media/dan/image, make up to two retry passes over bad sectors (-r 2), and record its progress in the log file /media/dan/logfile. It ran at a speed of about 40 MB/s. At this rate, imaging all 2 TB was going to take around 14 hours, so I left it running for a few hours.
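The 14-hour figure follows from dividing the drive size by the observed throughput. Here is a quick sanity check in the shell, assuming decimal units (2 TB = 2,000,000,000,000 bytes, 40 MB/s = 40,000,000 bytes per second) and a constant transfer rate, which a failing drive is unlikely to sustain. The function name estimate_hours is made up for this sketch.

```shell
#!/bin/sh
# estimate_hours SIZE_BYTES RATE_BYTES_PER_SEC
# Hypothetical helper: prints the estimated copy time in hours, to one decimal place.
estimate_hours() {
    size_bytes=$1
    rate_bytes_per_sec=$2
    seconds=$((size_bytes / rate_bytes_per_sec))
    # The shell only does integer math, so work in tenths of an hour.
    tenths=$((seconds * 10 / 3600))
    echo "$((tenths / 10)).$((tenths % 10))"
}

# 2 TB at 40 MB/s: prints 13.8 (truncated, not rounded)
estimate_hours 2000000000000 40000000
```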

After around 2 hours, I noticed that the entire OS had frozen, something I have rarely experienced with Ubuntu. I powered down the machine, backed up the partial image file created by ddrescue, connected the drive to another machine running Ubuntu, and ran the same commands. Sure enough, after two hours that machine hung as well. I made the executive decision to stop trying to image the drive, for fear of the read head causing more internal damage, and to leave this work to third-party recovery professionals.
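For anyone who does want to keep going after an interruption like this: GNU ddrescue records which areas it has already copied in the log file, so rerunning it with the same image and log file resumes where it left off instead of starting over. The sketch below only prints a two-stage plan rather than running anything against a drive. It is not what I ran, and the helper name print_ddrescue_plan is made up for illustration. It assumes the image and log file from the first run are still in place.

```shell
#!/bin/sh
# print_ddrescue_plan DEVICE IMAGE LOGFILE
# Hypothetical helper: prints (does not run) a two-stage GNU ddrescue plan.
print_ddrescue_plan() {
    device=$1
    image=$2
    logfile=$3
    # Stage 1: -n skips the slowest phase, so the easily readable areas
    # are copied first with as little stress on the drive as possible.
    echo "sudo ddrescue -n $device $image $logfile"
    # Stage 2: same image and log file, so only the areas still marked
    # bad are revisited. -d reads the device directly, bypassing the
    # kernel cache, and -r 3 allows up to three retry passes.
    echo "sudo ddrescue -d -r 3 $device $image $logfile"
}

print_ddrescue_plan /dev/sda2 /media/dan/image /media/dan/logfile
```

Printing the commands first makes it easy to double-check the device name before anything touches the drive.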

But just for grins, I wanted to see how much data I could actually recover from the ~50 GB partial image I had backed up. I installed foremost, a command-line file carver that recovers files by looking for their headers, footers, and internal data structures rather than relying on the file system. To install foremost, I ran

sudo apt-get install foremost

and

sudo mkdir -p /recovery/foremost

sudo foremost -i /media/dan/image -o /recovery/foremost

which simply creates a directory for the recovered files and then carves files out of the image created with ddrescue. The tool ran for 30 minutes before completing, and from the ~50 GB image file it was able to recover 2 files totaling an outstanding 3.76 MB. The first was a PDF about proper interview techniques and the second was this picture. Enjoy.
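foremost sorts what it recovers into subdirectories by file type and writes an audit.txt report alongside them. To tally the haul without clicking through folders, something like the following works. It is a minimal sketch: the function name summarize_recovered is made up, and it counts every regular file under the output directory except the audit report.

```shell
#!/bin/sh
# summarize_recovered DIR
# Hypothetical helper: prints how many files were recovered under DIR
# and their combined size in bytes, ignoring foremost's audit.txt report.
summarize_recovered() {
    dir=$1
    # The $(( )) wrapper strips the padding some wc implementations add.
    count=$(( $(find "$dir" -type f ! -name audit.txt | wc -l) ))
    bytes=$(( $(find "$dir" -type f ! -name audit.txt -exec cat {} + | wc -c) ))
    echo "$count files, $bytes bytes"
}

# Usage (the output directory may need sudo to read):
#   summarize_recovered /recovery/foremost
```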

[Image: 00050440]

Our server failed. What went wrong and where to go from here.

A few weeks ago, the primary storage device (a 2 TB Iomega NAS) where we keep most of our commonly used files abruptly stopped working. When we checked the hardware, we noticed a blinking light that indicated some sort of failure. This was a bit disconcerting considering that our last backup of the entire drive was over a year and a half old. Although a lot of data was lost, it wasn’t entirely catastrophic, in the sense that a lot of the mission-critical data could be recovered or reconstructed with time. Most of the data on the drive consisted of the current software we install across the majority of our machines, old versions of software, TV recordings, hardware inventories, notes, and photos. Over the course of a few weeks, most of the commonly used software has been repackaged on a temporary server. I’m currently reconstructing notes and hardware inventories and investigating ways to recover data from the failed drive, so far with little success.

This is a prime example of how considering possible failures and performing preventative maintenance could have prevented a major loss of productivity. Though a lot of important files were backed up individually throughout the year, the fact that a full backup hadn’t been done in over a year and a half shows that following our philosophy is much harder in practice than in theory. To prevent something like this from happening again, we are setting up our new server in a way that brings the chance of losing any important data as close to zero as we can get it.

The temporary server I set up consists of two 4 TB HDDs housed in a machine that already runs 24/7. Each drive is split in half. The first 2 TB of each drive is allocated to mirrored storage. This redundant partition is meant for the files that are most frequently used, along with more critical files such as notes, hardware inventories, and setup packages for programs. The remaining 2 TB on the first HDD is used for periodic backups of the mirrored partition. The remaining 2 TB on the second HDD holds archived setup files and old backups that aren’t generally touched. The idea is that at some point, when these drives have a permanent place in a machine, a script will do incremental backups of the mirrored partition to the backup partition as well as to the cloud, eliminating the chance that we forget to back it up again.
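The backup script doesn’t exist yet, but the core idea of an incremental backup is simple enough to sketch: remember when the last backup ran and copy only the files that have changed since. The version below is an illustration using nothing but standard shell tools, not the script we will end up with. The function name incremental_backup and the .last_backup marker file are made up for this example. It assumes the backup directory is not inside the source directory and that file names contain no newlines. It also never removes files from the backup that were deleted from the source, and it leaves the cloud copy out entirely.

```shell
#!/bin/sh
# incremental_backup SRC DEST
# Hypothetical helper: copies files under SRC that changed since the
# previous run into the same relative paths under DEST.
incremental_backup() {
    mkdir -p "$2" || return 1
    # Resolve both directories to absolute paths.
    src=$(cd "$1" && pwd) || return 1
    dest=$(cd "$2" && pwd) || return 1
    marker="$dest/.last_backup"

    # Stamp the start time before copying, so anything modified while
    # the backup runs is picked up again on the next run.
    touch "$marker.new"

    (
        cd "$src" || exit 1
        if [ -f "$marker" ]; then
            # Later runs: only files modified after the last backup.
            find . -type f -newer "$marker"
        else
            # First run: everything.
            find . -type f
        fi
    ) | while IFS= read -r file; do
        mkdir -p "$dest/$(dirname "$file")"
        cp -p "$src/$file" "$dest/$file"
    done

    mv "$marker.new" "$marker"
}

# Usage, with made-up mount points:
#   incremental_backup /mnt/mirror /mnt/backup
```

Run from a scheduler such as cron, this would take the "remember to back up" step out of human hands, which is the whole point. A dedicated tool such as rsync would probably be a better fit for the real thing.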

Philosophy

I mentioned in my previous post that I firmly believe in the ideas and philosophy that have guided my work at Saks & Associates for the past few years and now I’d like to elaborate on that with some level of abstraction to avoid getting bogged down in details.

One of the main philosophies we follow is to assume that at any point, failure is a possibility that we need to be prepared for. Our main goal is to minimize the risk of losing data and to avoid repeating any amount of work. It is important to note that there are a few different types of failures and errors that can occur when working with any system of computers. I separate them into three categories.

  1. Software bugs
  2. Hardware failures
  3. Human error

Any one of these failures can put an immediate halt to work, and if the right precautions aren’t put in place beforehand, a huge amount of time and productivity will be lost fixing the damage. Obviously things like software bugs and hardware failures are out of our control, but we spend a lot of time considering all of the “what if?” scenarios that could transpire and adjust our plan accordingly. Our attitude towards human error is to acknowledge that we positively will make mistakes, but by going over our own work again and having the other person double-check it, we are able to minimize the number of errors we make.

Dan has also put a strong emphasis on designing a product that has the end user in mind. When talking about writing code, Dan often quotes Scott Meyers, saying “make interfaces easy to use correctly and hard to use incorrectly”. Although this quote refers to good programming technique, it seemed natural to me to extend the idea to a more general “make it easy for the user to use anything you’ve given them correctly and hard to use incorrectly”. Always keep the end user in mind and constantly think about how they may use what you’ve created for them.

For me, this is one of the ideas that I have struggled with executing well and fully integrating into my thought process. It requires completely changing the way you think about the world, getting into the mindset of someone else, and escaping the confines of your own scope to see how someone else might use something differently.

So how does this all come together? What does this mean for the way work gets done? Well, it means that we spend the majority of our time making sure things are done right and that all possible failures are taken into account. It also means that new ideas we have for improving the way we do something take a lot more time to implement. It took a while and a lot of mistakes for me to realize that while the process may be slow, a finished product that has had a lot of planning and thought put into it will almost always work better and make the user happier than something rushed. I still struggle with this, and I expect I will continue to make mistakes in the future, but I believe that having these basic ideas in the back of your mind during your work will greatly affect the finished product.

Introduction

Hello! My name is R.J. Quillen and I’m a senior at Wittenberg University majoring in Computer Science and minoring in Math. My main interests include any sort of tinkering with computers, solving puzzles, and the emergence of electric vehicles. It’s my hope that this introduction describes what this blog is about.

In the spring of 2014 I was hired by Saks & Associates, a company run by Dan Saks that “provides training and consulting services for C, C++ and embedded programming with C++”, to assist with automating various system administration tasks for a small network of computers running almost every major version of Windows since 2000. Since then, I have been working towards building a system that limits the amount of maintenance and time spent on administrative tasks and provides a way to automate those tasks to increase productivity. I believe that the design approach and overall philosophy we have taken over the past few years would work exceptionally well for many small businesses and some mid-sized home networks.

When I was first hired on, Dan had been doing all of the system administration by himself along with developing and delivering courses. Within the first few months of working for Dan, it became increasingly clear that I had to undergo a severe change in philosophy and in the way I do work in order to excel at my job. I initially didn’t understand the system Dan had already put in place or the ideas that drove it, but over the past few years I have become a firm believer in Dan’s philosophy and the way he approaches problems. My goal with these first few blog posts is to drive those points home and to explain the approach we have taken to maintaining our network. My overall goal with this blog is to provide answers to others who are interested in the problems I work with and to learn from readers who might have differing views and ways of thinking about problems.