
We are all about marketing, data, analysis, innovation and technology

Saturday, June 21, 2014

A Statistician's View on the IRS Investigation


If you have been following the congressional investigation into IRS practices, you may have heard that seven individuals under investigation all had hard drive failures at or about the same time.  The IRS further claimed that every failure made it impossible to retrieve any data from the seven drives.  Here is a link to the Miami Herald's June 20, 2014 coverage of the hearings, in case you have been vacationing on a remote island the past week:  http://hrld.us/1lIktol.
 

Photo Source:  C-SPAN 

As a seasoned statistician and data analyst, I thought to myself, what are the odds?  Is this very likely?  Maybe it is.  After all, we live in a digital world, and who among us hasn’t had a hard drive crash, or had a close friend who had one?  So I decided to investigate just how long the odds are of seven hard drives failing and, beyond that, of no data being salvageable from any of the seven.

As any statistician knows, this is a straightforward calculation.  It simply requires some data, a few assumptions and a calculator that can multiply.

Data Required
1.  What is the failure rate of any hard drive?   Hard drives tend to fail more often as they age, and failure rates also differ by manufacturer.  I found an article on Lifehacker with some good information comparing major manufacturers (reference:  http://lifehacker.com/the-most-and-least-reliable-hard-drive-brands-1505797966).

Lifehacker also provided some normative failure rates across all drives (reference: http://lifehacker.com/how-long-your-hard-drive-is-likely-to-last-1462918832).  In particular, this second article says that if you consider a hard drive’s life in three segments, the probability of a failure in year 1 is 5.1%.  Past year 1, but before year 3, the failure rate drops to 1.4%.  After three years, the failure rate is 11.8%.

2.  What is the probability of retrieving information on a crashed hard drive?   In looking into the various causes of hard drive failure, I found that many things can render a hard drive inoperable without damaging the disk itself.  If the disk itself is intact, the data should be retrievable by strategies that include (1) taking the disk from the inoperable drive and swapping it into a drive that works or (2) taking a transient voltage suppressor (TVS) diode off the circuit board, which can bring the hard drive back to life long enough to copy files to a fully functional disk drive.  Joel Hruska wrote an excellent piece on these and other techniques, "Raising the dead:  Can a regular person repair a damaged hard drive?" (reference: http://www.extremetech.com/computing/133294-raising-the-dead-can-a-regular-person-repair-a-damaged-hard-drive).

What cannot be repaired is actual damage to the disk itself.  So I looked into the success rates of companies that do this sort of thing (and publish success rate metrics on their websites).  I found a company in Grand Rapids, Michigan called Data Recover that gives a data recovery success rate of 95%.  I also found a German company called Freecom that gives a data recovery success rate of 98%.

Assumptions 
We will assume the crashes of the seven hard drives are independent, meaning that in an office setting, a co-worker's hard drive crashing has nothing to do with the probability that your own workstation will experience a hard drive crash.  This is a relatively safe assumption.

Additionally, we will assume the ability to retrieve the information from one drive has nothing to do with the ability to retrieve data from another drive.  Again, I think this is a safe assumption, provided a good effort is made to retrieve the data on each damaged hard drive.

The Equation 
Assuming independence of events, the probability of seven hard drive failures is the probability of a failure on the first computer, multiplied by the probability of a failure on the second computer, and so on through the seventh computer.  If we assume the probability of any one hard drive crashing is 11.8%, then the probability of seven hard drives crashing at the office is 11.8% to the 7th power:

(0.118)**7 = 0.0000003185473901 
or converting it to odds
1 in 3,139,250
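
The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes, as the post does, that the failures are independent and that every drive has the same failure probability. The helper names prob_all_fail and one_in are hypothetical, introduced here purely for illustration.

```python
def prob_all_fail(p_fail, n_drives=7):
    """Probability that n_drives independent drives all fail,
    each with the same probability p_fail (an assumption of the post)."""
    return p_fail ** n_drives


def one_in(probability):
    """Express a probability as the N in '1 in N'."""
    return 1.0 / probability


# Failure rates quoted from the Lifehacker article cited above
rates = {"year 1": 0.051, "years 1 to 3": 0.014, "after 3 years": 0.118}

for label, p in rates.items():
    prob = prob_all_fail(p)
    print(f"{label}: p = {prob:.4e}, about 1 in {one_in(prob):,.0f}")
```

With the 11.8% rate this gives roughly 1 in 3.1 million, matching the figure above. The two lower rates make seven simultaneous failures rarer still, so 11.8% is the most generous of the three choices.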

If we further factor in the probability of not being able to retrieve data from any of the seven hard drives, we need to include, for each computer, the 5% probability that the data would be irretrievable (based on the 95% recovery success rate above).  To do this, we simply multiply the above figure by 0.05 to the 7th power:

0.0000003185473901 x (.05)**7 = 0.0000000000000002488651485 
or converting it to odds
1 in 4,018,240,425,000,000
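
The combined figure above can also be checked in Python. The sketch below assumes that drive failure and unrecoverability are independent of each other and across drives, which is the assumption the post makes. It also tries the 98% recovery rate quoted for Freecom. The helper name joint_probability is hypothetical and does not come from any of the cited sources.

```python
def joint_probability(p_fail, p_recover, n_drives=7):
    """Probability that n_drives independent drives all fail AND
    none of them can be recovered (independence is assumed)."""
    p_unrecoverable = 1.0 - p_recover
    return (p_fail * p_unrecoverable) ** n_drives


P_FAIL = 0.118  # failure rate for drives older than three years, from the cited article

for p_recover in (0.95, 0.98):  # recovery success rates quoted above
    prob = joint_probability(P_FAIL, p_recover)
    print(f"recovery rate {p_recover:.0%}: p = {prob:.4e}, about 1 in {1.0 / prob:,.0f}")
```

With the 95% recovery rate this reproduces the figure of roughly 1 in 4 quadrillion. Using the 98% rate instead makes the event even less likely.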

To put this in perspective, the odds of an individual being hit by a meteorite have been calculated by NASA to be 1 in 20,000,000,000,000 (reference: http://www.theguardian.com/commentisfree/2011/oct/13/meteorite-space-earth).  And the odds that you will win a Powerball Lottery are 1 in 175,223,510 (that is why most statisticians never play the lottery, by the way!).

What do you think?  Could it be true?  Would love to hear your thoughts and whether you agree with my assumptions. #lovestatistics #bigdata #numbercruncher 

Rhonda Knehans-Drake 
   Assistant Professor, New York University & CEO, Drake Direct

1 comment:

  1. I was thinking along the same line as you except for the data recovery piece. This analysis should be part of the discussion, it seems to me. Have you thought to contact interested parties?

    Gregg
