The 7 Deadly Sins…of Data Analysis

Please pardon me while I try to marry religion and science. Christians have a set of ethics they abide by, and I think we data scientists should as well. Though I can’t threaten you with fiery pits of hell, I will warn you that if you commit these sins you’ll probably end up *gasp* wrong. Now what’s worse than that?

Here they are, your 7 deadly sins of data analysis:

1. Pride = “Thinking you know better than the data”

Let’s face it, the reason we analyze data in the first place is that we don’t have all the answers. Intuition can often lead us astray, and following a hunch can cost millions. Though it can be humbling to acknowledge our limitations and ignorance, we must respect and revere data, even when it goes against our expertise.

Say you’re a heart surgeon trying to decide between two operations on a patient. From your experience, you believe that procedure A is superior, because you recall a lot of patients complaining after procedure B. However, there is a convincing body of evidence that suggests that procedure B outperforms procedure A on a variety of health outcomes. Do you ignore this data and go with your gut? Do you let anecdotes overshadow the facts?

2. Lust = “Relations with unclean, but enticing data”

The data that winds up in analysts’ hands is often raw and very dirty. Missing values, outliers, and transcription errors abound. Though it can be tempting to ignore these problems, we must put on our chastity belts, and get cleaning.

The best way to do this is to take a look at your descriptive statistics. Make some histograms. Graph a few scatter-plots. Check your normality assumptions and try to identify any data points that seem out of whack. Otherwise, you run the risk of your results getting skewed. I’ll be the first to admit that cleaning data isn’t the most glamorous process, but if you want honest and meaningful results, it is a necessary evil.
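As a rough illustration of that first pass, here is a minimal sketch using only Python’s standard library. The `summarize` helper and the `ages` list are made up for this example, and the 1.5 × IQR rule is just one common convention for flagging suspicious points, not the only way to do it.

```python
import statistics

def summarize(values):
    """Basic descriptive statistics for a list of numbers; None marks a missing value."""
    clean = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional outlier fences
    return {
        "n_missing": len(values) - len(clean),
        "mean": statistics.mean(clean),
        "median": median,
        "outliers": [v for v in clean if v < low or v > high],
    }

# Hypothetical ages: one missing value and one likely transcription error (360).
ages = [34, 29, 41, None, 38, 27, 33, 360, 30, 36]
report = summarize(ages)
print(report)
```

Notice how far the mean sits from the median here: a single bad entry is enough to drag it. A flagged point is a prompt to go look at the record, not a license to delete it automatically.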

3. Sloth = “Being lazy and only analyzing one metric”

Statisticians like shortcuts. The beauty of the “Average,” for instance, is that it packs a whole range of values into a single, digestible number. However, rigor is key in any statistical analysis, and data scientists need to make sure that they aren’t taking the easy way out. Though it is easier to say “Group A has more X than Group B” and collect your paycheck, it is crucial to always look at multiple variables and understand how they interact.

A study on sumo wrestler mortality illustrates my point here. On the surface, it is easy to say “Sumo wrestlers are obese, therefore they are unhealthy.” But, when you look at their rates of heart disease and diabetes, you’ll find that your conclusion couldn’t be further from the truth. Why? Because sumo wrestlers are active. Very active. Thus, we see that simply using the metric of “BMI” can lead to misleading results. But when we factor in other variables, such as “exercise,” we can draw a clearer picture of their health.
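To see how a single summary number can hide what matters, here is a toy sketch with invented values (the two groups below are hypothetical, not from any study). Both groups have exactly the same average, yet they look nothing alike once you check a second statistic.

```python
import statistics

# Hypothetical measurements for two groups with identical averages.
group_a = [48, 49, 50, 51, 52]
group_b = [10, 30, 50, 70, 90]

summary = {}
for name, values in (("A", group_a), ("B", group_b)):
    summary[name] = {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }
    print(name, summary[name])
```

Reporting only the mean would call these groups identical. The spread tells a very different story, which is why one metric is rarely enough.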

4. Greed = “Testing too many hypotheses”

In statistics, there are rarely any certainties. It’s not like math where 1+1 will always equal two. Chance and probability are the bread and butter of statistics. A statistical test will not lead to a definite answer. Instead, we hope to obtain a p-value low enough to say something like “results this extreme would be very unlikely if nothing were going on.” We are never absolutely sure. There is always a hint of uncertainty. The unexpected, in a way, is expected.

A statistically significant p-value (let’s take the widely used .05 threshold) means that there is less than a 5% chance that we’d see results this extreme under baseline conditions. In other words, even when the conventional assumptions are perfectly correct, about 1 test in 20 will appear to buck them purely by chance! So, if you get a little greedy and start testing a lot of hypotheses at once, you’re bound to achieve statistical significance. For instance, if you plug 20 independent predictors into your model and none of them has any real effect, there is roughly a 64% probability (1 − 0.95^20) that at least one will achieve statistical significance, simply by chance! As such, you want to build a rigorous model that minimizes uncertainty, correct for multiple comparisons, and test your findings repeatedly.
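The arithmetic behind the 20-predictor example is easy to check yourself. The sketch below assumes the tests are independent and that no predictor has a real effect; the function names are my own. It also shows the Bonferroni correction, one simple (and fairly conservative) way to rein in the greed.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate at or below alpha."""
    return alpha / n_tests

uncorrected = familywise_error_rate(20)
corrected = familywise_error_rate(20, alpha=bonferroni_alpha(20))

print(round(uncorrected, 3))  # 0.642
print(bonferroni_alpha(20))   # per-test threshold for 20 tests
print(corrected < 0.05)       # True
```

With the corrected threshold, each individual predictor has to clear a much higher bar, and the chance of being fooled at least once drops back under 5%.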

5. Gluttony = “Converting too many data into too many dashboards”

In the age of big data, numbers and figures are cheaper and easier to come by. But like fast food, we must be vigilant in how we consume them. If your boss asks you for a quarterly report, don’t cram every semi-relevant statistic you can find into the report. Instead, try to identify your KPIs (key performance indicators). Focus on the statistics that matter most: the ones that will resonate and drive change.

Also, try to find a way to make the data zing! Get creative! Your boss may not be moved by bar charts and line graphs, so it is important to put the human back into the data. In the book Switch: How to Change Things When Change is Hard, the authors describe a fascinating example of a data analyst who discovered that his company was wasting money and wanted to put a stop to it. They were buying gloves from different manufacturers rather than cutting costs by buying in bulk. Instead of creating a humdrum PowerPoint presentation, he purchased all the different types of gloves and tagged each one with its price. When the executives came in and saw this glaring visual representation of their mistake, they were quick to fix the error.

6. Wrath = “The data overlord knows best, leave logic at the door”

I’m putting this under wrath, because this issue never fails to make me angry. Attention everybody, statistics isn’t about plugging data into computers and letting them do all the work. There is critical thinking involved. There is a need for humans. Though we use p-values and t-tests in different analyses across different domains, each project has its own nuances and contextual factors that need to be understood. And as of now, even the most sophisticated computers can’t do it all.

If you have a large enough sample size, you can find statistical significance pretty regularly. But that doesn’t mean your results are necessarily meaningful. It is important to think critically about the size of the difference in your results. A 5 point difference in average SAT score between males and females may prove to be statistically significant, but would that difference be enough to claim definitively that one gender is superior at filling out bubbled scantrons? So, if you have a large sample size, you might want to lower your significance threshold to set a higher standard for statistical significance, and report the size of the effect alongside the p-value.
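Here is a small sketch of that SAT scenario. The standard deviation of 200 points and the sample sizes are assumptions I picked for illustration, and the helper uses a simple two-sample z-test with equal group sizes rather than anything fancier.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided p-value and Cohen's d for two equal-sized groups sharing one SD."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    cohens_d = mean_diff / sd                   # effect size in SD units
    return p_value, cohens_d

# Same hypothetical 5 point gap, very different sample sizes.
p_small, d_small = two_sample_z(mean_diff=5, sd=200, n_per_group=100)
p_large, d_large = two_sample_z(mean_diff=5, sd=200, n_per_group=100_000)

print(f"n=100 per group:     p={p_small:.3g}, d={d_small}")
print(f"n=100,000 per group: p={p_large:.3g}, d={d_large}")
```

The effect size is identical in both runs, a tiny 0.025 standard deviations. Only the p-value changes, and it changes because of the sample size, not because the gap became any more meaningful.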

7. Envy = “Coveting Thy Neighbor’s Data”

Even we humble statisticians are prone to be a little jealous at times. We may wish to replicate our competitor’s success. We may desperately yearn to meet industry benchmarks. But, in doing so, it is important to recognize that we may be setting unrealistic goals.

When we perform analyses, it is important to be as objective and impartial as possible. Strategies that are effective with your competitors may be ineffective with your particular clientele. If we do not check our biases at the front door when beginning research, we run the risk of following false trails and getting misleading results. You should let yourself be surprised by the data and be comfortable with reaching unexpected conclusions. Let the data point you in the right direction. Don’t simply copy your competitors.

Surely, there are other sins and vices that may tempt even the most virtuous of statisticians. But by steering clear of these follies, we can ensure that our work will be found favorable in the eyes of God. And by God, I mean Nate Silver.