What better way to talk about statistics than to start with an equation: data visualization = statistics + psychology.
The aim of this discipline is to “make the data speak”: to make the main information contained in a series of data comprehensible at a glance. The results are therefore published in the form of pie charts, histograms, and so on.
As a result, the errors that can be committed in this trade are of two kinds:
- describing the statistical reality badly;
- forcing the psychology of readers and introducing interpretation bias.
Nothing prevents you from mixing the two to reach the top 10 of the most spectacular dataviz blunders, as they say in the trade.
Let’s review the most interesting ones.
1. Falling into the average trap
You have inevitably read phrases such as: “The average European drinks a liter of beer a day”, and wondered who this mysterious “average European” was and where you might meet him. Obviously, he does not exist. In some countries, people drink more wine than beer. There are even people who do not drink alcohol at all, children in particular. They certainly cannot drink a liter of beer a day!
The people who make this kind of assertion usually start from a figure like “Every year, 109 billion liters of beer are consumed in Europe”. They then divide this figure by the number of days in a year and the total population of Europe. This would make sense, if the data were normally distributed.
And when we say “normally” here, it is in the sense of a “normal” distribution in statistics, that is, one conforming to the Gaussian distribution, known as the normal law.
This image shows three normal distributions. They have the same average, and yet they do not tell the same story at all. What the average does not tell you is the spread around it. Moreover, very often you are not even facing a normal distribution.
Take data such as income. The notion of “average income”, for example, gives your brain a magic number: half of the population earns less than this figure, the other half earns more. This is what our ear (wrongly) hears when we say this word, and that is what our brain visualizes.
But it’s wrong. Generally, most people earn less than this magic number, simply because incomes are not “normally” distributed. The word “normally”, again, has nothing to do with any opinion, but with the so-called “normal” statistical law. The distribution of incomes does not follow it.
This is the US income distribution in dollars for households earning up to US $200,000. Note the long tail, which distorts our perception of the “average”.
If the average income increases, it may be because most people earn more. But it can also be because the highest earners earn even more.
Economists are familiar with this problem and have added another value to analyze it.
The Gini coefficient gives indications on the distribution of income, and it is now at the center of debates among economists working on incomes.
When working on data, you will very often encounter cases where the use of the average is problematic.
How do we tell a better story?
Express the results by indicating the spread: say “Europeans drink between 0 and 5 liters of beer per day, with an average of 1 liter”.
Use the notion of “median”. Talk about the median salary, below and above which the two halves of the population are divided equally.
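A minimal sketch in Python (the income figures are invented for illustration) shows how a few high earners pull the mean far away from the median:

```python
from statistics import mean, median

# Hypothetical incomes: most people earn a modest amount,
# one very high earner pulls the average up.
incomes = [20_000, 25_000, 30_000, 32_000, 35_000, 40_000, 45_000, 500_000]

print(mean(incomes))    # 90875 — pulled up by the outlier
print(median(incomes))  # 33500 — splits the population into two equal halves
```

Seven people out of eight earn less than the “average income”, which is exactly the trap described above.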
2. Taking an average that says nothing
A particular case of the previous point: making an average speak to say nothing, or rather, not making it say what it should.
For example, let’s analyze the average monthly order value in a store. The first graph shows that things are going well, since this average order value increases regularly, month by month.
Yet if we break the average down (the average order values by category of customer), we realize where the growth is actually coming from.
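A toy example (all figures invented) of how an overall average can rise while every category’s average stays flat, simply because the customer mix shifts:

```python
# Hypothetical monthly orders for two customer categories.
# Each category's average order value is flat; only the mix changes.
months = {
    "Jan": {"retail": [10] * 90, "wholesale": [100] * 10},
    "Jun": {"retail": [10] * 70, "wholesale": [100] * 30},
}

for month, cats in months.items():
    all_orders = [v for orders in cats.values() for v in orders]
    overall = sum(all_orders) / len(all_orders)
    per_cat = {c: sum(o) / len(o) for c, o in cats.items()}
    print(month, overall, per_cat)
```

The overall average jumps from 19 to 37 even though retail customers still average 10 and wholesale customers still average 100: the “growth” is only a change in mix.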
3. Creating a pie chart presenting shares of an average
It may be tempting in the above situation to synthesize these data by creating a pie chart showing the different average order values. But what would such a graph mean?
A pie chart is meaningful only if it represents the different parts of a whole. But how can you present the different “shares” of an average?
The “contribution of each department to the overall average order value” represented in this example loses far more information than it conveys and makes the reader’s brain work for… nothing!
4. Presenting a pie chart in the wrong order
From a strictly statistical point of view, the order in which you present the different actors that share a whole does not matter. For example, in the cola market, Coca and Pepsi take the lion’s share, followed far behind by a bunch of small brands, no matter in what order you present the data.
Well, that’s probably what your mathematics professor would tell you.
But the brains of the readers to whom you present it will not “capture” the same information at all depending on the order in which you present your pie chart of “who holds what market share”.
Hubspot has identified two good ways of presenting a “speaking” pie chart.
Option 1: place number 1 at twelve o’clock and spread the pie slice it represents clockwise. Place number 2 at twelve o’clock as well (next to number 1, but to its left) and spread its slice counterclockwise. The following slices are then placed below.
Option 2: place number 1 at twelve o’clock and spread the slice it represents clockwise, then number 2 in the same direction, and so on, in descending order.
Look at these pictures and ask yourself what conclusion you draw in the half second after your eyes land on each graph. You will immediately realize that your brain needs less effort to interpret what these pie charts say.
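Option 2 is the easiest to automate. A minimal sketch (brand names and shares are invented) that sorts the slices in descending order; with matplotlib, `startangle=90` together with `counterclock=False` then draws them clockwise starting from twelve o’clock:

```python
# Invented market shares, in percent.
shares = {"Coca": 42.0, "Pepsi": 31.0, "Others": 11.0, "Brand C": 9.0, "Brand D": 7.0}

# Option 2: sort slices in descending order of market share.
ordered = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
labels = [name for name, _ in ordered]
values = [v for _, v in ordered]
print(labels)

# With matplotlib, the pie would then be drawn clockwise from noon:
# import matplotlib.pyplot as plt
# plt.pie(values, labels=labels, startangle=90, counterclock=False)
```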
5. Not being suspicious of “others”
Yes, you have to think about the others when creating an “other” category. Often, to summarize and ease reading, one concentrates the analysis on the main actors of a statistical series (for example the top ten), and then, so that the whole sums to 100%, lumps those that follow into an “other” category.
The problem arises when the “other” category is greater than the sum of all the previous ones.
For example, as shown below, the first graph lets you think that the top ten of this series covers the whole studied population.
The second chart, however, tells a whole different story: that of the long tail.
Yet in current economic models (and therefore in data visualization for marketing, for example), we are always on the lookout for long-tail phenomena. This is often where future profits lie. It would be a shame to lose them along the way.
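A small guard (the shares are invented) that flags the situation described above, where the lumped bucket outweighs everything shown explicitly:

```python
# Hypothetical market shares, in percent: a top 3 plus a lumped "Other" bucket.
shares = {"A": 18.0, "B": 12.0, "C": 9.0, "Other": 61.0}

top = sum(v for k, v in shares.items() if k != "Other")
other = shares["Other"]

# Warn when "Other" outweighs all the explicitly shown actors:
# the chart would then hide the long tail where the real story may be.
if other > top:
    print("'Other' dominates: the long tail deserves its own breakdown")
```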
6. Letting the reader do too much work
Do not play “the less I stick my neck out, the better I do my job”.
In data visualization, your work needs to show. So do not leave the data raw, even if, impressionistically, it expresses something. Remember to make the trends visible and, as here, to clearly mark the regression lines.
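For instance, a least-squares trend line can be computed in a few lines (the data points are invented):

```python
# Minimal least-squares fit: the trend the reader should not have to guess.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(f"y = {slope:.2f}x + {intercept:.2f}")  # the regression line to overlay
```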
7. Misleading with sizes
Using bubbles to show the sizes of two different populations is perfectly fine. However, if you want to show that population B is twice the size of population A, it is the area of bubble #2 that must be twice the area of bubble #1.
Now, the area of a circle is S = πr².
You do not double the area of a circle by multiplying its diameter by 2!
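In code (the radius value is arbitrary): to double the area, the radius must grow by a factor of √2, not 2.

```python
import math

def area(r):
    return math.pi * r ** 2

r1 = 30.0                     # radius of bubble #1 (arbitrary units)
r2_wrong = 2 * r1             # doubling the radius -> quadruples the area
r2_right = r1 * math.sqrt(2)  # doubles the area, as intended

print(area(r2_wrong) / area(r1))  # 4.0 (misleading)
print(area(r2_right) / area(r1))  # 2.0 (correct)
```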
8. Not building random samples
There is an increasing tendency to use statistical series from sources such as Internet “surveys” (online questionnaires).
Problem: the statistical validity of such series is zero if the sample of the responding population is not randomly drawn (or constructed to reproduce randomness, as in the quota method).
Of course, the larger the responding population, the less this bias matters, thanks to the law of large numbers; but “large” really means large. And the current trend, especially with these pseudo-polls, is to consider that the number, even when small, is “big enough”. This is inevitably biased.
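A simulation (all response rates invented) of why self-selection ruins a poll while a genuine random sample does not:

```python
import random

random.seed(0)

# Hypothetical population: 20% hold opinion "yes".
population = ["yes"] * 20_000 + ["no"] * 80_000

# Self-selected "online poll": people holding the opinion respond far more
# often (invented response rates), so they are heavily over-represented.
respondents = [p for p in population
               if random.random() < (0.30 if p == "yes" else 0.02)]
biased = respondents.count("yes") / len(respondents)

# Proper random sample of the same size.
sample = random.sample(population, len(respondents))
unbiased = sample.count("yes") / len(sample)

print(f"online poll: {biased:.0%}, random sample: {unbiased:.0%}")
```

The self-selected poll reports roughly 80% “yes” for a population that is actually 20% “yes”; the random sample of identical size lands near the truth.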
9. Confusing correlation and causality
Many epidemiological studies show that women on hormone replacement therapy (HRT) also have a lower incidence of coronary heart disease. Eminent doctors deduced that hormonal treatments protect against coronary heart disease.
So the investigation continued, says this Wikipedia article, with random samples and tests on those samples, and the opposite was found: taking HRT increased the incidence of coronary heart disease.
Continuing the survey, it was found that women enrolled in hormone therapy were more likely to belong to higher socio-economic groups, with better diets and healthier lifestyles.
In other words, the phenomena “taking HRT” and “low incidence of coronary artery disease” shared a common cause, but were not directly linked to each other as one might think at first reading.
Confounding correlation and causality is a logical error like so many others.
As soon as a correlation has been established, it is necessary to go further, in particular using statistical tools such as the Granger causality test or convergent cross mapping (CCM).
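Those tests live in dedicated libraries; as a simpler illustration, this simulation (all parameters invented) shows a confounder z producing a strong correlation between two variables that do not cause each other at all:

```python
import random

random.seed(1)

# Hypothetical confounder: z drives both x and y.
# x and y never influence each other, yet they end up correlated.
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

def pearson(a, b):
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

print(f"corr(x, y) = {pearson(x, y):.2f}")  # high, despite no causal link
```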
10. Not taking false positives into account
Imagine that you have just installed a high-performance surveillance system in your shop. It is 99% accurate at identifying thefts.
The alarm goes off. What is the probability that the person who triggered it committed a theft? You will spontaneously answer: 99%! Tempting, isn’t it? Yet it is not true!
In your shop there are honest people and thieves. Suppose there are 10,000 honest customers and 1 thief. Each of them passes through the control.
The alarm will be triggered 101 times: 100 times erroneously, since with 99% reliability it wrongly fires for 1% of the 10,000 honest customers, and 1 time when it detects the real thief.
The probability that a person who triggered the alarm is a thief is therefore 1/101 ≈ 0.99%.
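The arithmetic above can be checked directly:

```python
# The shop example: 10,000 honest customers, 1 thief, alarm wrong 1% of the time.
honest = 10_000
false_alarms = honest * 0.01   # 100 erroneous triggers among honest customers
true_alarms = 1                # the one real thief is detected

p_thief_given_alarm = true_alarms / (true_alarms + false_alarms)
print(f"P(thief | alarm) = {p_thief_given_alarm:.2%}")  # 0.99%
```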
Overestimating the probability of this type of event is a scenario well known to psychologists: the base rate fallacy, the mistake of forgetting the context. It is the source of misinterpretation of the large number of “false positives” in mass detection systems (airport security, etc.).
It is not enough to take into account the performance of the detection device (the means of measurement); you must also take into account the characteristics of the overall population itself. The probability that the event has actually occurred depends on both, not just on the measurement.
Another example: a doctor is less likely to be mistaken in diagnosing influenza in a patient because, beyond studying the patient’s symptoms, he also knows whether there is an influenza epidemic in the population at the time he makes the diagnosis.