The term “Big Data” appears every day in the media and in our daily lives. It is promising and, more broadly, it illustrates the growing importance of statistics for decision makers (e.g., the data-driven business). It also retains most of the fundamentals of traditional data analysis. To obtain meaningful results, it helps to master some basic concepts of the statistical approach.
Data Analysis – Why?
Statistical analysis can have three types of objectives:
- Describe – This is the case for factorial analysis or clustering aimed, for example, at segmenting a prospective population in order to target a marketing campaign.
- Explain – This can be illustrated by clinical studies that identify the risk factors associated with a given pathology or, conversely, that establish the efficacy of a treatment.
- Predict – This is, for example, the promise of Big Data: making it possible to identify who will be interested in which product, and when. However, a predictive study does not strictly require Big Data, unless we want to rely heavily on the analysis of unstructured text data.
The ultimate goal behind these three objectives, common to all statistical processes in companies, is to develop decision-making tools for the decision-maker(s). These approaches are covered by the term “Business Intelligence (BI)”, which mainly traces the past activity of the company, now supplemented by the term “Advanced Analytics” for the predictive aspect, which looks towards future activity.
However, there is one thing statistics cannot provide: absolute certainty. By definition, there will always be at least two sources of uncertainty in a statistical study:
- Bias, or approximation error, which comes from the simple fact that we draw conclusions from only a fraction of the population we are interested in. This fraction is the sample studied. These approximations can be compounded by the rounding or simplifications specific to the mathematical methods used to handle complex cases.
- The variability intrinsic to the phenomenon studied, measured in statistics by the variance. This is why two identical twins, raised under the same conditions, will not necessarily want the same gift at Christmas, and why two people following the same treatment will not necessarily have the same clinical course.
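These two sources of uncertainty can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the population is hypothetical (normally distributed values with mean 50 and standard deviation 10), and the function name `simulate` and its parameters are invented for this example. It draws one random sample, compares the sample mean with the population mean (the approximation error), and reports the sample variance (the intrinsic variability).

```python
import random
import statistics


def simulate(population_size=10_000, sample_size=200, seed=0):
    """Draw one sample from a hypothetical population and summarize it."""
    rng = random.Random(seed)

    # Assumption: a made-up population, normally distributed (mean 50, sd 10).
    population = [rng.gauss(50, 10) for _ in range(population_size)]

    # The sample is the fraction of the population we actually observe.
    sample = rng.sample(population, sample_size)

    population_mean = statistics.fmean(population)
    sample_mean = statistics.fmean(sample)

    return {
        "population_mean": population_mean,
        "sample_mean": sample_mean,
        # Error made by concluding from the sample instead of the population.
        "approximation_error": sample_mean - population_mean,
        # Variability of the phenomenon, estimated from the sample.
        "sample_variance": statistics.variance(sample),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Running it with different `seed` values gives a different approximation error each time, while the sample variance stays close to the variance of the hypothetical population. A larger `sample_size` shrinks the approximation error on average, but does nothing to reduce the intrinsic variability.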
Once this general framework is established, the objective of the study or analysis, together with the means of obtaining data and/or the data already available, determines the most appropriate method, subject to various technical and theoretical constraints.
Data Analysis – How to Do It?
Without the theoretical background, it is easy to apply an inappropriate or mathematically invalid method without realizing it. Two types of approaches exist when conducting a statistical study:
- Classical approach – Once the working hypotheses have been stated, a so-called experimental design approach is used to determine which data should be collected, in what quantity, with which controls, and how to limit bias and optimize the decision. The data are then collected as part of the study. This case typically concerns R&D or regulatory studies.
- More opportunistic approach – This is typically the approach of Data Mining, which can now be applied to Big Data. In this case, we can exploit the mass of CRM customer data, the Facebook profile information of Internet users accessing our services, or accounting data to better define purchasing behaviour and customer expectations. With this approach, the data already exist and were collected in an initial framework different from that of the study.
Data Analysis – With What?
In this context, the word “tool” can designate two different things:
- The mathematical method – It is briefly mentioned above, but is not the main subject of this article.
- The IT tool – It is characterized by its user interface, richness and flexibility, either for automatic calculations or for sophisticated data analysis.
Too often, technical aspects constrain the data analysis:
- Instead of having the right tool, business users have to adapt to the available tool.
- Or the company may have already invested in a powerful platform, potentially able to carry out all the data analysis, but one that requires additional programming that the users have not mastered.
- A third issue arises when the choice of method takes precedence over the problem to solve, for example “because we know it, because others do it this way, because we have always done it”. The analysis can then end up with a test result that does not necessarily correspond to the question we really wanted to explore.
Faced with its analytical needs, the company can choose from three major solutions:
- Outsource statistical processing.
- Use generic software or platforms, deployed and adapted to internal needs.
- Develop “tailor-made” tools internally, or have them developed by a service provider.
This choice will have a strong impact on the level of statistical knowledge required of employees to make the best use of the tools put in place, and also on the scope of the results obtained.
Data Analysis – With Whom?
The importance of statistical knowledge for data analysis in a company has been highlighted in the previous sections. This knowledge covers the mathematical theory and its practical application through software tools. Many obstacles can explain why data mining is not used to its full potential, such as a lack of comfort with the underlying mathematical theory or the force of habit behind the routines already in place. A powerful new tool, requiring a significant investment, can remain unexploited.
These obstacles are even more dangerous in the context of a Big Data project because everything is bigger: the volume of data, the variety of their nature, and the speed of acquisition, all of which also determine the validity of the analysis. Blockages often come from a lack of training in these fields, for example in the mathematical tools and the software, or from the lack of tools that could compensate.
How can practices be changed to make the most of the company’s data? Today, a trainer can develop interactive training materials that allow trainees to experience for themselves the major concepts of statistics, their richness and their pitfalls, in a playful and practical way.