Every cook knows that the success of a culinary masterpiece lies largely in the ingredients. It is difficult to prepare an excellent meal if the ingredients are bad or poorly combined. The same applies to data analysis. If the data is incomplete, inaccurate, or unrelated to the problem to be solved, it will be difficult, or even impossible, to build a useful model.
For example, if a customer value model assigns a low score to some profitable customers because online transactions or special orders are not taken into account, the company risks losing some of its best customers. The effectiveness of a data analysis model therefore depends directly on the quality of the data. In other words, you cannot cook a good meal with bad ingredients.
“From being prepared to handle the sheer volume of information to knowing how to make the most of it, companies need to be ready before they start collecting. To help you find the right tools and build the right processes” (Forbes)
Data analysis uses statistical algorithms and machine learning to find information that can help solve the everyday problems companies face. When users analyze data, they typically apply algorithms such as neural networks, decision trees, and other statistical techniques that search for trends in the data. While these algorithms are an important part of data analysis, they will look for trends in any data, irrespective of whether that data actually represents the behaviours and trends the analyst is trying to model. For this reason, data preparation is one of the most critical steps in data analysis, and yet it is often one of the most neglected.
The first step in data preparation is to collect data on the problem to be solved. If a user has a data lake, the process is considerably simplified. Conversely, if the data is stored in various locations, it is necessary to explore several sources in order to identify the data available to solve the problem.
As soon as the data to be analyzed has been identified, it should be integrated, evaluated, and, where necessary, transformed so that it is conceptually valid, consistent, and suitable for statistical analysis. For example, if the data comes from different sources, many discrepancies in formats and definitions will have to be resolved.
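To make the integration step concrete, here is a minimal sketch in Python of reconciling two sources that describe the same kind of transaction differently. The field names, date formats, and units are hypothetical and chosen only for illustration; a real project would map whatever its own sources actually contain.

```python
from datetime import datetime

# Hypothetical records: the two sources use different field names,
# date formats, identifier casing, and amount units.
store_rows = [{"customer": "C001", "date": "2023-03-15", "amount_eur": 120.50}]
online_rows = [{"cust_id": "c001", "order_date": "15/03/2023", "amount_cents": 4599}]

def from_store(row):
    """Map a store record onto the common schema."""
    return {
        "customer_id": row["customer"].upper(),
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        "amount": float(row["amount_eur"]),
        "channel": "store",
    }

def from_online(row):
    """Map an online record onto the common schema (cents become euros)."""
    return {
        "customer_id": row["cust_id"].upper(),
        "date": datetime.strptime(row["order_date"], "%d/%m/%Y").date(),
        "amount": row["amount_cents"] / 100,
        "channel": "online",
    }

# One list with a single schema, ready for evaluation and analysis.
transactions = [from_store(r) for r in store_rows] + [from_online(r) for r in online_rows]
```

Once both sources share one schema, the online orders that a customer value model might otherwise miss sit next to the in-store purchases of the same customer.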
“Whether gathering data on the front end or making big decisions, every single person in your organization must buy in to the value analytics brings. If not, you run two major risks. First, you could end up with dirty data, which is worthless when it comes to making good, solid business decisions. Second, you could amass tons of amazing data insights that are never utilized by your executive teams.” (Daniel Newman)
Even if a user is lucky enough to have a data lake, the data it contains will probably not be suitable as is for the intended analysis. It is then necessary to isolate and prepare the data for the model. This means working with analysts and data experts to define the elements needed to build the model.
For each variable, it is important to decide whether to use all of the data or only a subset. It is also necessary to decide how to handle outliers (values that deviate markedly from the rest): either exclude them or build the model around them. For example, if the goal is to predict attendance and revenue for sporting events, it will usually make sense to exclude abnormal attendance figures caused by exceptional circumstances, such as a transport strike. Conversely, in fraud detection it may be relevant to focus on the outliers, since they may represent fraudulent transactions.
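As an illustration of an outlier strategy, the sketch below uses the interquartile range (IQR) rule, which is one common convention among several and not the only possible choice. The attendance figures and the multiplier `k = 1.5` are assumptions made for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Separate values into (inliers, outliers) using the IQR rule."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Hypothetical attendance per match; one match fell on a transport strike.
attendance = [18200, 17950, 18400, 18100, 17800, 18300, 4200, 18050]
inliers, outliers = split_outliers(attendance)
```

For an attendance forecast, the model would be trained on `inliers`. For fraud detection, the same split would be used the other way round, with `outliers` as the records to examine.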
“Data visualization allows Big Data to unleash its true impact…Data visualization is the necessary ingredient in bringing the power of Big Data to the mainstream.” (Phil Simon)
Once the data is selected, it should be analyzed using descriptive statistics and visualization techniques to identify quality problems and better understand the characteristics of the data. This can bring data quality issues to light, such as missing values that can impair the integrity of any analysis model. The problems identified must then be corrected or compensated for. For example, if values are missing, the best method for discarding or replacing them should be determined. Some data analysis techniques can estimate missing values on the basis of other measured values.
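A minimal sketch of this profiling step, assuming missing values are represented as `None`: it summarizes a variable and then fills the gaps with the median. Median imputation is only the simplest replacement strategy; estimating a missing value from other measured variables, as mentioned above, would require a model and is not shown here. The sample values are invented.

```python
import statistics

def describe(values):
    """Basic descriptive statistics, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }

def impute_median(values):
    """Replace each missing entry with the median of the observed ones."""
    med = statistics.median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

# Hypothetical customer ages with two missing entries.
ages = [34, 41, None, 29, 52, None, 38]
summary = describe(ages)
ages_filled = impute_median(ages)
```

Looking at `summary["missing"]` relative to `summary["count"]` helps decide whether replacing values is reasonable or whether the variable is too incomplete to use.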
Many techniques can be used to obtain better models. It may be necessary to create "derived" variables, to replace missing values, or to use aggregation or data reduction techniques, and to look for the aggregates or new analytical variables that produce the best model. For example, when preparing customer data for a new loan marketing program, the ratio of debt to income may be a better indicator than income or debt alone.
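The debt-to-income example can be sketched as follows. The customer records and field names are hypothetical, and treating a zero income as "ratio undefined" is an assumption made for this sketch; another project might handle that case differently.

```python
def add_debt_to_income(rows):
    """Return copies of the rows with a derived 'debt_to_income' field."""
    enriched = []
    for row in rows:
        income = row["income"]
        # The ratio is undefined when income is zero, so mark it as missing.
        ratio = row["debt"] / income if income else None
        enriched.append({**row, "debt_to_income": ratio})
    return enriched

# Hypothetical customers: B earns twice as much as A but is far more indebted.
customers = [
    {"id": "A", "income": 60000, "debt": 15000},
    {"id": "B", "income": 120000, "debt": 90000},
    {"id": "C", "income": 0, "debt": 5000},
]
enriched = add_debt_to_income(customers)
```

Here customer B looks stronger than A on income alone, while the derived ratio shows B carrying three times the relative debt, which is the kind of signal neither raw variable exposes by itself.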
Finally, the data must be transformed into a format suited to the analysis algorithms. Many data analysis algorithms require categorical (non-numerical) data to be converted into numerical data, or numerical data to be rescaled into a particular range. Some statistical algorithms and techniques also require numeric data to have specific properties that may not be present before transformation. Such variables may need to be re-encoded or transformed to suit the chosen techniques. The value of the data thus depends directly on the time and care devoted to preparing it for a particular analytical problem.
As a cook would say when preparing a dish, the quality of the final result depends largely on the ingredients. It is equally clear, however, that the processes described above can only be implemented successfully by competent teams.
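Two of the most common transformations of this kind are one-hot encoding for categorical data and min-max scaling for numeric data. The sketch below shows both in plain Python; the channel names and amounts are invented, and these two techniques are examples rather than the only options.

```python
def one_hot(values):
    """Encode categories as 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

def min_max_scale(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical order channel and order amount columns.
channels = ["web", "store", "web", "phone"]
amounts = [10, 20, 30]

categories, channel_vectors = one_hot(channels)
amounts_scaled = min_max_scale(amounts)
```

The column order returned in `categories` should be stored alongside the model so that new data is encoded in the same way.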