An inevitable part of Big Data, “bad data” are false friends: they hide in vast data repositories and must be guarded against. Incomplete, false, or simply unusable, such data should be identified before it interferes with decisions.
What is “bad data”?
Statistically speaking, bad data includes:
- Poor quality – because the measuring instrument is faulty or inaccurate, or because records are missing for many individuals. This calls for handling outliers and missing data, which can complicate the analysis.
- Poor relevance – data unrelated to the question at hand. When many predictors are independent of the target to be predicted, they weigh down model exploration and training without adding any predictive value.
- Data redundancy – including a bit of everything in the models “just in case” can cause trouble. Some algorithms, particularly regressions used to predict a continuous value, make assumptions about the independence of the predictors and can malfunction if this assumption is violated too strongly.
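The first two points above can be screened for programmatically. As a minimal sketch, assuming a small hypothetical housing dataset (the column names and values are invented for illustration), one might count missing records and flag outliers with the interquartile-range rule:

```python
import numpy as np
import pandas as pd

# Hypothetical housing records: 'area' contains an implausible entry
# (a likely data-entry error), 'bedrooms' has missing values.
df = pd.DataFrame({
    "area":     [1200, 1450, 980, 60000, 1100],
    "bedrooms": [3, np.nan, 2, 4, np.nan],
})

# Missing data: count NaNs per column before any modeling.
missing = df.isna().sum()

# Outliers: flag values outside 1.5 * IQR of the quartiles
# (a simple, robust rule; domain knowledge should confirm the flags).
q1, q3 = df["area"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["area"] < q1 - 1.5 * iqr) | (df["area"] > q3 + 1.5 * iqr)
outliers = df.loc[mask, "area"]

print(missing)
print(outliers)
```

The IQR rule is only one of many conventions; with very small samples, z-score thresholds are unreliable, which is why a quartile-based fence is used here.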
This last point is the most difficult to manage because it is not easy to detect. Redundant data means that several variables carry the same information, which often shows up as a high correlation between them.
A high positive or negative correlation between predictors and the target is a very good thing; it is even the basis that makes it possible to predict the target from the predictors. A correlation between the predictors themselves, however, is a problem. For example, does your house have a higher value because it is larger, or because it has more rooms and bathrooms? These variables are necessarily related (you will never find five bedrooms and four bathrooms in a 600-square-foot unit), but which one prevails? What work should you prioritize to increase the value of your house?
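A correlation matrix makes this distinction visible. The sketch below uses synthetic housing data (all numbers are invented assumptions) in which the number of bedrooms grows with the area, so the two predictors are redundant even though both correlate well with the price:

```python
import numpy as np
import pandas as pd

# Synthetic housing data: bedrooms is roughly proportional to area,
# so the two predictors carry largely the same information.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)                        # square feet
bedrooms = np.round(area / 700 + rng.normal(0, 0.3, size=200))
price = 100 * area + 5000 * bedrooms + rng.normal(0, 20000, size=200)

df = pd.DataFrame({"area": area, "bedrooms": bedrooms, "price": price})

# Predictor-target correlations: high values are what prediction exploits.
# Predictor-predictor correlations: high values warn of multicollinearity.
corr = df.corr()
print(corr.round(2))
```

When the predictor-predictor entry is close to 1, a regression cannot tell which variable “prevails”; techniques such as dropping one variable, combining them, or computing variance inflation factors are the usual remedies.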
Why is “bad data” an issue?
“The right information to the right person at the right time to make the right decisions.” This famous statement by Michael Porter on business intelligence is more relevant than ever today. Looking across all internal and external data repositories, structured or not, more and more voices warn us against the risks associated with bad data. Many articles remind us that, during an analysis, a small initial misinterpretation can generate a large difference in the result. Acting on conclusions derived from bad data can lead the company in the wrong direction.
Faced with the exponential growth of data, purely and simply eliminating bad data seems illusory. On the one hand, any data may look bad when viewed through the wrong prism. On the other hand, bad data is sometimes BAD (Best Available Data) without anyone realizing it. In other words, such data may simply need to be enriched and verified by a data scientist to become the best available information.