The paradigm (or buzzword, depending on perspective) of Big Data has been formulated around the four Vs:
- Volume – Vast amount of data generated every second
- Velocity – Speed at which new data is generated and the speed at which data moves around
- Variety – The different types of data we can now use
- Veracity – Messiness or trustworthiness of the data
While scientists and engineers around the world are working to tame the four Vs of this veritable tsunami of data, it is important to consider the less obvious new scientific questions that could structure research in this area. There are six main “big challenges” for research on Big Data.
In developing a Big Data solution to a given problem, the data scientist is much like an alchemist. Data preparation, selection and parameterization of the data processing method (or, more often, a succession of methods), choice and parameterization of results visualization tools, interpretation of results and their uncertainties… there are many complex and entangled steps where the data scientist’s knowledge and expertise come into play.
The invention of a notation for describing chemical equations was a real boost to the development of modern chemistry, just as the invention of modern algebraic notation was a real boost to the development of mathematics.
Thinking about Big Data operations, should we invent an algebraic notation system to simplify the programming of sequences of Big Data processing steps?
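Such a notation might, in its most minimal form, resemble ordinary function composition: each processing step is a function, and a pipeline is their composition. The sketch below is purely illustrative; the step names and the `pipeline` helper are assumptions for the example, not an existing system:

```python
from functools import reduce

def pipeline(*steps):
    """Compose processing steps left to right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

# Illustrative steps on a toy stream of raw text records.
clean = lambda rows: [r.strip() for r in rows]            # normalize raw input
parse = lambda rows: [int(r) for r in rows if r.isdigit()]  # keep valid numbers
total = sum                                                # aggregate

process = pipeline(clean, parse, total)
print(process([" 1", "2 ", "x", "3"]))  # → 6
```

The appeal of an algebraic view is that pipelines compose like expressions: `pipeline(a, pipeline(b, c))` behaves like `pipeline(a, b, c)`, which is exactly the kind of law a notation system could exploit.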
Data Mining specialists know that the accumulation of data is not sufficient to extract useful information. The question of separating “signal” from “noise”, “information” from “chance”, in a large mass of data remains wide open in the scientific world.
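The difficulty can be illustrated with purely synthetic data: among many unrelated random series, some pairs will correlate strongly by chance alone, producing apparent “information” where there is none. Everything below (the series, their sizes) is a fabricated example:

```python
import random

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 100 series of pure noise, with no real relationship between any of them.
n_series, n_points = 100, 40
series = [[random.gauss(0, 1) for _ in range(n_points)]
          for _ in range(n_series)]

# The strongest pairwise correlation found among ~5,000 pairs of pure noise.
best = max(abs(pearson(series[i], series[j]))
           for i in range(n_series) for j in range(i + 1, n_series))
print(f"strongest 'signal' found in pure noise: r = {best:.2f}")
```

With enough comparisons, a seemingly strong correlation emerges from noise alone, which is why naive pattern mining over massive data produces spurious discoveries.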
If economic or political decisions are made on the basis of Big Data analysis, techniques of digital spam or scam will appear that aim to massively generate false data in order to manipulate aggregated information or bias decisions. For example, a generator of fake tweets designed to create a buzz that Twitter-analysis robots mistake for an organic trend, or a browser plugin that issues fake Google queries to obscure what the user actually searches for, undermining the data model behind Google and AdSense.
The fight against “spam data” will soon become necessary to preserve the value of Big Data.
Furthermore, the temporal dimension is poorly handled in Big Data. For example, extracting real-time information from a wide stream of data or events (“fast data”) remains difficult. The expected explosion of sensors from the Internet of Things (IoT) will generate data streams over the Internet that grow far faster than our capacity to store data. We must therefore choose which raw data to keep, and which data to forget.
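One classic answer to the keep-or-forget question is reservoir sampling, which maintains a fixed-size uniform sample of an unbounded stream while seeing each element only once. The sketch below is a minimal illustration; the stream and the sample size are assumptions for the example:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(i + 1)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item      # evict a random earlier item
    return reservoir

# A stand-in for an unbounded sensor stream: we retain only 10 of 100,000 readings.
sample = reservoir_sample(range(100_000), k=10)
print(sample)
```

The point is that the decision about what to forget is made on the fly, with bounded memory, rather than after the full stream has been stored.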
Big Data and Data Mining technologies have historically dealt with data types that have simple structures: arrays of numbers (e.g., age, salary, number of telephone calls…), attribute tables (male or female, city of residence…), graphs (who is related to whom?).
But the problems of very large datasets cover more data types, in a much wider range of structures. For example, the data could be images, videos, large corpora of books, representations of geo-spatial information within Geographic Information Systems (GIS), representations of the physical world at sub-atomic scale, representations of the living world (from the folding of a protein to the dynamics of the entire biosphere), or representations of sophisticated technological objects…
The explosion of Big Data effectively quantifies a growing part of our physical world, feeding models and decisions in a growing number of areas. Big Data enables outlier analysis of physical phenomena such as weather or astronomy, extraction of knowledge from Web document processing, analysis of interactions based on social graphs… Nevertheless, the great mirage of the Internet, amplified by Big Data, creates the risk of believing that the real is reduced to what is represented in the form of data.
The “mirage” of Big Data should not blind us to what, in the real world, has not yet been quantified.
A final point of focus is on the many human factors related to Big Data. The mechanisms of fascination and rejection triggered by these new technologies inevitably bring with them a procession of myths and fears.
Faced with the limited capacity of the human brain to represent large amounts of information, new problems arise concerning the visualization and representation of massive data.
Our cognitive mechanisms have been built over millions of years to extract maximum information from a very small amount of data, and our intuition will be put to the test by these new capabilities for massive data analysis.
Statistical science in the eighteenth century showed the limits of our intuition; whether the capabilities of Big Data analysis can further enrich our cognitive mechanisms remains an open question.