Big Data and Small Information: How Do They Come Together?
We have heard it said here that Big Data = 4V (Volume, Velocity, Variety, Variability), a definition by characteristics that does not stand up to thorough examination because these characteristics are prone to subjectivity. We have also read that Big Data is a class of data that requires an innovative and original approach to processing, supported by dedicated infrastructure.
Big Data: a buzzword and a not-so-new idea
To demystify these two definitions a little, take the case of a mobile phone operator at the dawn of the 21st century. This operator is active in an emerging market, in a country that lacks fixed telephone network infrastructure (fiber core network and copper-pair switched network). Finally, this operator works almost exclusively with prepaid offers.
Before anyone had heard the term Big Data, many companies already handled thousands or millions of call records, user account transactions or support center calls per month. Each of these events was recorded in a relational database (or written directly to a log file), stored and analyzed.
Nobody was talking about Big Data, yet the problems of data collection, storage, quality and analysis were all present, and they led to innovative solutions for processing and analyzing this data.
An even older problem can be considered Big Data, given the innovative approach adopted to solve it. The US Census of 1890 had to collect and store more than 50 million individual records with no fewer than 80 variables. This led to the invention of the Hollerith tabulating machine and, indirectly, to the founding of IBM. Which goes to show that every era has had, has, and will have its Big Data moment.
Big Data vs. Business Intelligence
Between Business Intelligence and Big Data, only scale and time have changed: scale has expanded and time has accelerated. The events are still there, but instead of a million per month the order of magnitude is now 100 million, and the time scale is on the order of a second. The 50 million individual records have become 500 million directed graphs.
And, again, this concerns only a few high-traffic players with significant needs for real-time data analysis. One thinks of the well-known social networking sites, but also of online retailers whose recommendation engines run on Business Intelligence in near real time.
What those who steer and decide need, whether they are business leaders, ocean-going captains or space mission managers, lies not in the data but in the information it contains, and that information must come in the right amount. Too little means uncertainty. Too much leads to indecision. The information must also be accurate, and it must reach the decision-maker at the appropriate time.
The acceleration of trade induced by increased interconnection and more efficient transport technology has further reduced the acceptable delay between the occurrence of an event and its detection and presentation to the decision-maker.
This delay is known as the latency of the event-detection-presentation cycle. While most historical decision chains were satisfied with a 24-hour latency, such a delay is no longer acceptable, because the characteristic period of many business processes is on the order of a minute or even a second, or a millisecond in the case of fully automated processes.
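To make the notion of latency concrete, here is a minimal sketch in Python that measures the delay across one event-detection-presentation cycle and compares it with a latency budget. The class name `CycleTimestamps`, the function `within_budget`, the budget table and the sample timestamps are all hypothetical, chosen only to mirror the orders of magnitude mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CycleTimestamps:
    """Timestamps of one event-detection-presentation cycle (hypothetical model)."""
    occurred: datetime   # when the event happened
    detected: datetime   # when the system noticed it
    presented: datetime  # when the decision-maker could see it

    def latency(self) -> timedelta:
        # Latency of the whole cycle: from occurrence to presentation.
        return self.presented - self.occurred


def within_budget(cycle: CycleTimestamps, budget: timedelta) -> bool:
    """Return True if the cycle completed within the allowed latency."""
    return cycle.latency() <= budget


# Illustrative budgets taken from the orders of magnitude in the text.
BUDGETS = {
    "historical": timedelta(hours=24),
    "business": timedelta(seconds=1),
    "automated": timedelta(milliseconds=1),
}

# Made-up example: detected after 200 ms, presented after 750 ms.
t0 = datetime(2024, 1, 1, 12, 0, 0)
cycle = CycleTimestamps(
    occurred=t0,
    detected=t0 + timedelta(milliseconds=200),
    presented=t0 + timedelta(milliseconds=750),
)

for name, budget in BUDGETS.items():
    print(name, within_budget(cycle, budget))
```

In this made-up case the cycle fits a 24-hour or one-second budget but not a one-millisecond budget, which is the point of the paragraph above: the same pipeline can be adequate for one decision chain and far too slow for another.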
That said, this remains a classic problem of analytical extraction and synthetic presentation of information.
What is no longer classic is the multiplicity of sources, which are often external, and the growth of the mass of data to be interpreted. It is this growth, and the need for scalability it imposes, that characterizes what is called Big Data.
Big Data is an approach to Business Intelligence characterized by:
- A horizontal integration of internal and external data in varied and changing formats,
- A distributed architecture for data storage and processing,
- A very low latency of the event-detection-presentation cycle.
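The first characteristic in the list above, integrating internal and external data that arrive in different formats, can be sketched as follows. This is only an illustration under stated assumptions: the field names (`user_id`, `type`, `ts`, `customer`, `event`, `when`), the two sample sources and the common record layout are invented for the example and do not come from any particular system.

```python
import csv
import io
import json
from datetime import datetime, timezone


def from_internal_json(line: str) -> dict:
    """Map one internal JSON event (hypothetical schema) to a common record."""
    raw = json.loads(line)
    return {
        "source": "internal",
        "user": str(raw["user_id"]),
        "kind": raw["type"],
        # Internal events carry a Unix timestamp in seconds.
        "at": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    }


def from_external_csv(text: str) -> list:
    """Map external CSV rows (hypothetical schema) to common records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "source": "external",
            "user": row["customer"],
            "kind": row["event"],
            # External events carry an ISO 8601 timestamp with offset.
            "at": datetime.fromisoformat(row["when"]),
        }
        for row in reader
    ]


def integrate(json_lines: list, csv_text: str) -> list:
    """Merge both sources into a single timeline ordered by event time."""
    events = [from_internal_json(line) for line in json_lines]
    events.extend(from_external_csv(csv_text))
    return sorted(events, key=lambda e: e["at"])


# Made-up sample data for the two sources.
JSON_LINES = ['{"user_id": 42, "type": "top_up", "ts": 1700000000}']
CSV_TEXT = "customer,event,when\n42,support_call,2023-01-01T00:00:00+00:00\n"

timeline = integrate(JSON_LINES, CSV_TEXT)
for event in timeline:
    print(event["at"].isoformat(), event["source"], event["kind"])
```

Each source keeps its own parser, so when an external format changes only one function has to follow. In a real deployment this step would run on the distributed architecture named in the second bullet rather than in a single process.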
The difference between noise and data
If a signal is structured, it is data and it contains information (whether that information is relevant is another story). Otherwise, it is noise. Of course, a signal whose structure changes remains structured, and detecting these changes of structure is what makes all the difference. Any element of chance or irregularity in the structure of a signal increases its noise. This is surely the meaning to give, in the end, to the expression “unstructured data”: data that is noisier than average.
We were told that video signals, facsimiles of documents and scanned photographs are unstructured data. This is not entirely correct. These signals come as files or streams that are perfectly structured, since a computer can decode them. The confusion may stem from the limited ability of machines to extract relevant information from this content.
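One rough way to see how chance erodes structure is to measure how well a signal compresses. The sketch below is an illustration, not a definition of noise: it uses the compression ratio from Python's standard `zlib` module as a proxy for statistical regularity, and the pattern `EVENT;OK;`, the helper names and the noise fractions are arbitrary choices made for the example.

```python
import random
import zlib


def noisy_signal(pattern: bytes, repeats: int, noise_fraction: float, seed: int = 0) -> bytes:
    """Repeat a pattern, replacing each byte with a random one with probability noise_fraction."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    clean = pattern * repeats
    return bytes(
        rng.getrandbits(8) if rng.random() < noise_fraction else b
        for b in clean
    )


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more regular."""
    return len(zlib.compress(data, 9)) / len(data)


# A perfectly regular signal, a half-random one, and a fully random one.
ratios = {
    fraction: compression_ratio(noisy_signal(b"EVENT;OK;", 2000, fraction))
    for fraction in (0.0, 0.5, 1.0)
}

for fraction, ratio in ratios.items():
    print(f"noise fraction {fraction:.1f}: compression ratio {ratio:.3f}")
```

The regular signal compresses to a tiny fraction of its size, the fully random one does not compress at all, and the mixed one falls in between. The proxy has an obvious limit: an already compressed file, such as an encoded video, looks random to `zlib` even though a decoder can read its structure perfectly well.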
Let us take another case to illustrate this point of view. A CCTV camera captures images and encodes them in a video signal that can be recorded and played back on any computer.
A human who looks at the images can identify faces, situations or places. This is a process of data analysis and information processing. It is possible only because the images convey information that humans know how to interpret, thanks to the ability of the brain's visual areas to recognize shapes and faces and the ability of the parietal cortex to attach labels to them. Face recognition techniques show that such a process can be automated.
Something other than a needle in a haystack
What also characterizes Big Data is that, in the end, all data are treated as equally important, regardless of their format.
Why? Because the time when we knew what information we were looking for in the data is gone. The complexity of the data and its ever faster growing mass mean that the data contain new, unsuspected information. It is the metaphor of the needle in the haystack, except that our haystack hides something other than a needle, and we do not know what it is.
While the terms volume, velocity, variety and variability are insufficient to define what Big Data is, they nevertheless remain objective constraints that force corporate BI practitioners to define new approaches for the collection, storage and analysis of data.