For many years, organizations have transformed data from their production systems into steering indicators, and they have become experts in dashboards and other reporting tools distributed at all hierarchical levels.
An explosion of data volume that is just beginning
With the advent of social networks, the widespread use of mobile devices, and the emergence of the IoT (Internet of Things), data has become abundant, rich, and very diverse. This phenomenon has been theorized under the name of Big Data. Organizations need to manage this new mass of data properly if they want to stay in the race and not be overwhelmed.
This explosion of data volume is only in its infancy. According to IDC, 200 billion connected devices will emerge, and by 2020 every person on earth will generate 1.7 megabytes of data per second. I like to point out that the 180 zettabytes of digital data estimated for the 2025 horizon is still “a little” less than the number of stars in our observable universe, which makes us feel very small despite all our data production!
Big Data without intelligence is not worth much
As complexity never comes alone, technical mastery of these data is not enough. To transform them into value, we will have to “make them talk”. To do so, these huge data sources must be channeled and then “tamed”: visualized (data visualization), interpreted (descriptive analytics), and modeled (predictive analytics) with different approaches (statistics, machine learning…). The “smart” overlay of Big Data has emerged: it is Data Science.
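To make the distinction concrete, here is a minimal Python sketch, with entirely hypothetical sales figures, of the two sides of analytics mentioned above: a descriptive summary of past data, and a simple predictive trend model fitted on the same data.

```python
import statistics

# Hypothetical monthly sales figures for illustration only.
sales = [120, 135, 150, 160, 172, 185]

# Descriptive analytics: summarize what has already happened.
mean_sales = statistics.mean(sales)

# Predictive analytics: fit a simple least-squares trend on the month index.
months = list(range(len(sales)))
mx, my = statistics.mean(months), mean_sales
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / \
        sum((x - mx) ** 2 for x in months)
intercept = my - slope * mx

# Use the fitted trend to forecast the next month.
forecast_next = intercept + slope * len(sales)
print(f"average={mean_sales:.1f}, forecast for month {len(sales)}={forecast_next:.1f}")
```

Real Data Science replaces this toy trend line with richer statistical and machine-learning models, but the movement is the same: describe the past, then model it to anticipate the future.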
This “transformation into value” of data will both become more complex and multiply the capacity for value creation. Value creation will first of all be achieved by identifying data that can improve, grow, or reinvent the activity of the organization.
Identification and integration of new data sources
Identifying new sources of data is fundamental today in a data-centric approach. Of course, the complexities of data collection, storage, diversity, and governance need to be addressed.
In a pragmatic approach, the first step will be to leverage the data that already exists in the organization’s production systems. We will then direct the collection toward external data (Open Data, data providers, RSS feeds, social networks…) in order to enrich analyses, deepen the understanding of behaviours, and detect changes.
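As an illustration of channeling one such external source, here is a minimal Python sketch that flattens an RSS feed into records. The feed payload and field names are hypothetical; a real pipeline would fetch the XML from the feed’s URL (for example with `urllib.request`) before landing the records for analysis.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS payload; in practice this would be fetched from an
# external feed URL before being integrated into the analyses.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Open Data releases</title>
    <item><title>Regional demographics 2019</title><pubDate>Mon, 04 Mar 2019</pubDate></item>
    <item><title>Transport usage statistics</title><pubDate>Tue, 05 Mar 2019</pubDate></item>
  </channel>
</rss>"""

def parse_feed(xml_text):
    """Turn an RSS feed into flat records, ready to enrich internal data."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "published": item.findtext("pubDate")}
        for item in root.iter("item")
    ]

records = parse_feed(rss)
print(records)
```

Once flattened this way, external records can be cross-referenced with internal production data.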
Business issues (quality improvement, device reliability, security, etc.) also lead organizations to collect data from sensors and connected objects (IoT), produced by people and machines. Because of their heterogeneous sources and formats, these data are often stored in “silo” systems, and the whole interest lies in “breaking” these silos to better exploit and value them.
Even when channeling huge volumes of data is not at the center of the organization’s business model, simply cross-referencing external data sources adds value to the company’s analytics.
The Data Lake enables democracy and avoids anarchy
The first reflex, when it comes to manipulating external data, is often to reconcile the different flows directly in the analysis tools, without implementing a centralized solution. Indeed, analysis tools are easy to use, powerful thanks to their in-memory computing capacity, and already able to manage a certain volume of data. But this mode of operation, although it may be an interesting stepping stone for building a culture of external data, will reach its limits fairly quickly.
It does not allow the sharing of data sources that could be valuable for the entire organization. Above all, if it is badly governed, data repositories will appear everywhere, along with a very anarchic mode of consumption.
We are regularly confronted with situations where there are multiple direct connections from analytical systems to operational systems. This mode of operation degrades their performance and can even trigger denials of service.
Setting up a Data Lake is recommended
In order to democratize the use of data at different levels of the organization, we recommend setting up a Data Lake. This concept gives organizations a way to implement a storage platform for structured and unstructured data from various internal and external sources. To be integrated, these data must be qualified both for their reliability and for their added value to the organization.
A question that comes up very frequently is the cohabitation of the Data Lake with the enterprise Data Warehouse. The answer is clear: yes, the two components must coexist, because they have two different roles:
- The Data Warehouse is used to produce the shared metrics repository for controlling the organization.
- The Data Lake is the receptacle of all qualified data for all types of consumption.
The Data Lake will therefore become the source system of the Data Warehouse. It will also allow the management and integration of streaming data, which can open up “real-time” use of the data.
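This feeding relationship can be sketched minimally, with hypothetical sale events: raw records land in the Data Lake as-is, and an aggregation step derives the shared metric that the Data Warehouse serves to steering dashboards.

```python
from collections import defaultdict

# Hypothetical raw events as they land in the Data Lake (one record per sale).
lake_events = [
    {"date": "2019-03-01", "region": "North", "amount": 120.0},
    {"date": "2019-03-01", "region": "South", "amount": 80.0},
    {"date": "2019-03-02", "region": "North", "amount": 95.0},
]

def load_warehouse(events):
    """Derive the shared metric (revenue per region) consumed by dashboards."""
    metrics = defaultdict(float)
    for event in events:
        metrics[event["region"]] += event["amount"]
    return dict(metrics)

warehouse = load_warehouse(lake_events)
print(warehouse)
```

The lake keeps every raw, qualified record for all types of consumption; the warehouse keeps only the agreed, shared indicators derived from them.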
We will also see, in a future project, the importance of maintaining the organization’s centralized repository, so as to avoid ending up with as many indicators as there are data consumers!
Which Data Lake to implement?
There is no absolute truth about the technological direction a Data Lake should take. However, we can mention three types of building blocks that can interconnect and coexist:
- distributed Big Data platforms such as Spark and/or Hadoop, which natively manage varied and voluminous data,
- NoSQL databases, which can manage logs and/or semi-structured data,
- and finally, for unstructured data in large volumes, indexing engines with semantic analysis of the NLP (Natural Language Processing) type.
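To give a flavor of the indexing brick, here is a toy inverted index in Python over hypothetical documents. A real engine adds tokenization, NLP-based semantic analysis, and relevance scoring; this sketch only shows the core idea of mapping terms to the documents that contain them.

```python
from collections import defaultdict

# Toy corpus of unstructured documents (ids and texts are hypothetical).
documents = {
    1: "sensor temperature alert on production line",
    2: "customer complaint about delivery delay",
    3: "temperature drift detected by IoT sensor",
}

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index(documents)
print(sorted(index["sensor"]))  # documents mentioning "sensor"
```

Querying the index instead of scanning every document is what makes large volumes of unstructured text exploitable.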
To ensure optimal maintainability and exploitability of the Data Lake, the implementation of a data tracking and traceability system is recommended… Otherwise the Data Lake turns rather quickly into a “Data Swamp”.
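As a sketch of such traceability, the hypothetical `ingest` function below records, for every dataset landing in the lake, its source, its ingestion timestamp, and a content checksum in a catalog that can later be audited. Real implementations use dedicated metadata and lineage tooling, but the principle is the same.

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest(payload, source, catalog):
    """Store lineage metadata for a dataset landing in the lake."""
    raw = json.dumps(payload, sort_keys=True).encode()
    entry = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "checksum": hashlib.sha256(raw).hexdigest(),
        "rows": len(payload),
    }
    catalog.append(entry)
    return entry

catalog = []  # the traceability register, queried to audit any dataset
ingest([{"device": "D1", "temp": 21.5}], source="iot/temperature", catalog=catalog)
print(catalog[0]["source"], catalog[0]["rows"])
```

With such a register, every dataset in the lake can answer “where did you come from, when, and has your content changed?”, which is exactly what a swamp cannot.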
Data Lake: direct data access for advanced users
Unlike the historical Operational Data Store, which was a technical preparation sublayer feeding the Data Warehouse, the Data Lake will be open to some advanced users.
In order to transform and normalize the Data Lake’s data, it is possible to connect data preparation tools. Their aim is to qualify data quality “technically speaking”: excluding ill-formed values and outliers, as well as identifying possible interactions and cross-references between data.
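A minimal Python sketch of this technical qualification, on hypothetical sensor readings: ill-formed entries are discarded first, then a simple robust rule (distance to the median) filters out the outliers. Dedicated data preparation tools do this at scale with richer rules.

```python
import statistics

# Hypothetical raw sensor readings from the Data Lake: some entries are
# ill-formed, and one is a clear outlier.
raw = ["21.5", "22.0", "n/a", "21.8", "999.0", "22.3", ""]

def prepare(values, k=5.0):
    """Keep numeric values, then drop points far from the median (robust rule)."""
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except ValueError:
            continue  # exclude ill-formed entries
    med = statistics.median(numeric)
    mad = statistics.median(abs(v - med) for v in numeric)
    return [v for v in numeric if abs(v - med) <= k * mad]

clean = prepare(raw)
print(clean)  # the ill-formed entries and the outlier are gone
```

The median-based rule is deliberately robust: unlike a mean-and-standard-deviation filter, a single extreme value cannot inflate the threshold enough to hide itself.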
But above all, the Data Lake will provide self-service BI users with qualified data sources, which can be used to cross-reference already-calculated indicators with raw data sources. This mode of operation gives advanced users access to new data sources while preventing them from randomly plugging in every source they wish.
The Data Lake also allows the implementation of “Analytics Sandboxes”, potentially ephemeral, which give data scientists strong analytical capacity. This autonomy in the use of data, supplemented by the provision of qualified and reusable data sources, allows data scientists to focus on the core of their activity: the implementation of mathematical models that create value.
The Data Lake is the technological building block of the data strategy
Setting up the Data Lake is the most technical chapter of the data strategy, and it therefore requires a thorough architectural analysis. Its implementation can be approached in various ways, and the purpose of this publication is not to provide a technical implementation guide. However, the following frequently encountered pitfalls should be avoided:
- Making the “Data Lake” project a 100% technical project. While this component is an underlayer of the architecture, it becomes the company’s data repository, from which advanced users will directly draw data sources. Its implementation is therefore as much a business project as a technical one.
- Underestimating governance in its implementation and use. For example, direct access to the Data Lake should only be granted to advanced users, and its use corresponds to ad hoc needs that should not replace business reporting.
- Putting all the data in the Data Lake without qualification. It will very quickly turn into a swamp and become unusable.
- Believing that the Operational Data Store (ODS) is a Data Lake. The Data Lake must replace the ODS, not the other way around. The ODS is a technical-only sublayer; it is not made for direct use by the business, and it would very quickly become unusable and saturated with large volumes of data.