4 Essential Components and 6 Challenges of Big Data Analytics
The explosion of connected objects enables huge amounts of data to be collected, stored, and analyzed in real time. This Big Data creates challenges on many fronts: from managing very large volumes of data, to analyzing and cross-linking those volumes, to delivering fast analytics solutions.
Components for data mining
These unprecedented disruptions are strategic opportunities for organizations in general and for IT departments in particular. To seize them, it is necessary to acquire the approaches (technologies, processes, and skills) to collect, clean, pair, explore, analyze, and deliver knowledge on volumes of data that will grow exponentially as deployments of connected objects accelerate. In terms of tools, the essential components are:
- A central data repository: to pool corporate data, usage data, and digital journeys into a horizontally scalable, centralized Data Lake that supports exponential growth in volumes, uses, and latency constraints. This Data Lake must offer advanced means to search for data and must evolve rapidly so that new flows or new links can be absorbed at minimal cost.
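As an illustration, a Data Lake often absorbs a new flow simply by adding a new partition directory rather than migrating a schema. Below is a minimal sketch of such a source/date-partitioned layout; the directory convention and function names are illustrative assumptions, not the article's design:

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def write_to_lake(root: Path, source: str, day: date, records: list[dict]) -> Path:
    """Append raw records under a source/date partition, a common lake layout."""
    partition = root / source / f"dt={day.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-0000.json"
    with out.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # newline-delimited JSON, schemaless
    return out

# A brand-new flow only needs a new `source` directory: no schema migration.
lake = Path(tempfile.mkdtemp())
path = write_to_lake(lake, "sensor-events", date(2017, 3, 1),
                     [{"device": "d1", "temp": 21.5}])
print(path.relative_to(lake))
```

The point of the sketch is the cost model: integrating a new flow is a directory creation, not a database change request.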
- An agile laboratory: given access to these data repositories and equipped with adequate tools for exploration, visualization, and analysis, data-scientist teams must respond rapidly, and with maximum autonomy, to new needs expressed by the business lines. This requires the ability to understand business needs, design analytical workflows, manipulate data, develop mathematical models, and communicate results to the business in an intelligible form.
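A hypothetical example of lab-style self-service exploration: a quick, stdlib-only profiling pass over raw records, of the kind an analyst might run before any modeling (all names are illustrative):

```python
from statistics import mean, stdev

def profile(records: list[dict]) -> dict:
    """Per-field profile: count, missing values, and basic stats for numeric fields."""
    fields = {k for r in records for k in r}
    report = {}
    for f in sorted(fields):
        values = [r[f] for r in records if r.get(f) is not None]
        entry = {"count": len(values), "missing": len(records) - len(values)}
        numeric = [v for v in values if isinstance(v, (int, float))]
        if len(numeric) >= 2:  # stats only make sense with at least two points
            entry.update(mean=round(mean(numeric), 2), stdev=round(stdev(numeric), 2))
        report[f] = entry
    return report

sample = [{"temp": 20.0, "site": "A"}, {"temp": 24.0, "site": "B"}, {"site": "A"}]
print(profile(sample))
```

Autonomy here means the analyst gets a first read on quality and completeness without filing a ticket.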
- A platform for industrialization: whether as an output of the agile laboratory or in response to a new need, the data repository is intended to feed raw or transformed data into production (where it becomes subject to operational and administrative constraints), whether for reporting, forecasting, or interaction purposes.
- Data governance: every organization's core business will evolve toward providing data, both internally (centrally and locally) and externally (customers, suppliers, communities, open data…). This requires true data governance covering: life cycle, value, quality, calculation rules, ownership, accessibility, retention, anonymization, auditability, and documentation…
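One way to make those governance dimensions concrete is a catalog entry per dataset. The sketch below (all names, rules, and the policy check are hypothetical) records owner, quality rule, retention, and access, and enforces one simple rule at registration time:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One catalog entry; fields mirror the governance dimensions listed above."""
    name: str
    owner: str
    quality_rule: str          # e.g. a documented validation predicate
    retention_days: int
    contains_pii: bool
    accessible_to: list = field(default_factory=list)

catalog: dict[str, DatasetRecord] = {}

def register(rec: DatasetRecord) -> None:
    # A transversal rule the governance body can impose centrally:
    if rec.contains_pii and not rec.accessible_to:
        raise ValueError(f"{rec.name}: PII datasets need an explicit access list")
    catalog[rec.name] = rec

register(DatasetRecord("usage_events", "data-office", "non_null(device_id)",
                       retention_days=365, contains_pii=True,
                       accessible_to=["analytics-team"]))
print(sorted(catalog))  # ['usage_events']
```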
The difficulties of setting up an exploratory approach
The main pitfalls of this vision are in execution:
- Competencies: analysts must understand and interact with the business while also mastering data management and statistical tools, a rare combination of skills.
- Autonomy: for the exploratory and analytical phases, it is essential that analysts can carry out their work in the laboratory in full autonomy, in the spirit of “self-service BI”. They must therefore master the tools for analysis, visualization, exploration, and transformation, and know the data and metadata of a repository that is as exhaustive as possible.
- Speed: if IT lead times to integrate a new flow, for example, are too long, business users will quickly shun the Data Lake and create redundant data or processing elsewhere. It is essential that the repository be designed, implemented, and maintained in an agile way, with very short deployment times.
- Test and learn: analysts must build a relationship of trust with the business, because trust is indispensable during the inevitable iterations needed to find, by successive adjustments, the right balance between underfitting and overfitting, and between model readability and predictive performance.
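The underfitting/overfitting balance can be shown on toy data: a model that ignores the inputs underfits, one that memorizes the training set overfits (perfect on training data, poor on new data), and a simple least-squares line sits between them. Data and models below are illustrative only:

```python
from statistics import mean

# Toy data from an assumed underlying trend y ≈ 2x, with noise.
train = [(0, 0.1), (1, 2.3), (2, 3.8), (3, 6.2)]
test = [(1.5, 3.1), (2.5, 4.9)]

def mse(model, data):
    return mean((model(x) - y) ** 2 for x, y in data)

# Underfit: predict the mean, ignoring x entirely.
y_bar = mean(y for _, y in train)
underfit = lambda x: y_bar

# Overfit: memorize the training set (1-nearest-neighbour lookup).
overfit = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

# Balanced: a simple least-squares line.
x_bar = mean(x for x, _ in train)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
         / sum((x - x_bar) ** 2 for x, _ in train))
line = lambda x: y_bar + slope * (x - x_bar)

for name, m in [("underfit", underfit), ("overfit", overfit), ("line", line)]:
    print(name, round(mse(m, train), 3), round(mse(m, test), 3))
```

The memorizing model scores a perfect zero on the training set yet loses to the plain line on the test set, which is exactly the trade-off the iterations are meant to navigate.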
- Governance: a decision-making body must be able to impose transversal rules (e.g., anonymization), arbitrate between competing needs (e.g., network vs. customer), and pool certain flows, quality treatments, or calculations across departments or uses.
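A transversal anonymization rule, for instance, can be a keyed pseudonymization function that every department applies before data enters shared flows. A minimal sketch, assuming a hypothetical governance-held secret key:

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical key held by the governance body, rotated on schedule

def pseudonymize(customer_id: str) -> str:
    """Keyed hash: the same id always maps to the same token, so joins across
    departments still work, but the id cannot be recovered without the key."""
    return hmac.new(SECRET, customer_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymize("cust-42")
print(token)
```

Because the rule is a single shared function, the governance body can change or revoke it centrally rather than negotiating with each department.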
- Hybrid: a “schemaless” Data Lake meets the need for agility on data. Conversely, structured models are better suited to data governance, and in particular to control data. Teams must therefore be able to work in a “hybrid” environment (i.e., SQL and NoSQL).
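The hybrid idea can be sketched even inside a single engine: structured, governed reference data in SQL tables alongside schemaless JSON documents, joined in one query. Table and field names below are illustrative:

```python
import json
import sqlite3

# Governed, structured side: reference data with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, segment TEXT)")
db.execute("INSERT INTO customer VALUES ('c1', 'premium')")

# Agile, schemaless side: raw events stored as JSON documents.
db.execute("CREATE TABLE event (customer_id TEXT, payload TEXT)")
db.execute("INSERT INTO event VALUES ('c1', ?)",
           (json.dumps({"type": "click", "page": "/offers"}),))

# One query joins the governed model with the flexible documents.
row = db.execute("""
    SELECT c.segment, e.payload FROM event e
    JOIN customer c ON c.id = e.customer_id
""").fetchone()
print(row[0], json.loads(row[1])["page"])  # premium /offers
```

In practice the two sides usually live in different systems (a warehouse and a lake), but the skill the article calls for is the same: querying across both worlds.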