A Data Lake without Enterprise Architecture is A Leap into the Void
The Big Data, Data Science, Data Lake triptych has raised hopes of new business opportunities, based on better exploitation of the supposed value of data.
This new Eldorado, inspired by technologies created by the GAFA, is often perceived as a purely technical solution. However, enterprise architecture makes essential contributions to the value and success of a Data Lake, which are addressed here in three distinct chapters:
- The urbanization of the information system, which anticipates the place that the Data Lake will occupy in the “application landscape” and in the organization of the information system.
- Data modeling and the implementation of repositories and standards, in order to keep control of the Data Lake and prevent the lake from turning into a swamp (Data Swamp).
- Change management driven by data and usage, to be able to transform the innovative ideas coming from your Data Lake into real competitive advantages for the organization.
Indeed, while the Data Lake technically dismantles data silos and opens the door to global, instantaneous analyses that were previously out of reach, it does not automatically give the organization an advantage over its competitors.
Each of the following chapters is a reminder that breaking down data silos is only useful if the information system and the organization are modified accordingly. Not through a dogmatic approach that imposes a framework, but through consultation about the future changes to the organization.
If resistance to change, the kind that annihilates good intentions, shatters your Data Lake initiative, it is likely that some of the following points have been underestimated.
Urbanization: Big Data’s place in the Big Picture
If we stick to a technical definition of the Data Lake (a storage space for raw data on which Data Scientists perform relevant analyses), the adoption strategy is often the same. A modern analysis environment is created and storage space is added. Data flows are then fed in progressively.
Types of use and data quality
A Data Scientist uses all the data for experiments, even data that is erroneous or “uncertain”. A product manager wants consolidated data that can be viewed daily. Marketing wants on-the-fly segmentation to offer the best products. Salespeople are sensitive to time-to-market, from idea to commercial exploitation.
These different modes of operation evolve over time and define logical separations within the Data Lake. These are not silos but usage constraints that a Data Lake does not handle by default, and that require a certain maturity in the urbanization of the Data Lake. A Data Lake is not a standard “off-the-shelf” IT product. An ecosystem must be defined to decouple these uses: flow exchanges, process orchestration, supervision and access permissions are all necessary for the Data Lake to evolve in step with the rest of the IS.
Collection process
Because all raw data is brought into the Data Lake, industrializing data collection requires an overall reflection on the priority qualities expected from the exchanges and on the resulting urbanization.
And not all companies will have the same constraints. A telephony operator can generate one million records per second from a thousand different sources, some of which are subject to legal traceability requirements and others to accounting obligations. A small mutual insurer will struggle to generate five thousand records per day, but some of them will be sensitive health data.
Others will have their data in software packages, some of them in SaaS mode. Still others will exploit data from partners of varying reliability. Companies whose IS is more than 30 years old will have to go through layers of encapsulation around their mainframe. And all these constraints can combine. An urbanization logic is vital.
Data security
The Data Lake also aims to circulate a very large share of the company’s data for uses whose number, nature and end users are bound to evolve. This cannot be done without automating the traceability, supervision and security of exchanges and storage. It is very often a legal obligation.
Identity and Access Management (IAM), API management, and their integration with sensitive or regulated data are topics that enterprise architecture and the CISO must orchestrate.
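As an illustration only, the link between access management, data classification and traceability can be sketched as follows. The role names, the classification levels and the `ALLOWED_ROLES` table are hypothetical and not taken from any particular IAM product; a real deployment would delegate these decisions to the organization’s IAM and API management tools.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical policy: which roles may read which data classification.
ALLOWED_ROLES = {
    "public": {"analyst", "data_scientist", "marketing"},
    "internal": {"analyst", "data_scientist"},
    "sensitive": {"data_scientist"},
}


@dataclass
class AccessAudit:
    """In-memory trace of every access decision (stand-in for a real audit log)."""
    entries: list = field(default_factory=list)

    def record(self, user, dataset, granted):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "granted": granted,
        })


def can_read(user, roles, dataset, classification, audit):
    """Grant read access if one of the user's roles is allowed for the
    dataset's classification. Every decision, granted or not, is traced."""
    allowed = ALLOWED_ROLES.get(classification, set())  # unknown level: deny
    granted = bool(set(roles) & allowed)
    audit.record(user, dataset, granted)
    return granted


audit = AccessAudit()
can_read("alice", ["data_scientist"], "health_claims", "sensitive", audit)
can_read("bob", ["marketing"], "health_claims", "sensitive", audit)
```

The point of the sketch is that the decision and its trace are produced together: access to sensitive data is never evaluated without leaving an audit entry.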
Which modules in the Data Lake ecosystem?
Other structuring elements of your IS must be taken into account in the Data Lake urbanization:
- The “data catalog”, associated repositories and their life cycles,
- The orchestration of the processes surrounding the management of the creations, evolutions or disappearances of the sources and destinations,
- The physical transport of data, the management of transaction integrity and uniqueness, error recovery…
Data normalization must find its place around a Data Lake that favors raw, original data. Whether to push normalization downstream in the processing chain or to run old and new chains side by side depends on the constraints and expectations.
Each IS is specific, this list is far from exhaustive
Setting up a Data Lake without taking into account the impacts, constraints and opportunities generally leads to a poor fit with the strategic issues and the anticipated needs. Who better than the enterprise architect to bring the necessary perspective to the definition of an end-to-end solution that meets your requirements?
Repositories: know yourself
The main advantage of the Data Lake is also its main disadvantage: it breaks down data silos by accepting any data without supervision or governance. However, in these days of GDPR, it is very risky to let anybody access any data.
It is easy to dump unaltered data into a persistent layer accessible to authorized persons (the Data Lake in its purest form). But without repositories that make it possible to measure the value of the different data sources, you lose control of the Data Lake’s content and of all its possible uses.
The repositories mainly hold data about the data: metadata. They let you know which data is available in which source, distinguish reference data, operational data and operating data, and know the refresh rates, available versions, owners, classification and ways of viewing the data.
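To make this concrete, here is a minimal sketch of what one repository entry could look like. The `CatalogEntry` structure, its field names and the sample values are assumptions made for illustration; actual data catalog products each define their own model.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    """The three kinds of data distinguished in the text."""
    REFERENCE = "reference"
    OPERATIONAL = "operational"
    OPERATING = "operating"


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical metadata record describing one dataset in the lake."""
    name: str
    source: str          # system the data comes from
    kind: DataKind
    refresh: str         # refresh rate, e.g. "daily" or "real-time"
    version: str
    owner: str           # person or team responsible for the data
    classification: str  # e.g. "public", "internal", "sensitive"


def datasets_in_source(catalog, source):
    """Answer 'which data is available in which source?'."""
    return [entry.name for entry in catalog if entry.source == source]


catalog = [
    CatalogEntry("customers", "crm", DataKind.REFERENCE,
                 "daily", "v3", "sales-ops", "internal"),
    CatalogEntry("call_records", "network", DataKind.OPERATIONAL,
                 "real-time", "v1", "network-team", "sensitive"),
]
```

Even a structure this small lets the organization answer basic governance questions (who owns a dataset, how fresh it is, how sensitive it is) before anyone queries the lake itself.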
How the data is used in the Data Lake is also essential information if you want to avoid creating “logical silos” that replace the “physical silos”. The repository can record who is in charge of access authorizations for a data item, where it is used, how it is used in experiments, processes or reports, and who the technical, functional or business contacts are…
If a data item exists in several sources, you need to know which source is the reference (the “golden source”), which applications hold a local copy, which ones are able to update the reference, the rules for propagating changes through the IS, and the mechanisms that detect and remedy inconsistencies between sources…
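One possible detection mechanism, sketched under simplifying assumptions: each source is represented as a plain dictionary mapping a record key to a value, and the source names (`crm`, `billing`) are hypothetical. Remediation and propagation rules are deliberately left out.

```python
def find_inconsistencies(golden, copies):
    """Compare local copies against the golden source.

    golden: dict mapping record key -> reference value
    copies: dict mapping source name -> dict of record key -> local value
    Returns a list of (source, key, golden_value, local_value) tuples for
    every local value that differs from the reference.
    """
    issues = []
    for source, records in sorted(copies.items()):
        for key, local_value in sorted(records.items()):
            # Keys unknown to the golden source are ignored in this sketch.
            if key in golden and golden[key] != local_value:
                issues.append((source, key, golden[key], local_value))
    return issues


golden_source = {"c1": "Paris", "c2": "Lyon"}
local_copies = {
    "crm": {"c1": "Paris", "c2": "Lille"},
    "billing": {"c1": "Paris"},
}
issues = find_inconsistencies(golden_source, local_copies)
# issues == [("crm", "c2", "Lyon", "Lille")]
```

Knowing which source is the reference is what makes the comparison meaningful: without a designated golden source, the same check could only report that two copies disagree, not which one is wrong.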
It is not possible to list here all the information that, in one context or another, may be relevant. But it is the enterprise architecture that defines the scope and limits of this data governance.
This governance must ensure that the use of data does not reproduce the old technical silos. It also gives business, functional and technical experts a way to contribute their knowledge of how to use the data they know well. Their commitment and involvement are a large part of breaking down the silos.
Data Lake technology is technically able to accept any data without supervision or governance. But the organizations that took advantage of this to stop monitoring their data, or to avoid putting governance in place, found themselves with a Data Swamp: its management is more complex, its benefits more uncertain, and its operational risks out of all proportion to those of a Data Lake kept under the control of enterprise architects.
Continuous transformation and control by the data: start by aligning in both directions
Opportunities are missed when alignment between the information system and the business is always achieved at the expense of the IS. The technical complexity that evangelists do not see, and the difficulty of making the IS adaptable to unforeseen requirements, make this alignment difficult.
In the case of a Data Lake, when the various business actors access the data catalog and associated services, the business aligns itself with what the IS makes available to it. By opening its catalog and displaying what it is technically possible to provide, the IS channels the requirements of the business. It owes this to the urbanized foundation, which provides technical control of flows, and to governance, which provides functional control of data.
Certainly, the IS will always have work to do to align with the need. But this need will be framed more naturally, and technical impossibilities will be much rarer.
Then mark out the path for spreading innovation
Just as DevOps is the missing link between two worlds whose ways of working are poorly compatible (development and operations), an important step is missing between Data Science (which extracts value and validates the relevance of a new use of a dataset) and the business, which expects this segmentation, visualization or publication to go live quickly.
Your Data Lake may bring you many ideas for new uses of your data assets. It makes sense to put in place simple processes for industrializing these different types of uses.
What is needed is a “DevOps Data”, with tools that are closer to configuration management than to continuous integration. It is less a question of injecting new application versions into the information system than of making uses of differing maturity coexist in the same information system. From the Data Lake, it will be possible to continuously enrich APIs for internal or external use, or to automate the creation of data sets for BI and reporting purposes. This DevOps is implemented mainly around:
- Process orchestration tools,
- An extensible library of connectors,
- The algorithmic work of Data Scientists,
- Management of rights and access to the different services,
- More classic DevOps management of sources, environments and deployments,
- Output converters that provide useful formats.
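As a rough sketch of how some of these elements could fit together, the example below chains a connector, a Data Scientist’s algorithm and an output converter. All function names, the spend threshold and the sample records are hypothetical; a real setup would rely on an actual orchestration tool and would add rights management and deployment tracking.

```python
import csv
import io
import json


def list_connector(rows):
    """Hypothetical connector: wraps an in-memory list as a data source."""
    def read():
        return list(rows)
    return read


def segment_by_spend(records, threshold=100):
    """Stand-in for a Data Scientist's algorithm: tag each customer
    'high' or 'low' depending on an assumed spend threshold."""
    return [
        dict(r, segment="high" if r["spend"] >= threshold else "low")
        for r in records
    ]


def to_json(records):
    """Output converter: JSON, e.g. for an API response."""
    return json.dumps(records)


def to_csv(records):
    """Output converter: CSV, e.g. for a BI or reporting data set."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def run_pipeline(connector, algorithm, converter):
    """Minimal orchestration: read, transform, convert."""
    return converter(algorithm(connector()))


source = list_connector([
    {"customer": "a", "spend": 150},
    {"customer": "b", "spend": 20},
])
api_payload = run_pipeline(source, segment_by_spend, to_json)
report = run_pipeline(source, segment_by_spend, to_csv)
```

The same algorithm feeds both an API payload and a reporting file. Keeping connectors, algorithms and converters interchangeable is what allows uses of differing maturity to coexist without rewriting the chain each time.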
The enterprise architecture associated with a Data Lake allows you to create robust, professional and scalable software without falling into a logic that prioritizes the needs of the many over your specific needs. Your Data Lake becomes the central element of an application that serves your own customized innovations.
Data Lake and transformation
This ongoing, enterprise-wide evolution is what will make the Data Lake a success. This paradigm shift can be problematic, yet the need for change management leading to organizational change is rarely recognized, even though the Data Lake comes from the GAFA and from startups whose culture and organization are often the opposite of the matrix organizations that are currently adopting Big Data and the Data Lake.
This mode of operation will disrupt entire areas of your organization, change the decision-making processes, the scope of responsibilities, the internal and external modes of communication, the life cycles of products and services, and the control of releases. These novelties are anxiety-provoking and will lead to resistance and avoidance strategies.
Enterprise architecture was created precisely to make it easier to adopt a cross-functional approach covering the technical architecture, the business processes and support for the transformation of the organization.
Setting up a Data Lake is a leap into the unknown. Enterprise architecture is your parachute.