The Emergence of Data Lake – What, Why and When?
What is a Data Lake?
A data lake is a powerful data architecture in which large amounts of data are stored unprocessed, in their original format, and are immediately available for analysis. The data lake helps companies remain competitive and meet their information-generation requirements, addressing business and operational challenges that were difficult to tackle with conventional Business Intelligence and data warehousing technologies.
Many business leaders are not sure how to begin analysing their data. One main reason is the lack of a comprehensive company-wide approach to the use cases and business objectives of Big Data projects. Some companies are already experimenting with basic data analysis approaches, but most have not yet been able to carry this out in real time and across the enterprise.
Why Data Lake?
Data lakes can handle any type of data in large amounts, with computations distributed over many nodes of a cluster. The data sourcing process can therefore occur gradually without affecting the existing models. Within the data lake, data is partially processed by business users to facilitate their work.
Traditional data approaches have forced all users onto a pre-designed data schema or data model, whereas the data lake approach relaxes schema standardization and defers data modelling. As data volume, data variety and metadata richness grow, the data lake approach can offer nearly unlimited potential for operational insight and data discovery.
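The deferred-modelling idea is often called schema-on-read. A minimal Python sketch of it, with illustrative field names and records: raw events are stored untouched, and each consumer applies its own schema only at read time.

```python
import json

# Raw events are ingested "as-is" -- no schema is enforced at write time.
raw_events = [
    '{"user": "alice", "amount": "19.99", "ts": "2021-03-01"}',
    '{"user": "bob", "amount": "5.00", "ts": "2021-03-02", "coupon": "X1"}',
    '{"user": "carol", "ts": "2021-03-02"}',  # missing field: kept anyway
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: cast each field, default missing ones."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record.get(field, default))
               for field, (cast, default) in schema.items()}

# Each consumer defines only the schema it needs; the raw data never changes.
billing_schema = {"user": (str, ""), "amount": (float, 0.0)}
rows = list(read_with_schema(raw_events, billing_schema))
```

A different consumer could read the same raw events with a different schema (say, one that keeps `ts` and `coupon`), which is exactly the flexibility a schema-on-write warehouse gives up.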
Bill Schmarzo listed the following benefits of a data lake compared to the data warehouse:
- Rapid ingest of data because the data lake captures data “as-is”; that is, it does not need to create a schema before capturing the data.
- Un-handcuffing the data science team from having to do its analysis on the overly expensive, overly taxed data warehouse.
- Supporting the data science team's need for rapid exploration, discovery, testing, failing, learning and refining of the predictive and prescriptive analytics that power the organization's key business processes and enable new business models.
- Providing an analytics environment where the data science team is free to explore new data sources and new analytic techniques in search of the variables and metrics that may be better predictors of business performance.
- Freeing up expensive data warehouse resources and opening up SLA windows by off-loading ETL processes from the data warehouse and putting them into the natively parallel, scale-out, less expensive data lake.
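The "capture as-is" and ETL-offload points above are often implemented as a landing-zone pattern. A minimal sketch, in which the paths and the date-based partition layout are assumptions for illustration: source files are copied into the lake byte-for-byte, and all transformation is deferred to later jobs.

```python
import shutil
from datetime import date
from pathlib import Path

def ingest_as_is(source_file: Path, lake_root: Path) -> Path:
    """Land a file in the raw zone untouched: no schema, no transformation.

    Files are partitioned by ingest date so that later (deferred) ETL jobs
    can pick up each day's arrivals without scanning the whole lake.
    """
    partition = lake_root / "raw" / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / source_file.name
    shutil.copy2(source_file, target)  # byte-for-byte copy: data stays "as-is"
    return target
```

Because no schema has to exist before the copy, ingest is as fast as the I/O allows, which is the "rapid ingest" benefit in Schmarzo's first point.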
John Thielens provided four additional benefits of the data lake:
- Breaks down silos and routes information into one navigable structure. Data that pours into the lake lives there until it is needed, when it flows back out again.
- Gathers all information into one place without first qualifying whether a piece of data is relevant or not.
- Enables analysts to easily explore new data relationships, unlocking latent value. Data distillation can be performed on demand based on business needs, allowing for identifying new patterns and relationships in existing data.
- Helps deliver results faster than a traditional data approach. Data lakes provide a platform to utilise heaps of information for business benefits in near real-time.
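The "distillation on demand" point can be sketched as follows; the source names and fields are hypothetical. Records from two unrelated sources sit in the same pool, and a relationship between them is derived only when an analyst asks the question, not when the data was ingested.

```python
from collections import defaultdict

# Heterogeneous records pooled in one place, with no upfront qualification
# of which ones are "relevant".
lake = [
    {"source": "web_clicks", "user": "alice", "page": "/pricing"},
    {"source": "support_tickets", "user": "alice", "topic": "billing"},
    {"source": "web_clicks", "user": "bob", "page": "/docs"},
    {"source": "support_tickets", "user": "bob", "topic": "billing"},
]

def distill(records, key):
    """Derive a view on demand: group raw records by a key chosen now,
    long after ingest, rather than one fixed at modelling time."""
    view = defaultdict(list)
    for record in records:
        view[key(record)].append(record["source"])
    return dict(view)

# A new question, asked today: which users appear in both sources?
by_user = distill(lake, lambda r: r["user"])
cross_source = [u for u, sources in by_user.items() if len(set(sources)) > 1]
```

Tomorrow's question (say, grouping by `page` or `topic`) needs no re-ingestion; only the `key` function changes.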
When should companies upgrade and invest in a Data Lake?
Brian McKenna mentioned four signs a company may need to scale-up and invest in a data lake:
- Operational complexity: In a pre-data-lake environment, if a business is trying to scale its infrastructure but has no option to add FTE (full-time equivalent) management support, there is a good chance its data requirements will outstrip its ability to manage them. Traditional tier-1 data resources aren't always pooled virtually, which limits the amount of storage an individual manager can cope with and makes a clear case for a more flexible common storage resource, i.e. a data lake.
- Operational cost: When a company finds that business demands on IT keep growing even while it is trying to reduce OpEx, it is time to look at a new approach. The same operational overheads that limit the addition of FTEs also result in growing OpEx for managing IT resources. To address these requirements, businesses need either more FTEs or additional third-party support to monitor, manage, deploy and improve their systems. The latter approach scales an order of magnitude better, or more, than simply adding headcount.
- Production strain: Another key indicator of the need for a data lake is when existing analytics applications are putting a strain on the production systems of a business. Real-time analytics can be extremely resource-intensive, whether deriving insights through video analytics from dozens of HD video streams or poring through a vast waterfall of social content; dedicated resources are needed so that people using the production systems don't see a drop-off in performance. Data lakes are key to ensuring that real-time analytics can run at optimum performance.
- Multiprotocol analytics: A final key indicator that a business needs a data lake is when data scientists are running apps on a variety of different Hadoop distributions and need to hook their data up to them. Businesses will need multiprotocol support in the future as analytics experimentation carries on, and they need to plan for this with a data lake strategy.
Although not all companies today are ready for complex data analysis, most of them need to prepare for it for long-term competitiveness. It’s just a matter of time before other companies opt for a Data Lake.