Who Are the Data Users and How Should They Access Data?
We have talked about the “7 Big Talent Jobs Created by Big Data” before. In this article, we will continue to discuss these data users and how they should access data.
Who uses data?
1. Data Analyst
The Data Analyst evaluates data through analytical and logical reasoning, using data mining, data crunching and data presentation (DataViz), either to extract a data set (e.g., the customers to target in a commercial campaign according to certain criteria) or to explain a fact (e.g., why are June sales weak?).
Data Analysts are trained to use relational and decision-support databases and to write queries and data mining requests. They can fall back on data manipulation tools and create analysis dimensions to facilitate data mining.
2. Data Scientist
The Data Scientist is a new profession that has appeared this decade with the emergence of Big Data. In addition to the Data Analyst’s activities, the Data Scientist is required to produce new data from Machine Learning models, for example producing a customer risk score, predicting sales for the next three months, or evaluating customer satisfaction from their emails.
The Data Scientist knows the information persistence techniques (relational databases, HDFS, NoSQL) and has mastered the techniques for querying and extracting information. SQL has made a comeback since the Hadoop ecosystem adopted SQL-like tools such as Hive and Spark SQL. The Data Scientist is also familiar with programming languages (Python with Pandas, Java or Scala for manipulating Spark data frames) used to transform data (usually in the form of matrices).
3. Business Analyst
The Business Analyst is an operational member of a business unit who takes over from the Data Analyst for the day-to-day use of the data and to satisfy the regular demands of that business unit.
These users have an advanced command of the data, both technical and functional. They can adjust pre-prepared queries and even write their own SQL (ad hoc queries), build a dashboard or report from already prepared data, manipulate data in a table or spreadsheet, create processing steps (with macros or DataViz tools), and make all of this available to their own business unit.
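As an illustration of the difference between adjusting a pre-prepared query and writing an ad hoc one, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and its figures are hypothetical and exist only for this example; they are not taken from any real system.

```python
import sqlite3

# Hypothetical prepared data set: monthly sales per region.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
INSERT INTO sales VALUES
  ('North', '2024-05', 120.0),
  ('North', '2024-06', 80.0),
  ('South', '2024-05', 100.0),
  ('South', '2024-06', 95.0);
""")

# Pre-prepared query: the analyst only adjusts the parameter (the month).
PREPARED_MONTHLY_TOTAL = "SELECT SUM(amount) FROM sales WHERE month = ?"


def monthly_total(connection, month):
    """Run the pre-prepared query for a given month."""
    return connection.execute(PREPARED_MONTHLY_TOTAL, (month,)).fetchone()[0]


# Ad hoc query: written from scratch to break June sales down by region.
AD_HOC_JUNE_BY_REGION = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE month = '2024-06'
GROUP BY region
ORDER BY total
"""

may_total = monthly_total(conn, "2024-05")
june_total = monthly_total(conn, "2024-06")
june_by_region = conn.execute(AD_HOC_JUNE_BY_REGION).fetchall()

print(may_total, june_total)
print(june_by_region)
```

The prepared query keeps the SQL fixed and exposes only a parameter, while the ad hoc query gives the analyst full control over grouping and filtering.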
4. End users
End users are “consumers” of the data made available to them; their interactions with data are limited to what the DataViz application or the prepared data set provided to them allows.
Given the characteristics and skills expected of each of these populations, we can evaluate how each role should use the data access functions.
1. Access to data
Not all users need to query the Big Data store directly. It may have operational constraints (processing of very large volumes of data that must finish by a given time), technical constraints (relatively long-running queries), or contain data whose unit information value is very low.
The information is therefore organized in successive stages that consolidate and simplify access to it, from the rawest to the most consolidated, as described in the following diagram.
The lowest level can accommodate any type of data (structured, image, text, sound). This information is then synthesized, simplified and enhanced as one ascends through the stages, and its business value grows.
At the lowest level, one can imagine finding every receipt of a supermarket chain; at the highest, the annual turnover broken down along different axes of analysis (such as socio-professional category, obtained in particular by cross-referencing receipts with loyalty cards).
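The move from the lowest stage to the highest can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the receipt records, the loyalty-card repository, the field names and the category labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lowest level: one record per receipt.
receipts = [
    {"receipt_id": 1, "card_id": "C1", "year": 2023, "amount": 42.50},
    {"receipt_id": 2, "card_id": "C2", "year": 2023, "amount": 17.00},
    {"receipt_id": 3, "card_id": "C1", "year": 2023, "amount": 8.25},
    {"receipt_id": 4, "card_id": None, "year": 2023, "amount": 5.00},  # no card used
]

# Hypothetical loyalty-card repository: card -> socio-professional category.
loyalty_cards = {"C1": "employee", "C2": "retired"}


def annual_turnover_by_category(receipts, loyalty_cards):
    """Cross receipts with loyalty cards and sum amounts per (year, category)."""
    totals = defaultdict(float)
    for receipt in receipts:
        # Receipts without a known card fall into an "unknown" category.
        category = loyalty_cards.get(receipt["card_id"], "unknown")
        totals[(receipt["year"], category)] += receipt["amount"]
    return dict(totals)


turnover = annual_turnover_by_category(receipts, loyalty_cards)
for (year, category), total in sorted(turnover.items()):
    print(year, category, total)
```

Four detailed receipts become three consolidated figures, each carrying more business meaning than any single receipt.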
2. Access to raw data
Access to the raw data is mainly reserved for the Data Scientist, who sets up data acquisition (real-time events, files, etc.) and quality control (file format, number of columns and lines, etc.).
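A quality control of this kind could look like the following sketch, which checks the header, the number of columns and the number of lines of a CSV file. The function name `check_csv`, the expected column names and the sample payloads are assumptions made for the example.

```python
import csv
import io

# Hypothetical expected layout of an incoming file.
EXPECTED_COLUMNS = ["sensor_id", "timestamp", "value"]


def check_csv(text, expected_columns, min_rows=1):
    """Return a list of quality problems found in a CSV payload (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header, data = rows[0], rows[1:]
    if header != expected_columns:
        problems.append(f"unexpected header: {header}")
    if len(data) < min_rows:
        problems.append(f"expected at least {min_rows} data rows, found {len(data)}")
    # Line numbers start at 2 because line 1 is the header.
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {line_no}: {len(row)} columns instead of {len(expected_columns)}"
            )
    return problems


good = "sensor_id,timestamp,value\ns1,2024-01-01T00:00:00,20.5\n"
bad = "sensor_id,timestamp,value\ns1,2024-01-01T00:00:00\n"

print(check_csv(good, EXPECTED_COLUMNS))
print(check_csv(bad, EXPECTED_COLUMNS))
```

Returning a list of problems rather than raising on the first one lets the acquisition step report everything wrong with a file in a single pass.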
Note that image processing or natural language processing techniques are used to extract structured information from semi-structured or unstructured sources (image, text, sound). These can include identifying words, identifying shapes in images, or even detecting particular sounds. The objective is to extract useful information from the raw medium. For example, to assess a level of customer satisfaction, comments are classified into categories (satisfied / not satisfied), which are then integrated into the structured information.
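To make the comment example concrete, here is a deliberately naive sketch that turns free-text comments into a structured category. The keyword lists, the category labels and the function name are hypothetical; a real system would more likely rely on a trained language model than on keyword counting.

```python
import re

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"great", "good", "happy", "fast", "thanks"}
NEGATIVE = {"bad", "slow", "broken", "unhappy", "late"}


def classify_comment(comment):
    """Map a free-text comment to a satisfaction category by keyword counting."""
    words = re.findall(r"[a-z']+", comment.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "satisfied"
    if score < 0:
        return "not satisfied"
    return "undetermined"


comments = [
    "Great service, fast delivery, thanks!",
    "The parcel arrived late and the box was broken.",
    "I ordered on Monday.",
]

# Structured output that can be stored next to the other customer data.
structured = [
    {"comment_id": i, "category": classify_comment(text)}
    for i, text in enumerate(comments, start=1)
]
print(structured)
```

Whatever the technique, the result is the same kind of object: a small structured record derived from an unstructured source.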
3. Access to transformed data
The transformed data are derived from the raw data, to which a series of specific processes (a pipeline) is applied to clean, enrich or even aggregate them. To do this, the Data Engineer works with a business manager or data manager (who knows the data produced in their business area) and with the other actors to collect their needs.
In the mining industry, even the most productive mines must process four tons of earth to extract four grams of gold. The Data Engineer is a miner of the same kind, who must extract high value-added information from a large volume of data, going from a dataset with billions of lines to one of a few thousand.
By way of example, a raw data item is a physical sensor measurement produced every millisecond by a fleet of one thousand sensors. The transformed data can be:
- data enriched via repositories (sensor location, sensor type, etc.); the level of detail is then at its maximum
- data enriched and aggregated along one or more axes (e.g., the average value per “city”); the level of detail is more “macro” but still compatible with what users request.
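The two kinds of transformed data described above can be sketched as follows. The sample measurements, the sensor repository and the field names are hypothetical, and the example runs in memory, whereas a real pipeline at this volume would run on a distributed engine.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw measurements (timestamps in milliseconds).
measurements = [
    {"sensor_id": "s1", "ts_ms": 0, "value": 20.0},
    {"sensor_id": "s1", "ts_ms": 1, "value": 22.0},
    {"sensor_id": "s2", "ts_ms": 0, "value": 30.0},
    {"sensor_id": "s3", "ts_ms": 0, "value": 10.0},
]

# Hypothetical repository describing each sensor.
sensor_repository = {
    "s1": {"city": "Lyon", "type": "temperature"},
    "s2": {"city": "Lyon", "type": "temperature"},
    "s3": {"city": "Paris", "type": "temperature"},
}


def enrich(measurements, repository):
    """Maximum level of detail: one enriched row per raw measurement."""
    return [{**m, **repository[m["sensor_id"]]} for m in measurements]


def average_by(rows, axis):
    """More 'macro' view: average value along one analysis axis."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[axis]].append(row["value"])
    return {key: mean(values) for key, values in groups.items()}


enriched = enrich(measurements, sensor_repository)
by_city = average_by(enriched, "city")

print(enriched[0])
print(by_city)
```

The enriched rows keep every measurement, while the aggregated view trades detail for a much smaller and more readable data set.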
The Data Engineer knows large-scale data processing techniques and the characteristics of parallel and distributed architectures; the other actors do not necessarily have the computing knowledge specific to these environments.
The Data Analyst is better able to understand the data and their relationships, and therefore uses visualization tools to navigate this layer and search for weak signals, hidden links and correlations.
4. Access to exposed data
Compared to the transformed data, the exposed data correspond to a particular problem (use case) or are sufficiently prepared to be made directly available to a wide range of actors.
Most often, the exposed data are consolidated in more traditional databases (RDBMS) and can be queried by analysts without advanced computer skills. Access is therefore through the analysts’ usual tools (SAS, PowerBI, Qlik, Tableau…).
To return to the example above, the exposed data would be the sensor measurements cross-referenced with equipment maintenance data (installation, repair, monitoring). In this way, we would be able to identify potential causes of anomalies in the fleet and explain why some measurements are out of range. These data result from cross-referencing, have business calculations (KPIs) applied to them, and are most often exposed in operational control tools.
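Such a cross-reference can be sketched with a small relational database, here SQLite through Python's built-in sqlite3 module. The table names, the sample rows, the anomaly threshold and the KPI definition are all assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_daily (sensor_id TEXT, day TEXT, avg_value REAL);
CREATE TABLE maintenance (sensor_id TEXT, day TEXT, operation TEXT);
INSERT INTO sensor_daily VALUES
  ('s1', '2024-01-01', 20.0),
  ('s1', '2024-01-02', 55.0),
  ('s2', '2024-01-01', 21.0),
  ('s2', '2024-01-02', 58.0);
INSERT INTO maintenance VALUES
  ('s1', '2024-01-02', 'repair');
""")

# Hypothetical business rule: a daily average above this value is an anomaly.
THRESHOLD = 50.0

# Cross anomalous measurements with maintenance operations on the same day.
ANOMALY_QUERY = """
SELECT d.sensor_id, d.day, d.avg_value, m.operation
FROM sensor_daily AS d
LEFT JOIN maintenance AS m
  ON m.sensor_id = d.sensor_id AND m.day = d.day
WHERE d.avg_value > ?
ORDER BY d.sensor_id
"""
anomalies = conn.execute(ANOMALY_QUERY, (THRESHOLD,)).fetchall()

# Hypothetical KPI: share of anomalies that coincide with a maintenance operation.
explained = [row for row in anomalies if row[3] is not None]
kpi_explained_share = len(explained) / len(anomalies) if anomalies else 0.0

print(anomalies)
print(kpi_explained_share)
```

The LEFT JOIN keeps anomalies that have no matching maintenance record, which are precisely the ones that still need an explanation.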