Data Engineer: The Expert Closest to the Data
This technical profile, which emerged several years ago, has become a key player in the success of Big Data projects. What do data engineers do? What skills do these highly sought-after experts have, given that there is not yet a dedicated training program for them?
The data engineering profession is significantly transforming and shaping the world of technology. It is an emerging profession at the crossroads of innovation, with enormous potential for growth and a major impact on business. This new expert profile has appeared in the last few years, now that companies are mature enough to industrialize their data processing. It covers many aspects, but the common thread is managing data flows so that the data is clean, available and usable by everyone.
Key Actor of the Project
If data is the new black gold, this raw material still has to be made exploitable. This is where the data engineer comes in. Depending on the project and environment, they can intervene throughout the project life cycle. This makes them an ideal expert for small businesses, where versatility is sought after. First, they play a key role upstream: understanding the customer’s digital environment, then collecting, preparing and storing its data so that data scientists can extract value from it and develop models. They are also increasingly in demand on projects to support data scientists in industrializing their models: reviewing the code, optimizing it, and making sure the solution can be operated and scaled by helping them run automated tests, deploy, and schedule the processing jobs.
These two missions are very different, and this contributes to making the definition of this new job a little vague. In general, data engineers work on both, side by side with data scientists. But some prefer to focus on the upstream part. Data engineers may also work alone when the job is limited to engineering, such as creating a data lake: in other words, preparing a customer’s data and bringing it together in one place to make it easier for other systems to exploit later.
Expert in Cleaning and Piping
What does this data preparation involve? It is not necessarily new, some would say, but in the era of Big Data it takes on considerable importance. It may, for example, mean feeding a data acquisition platform by retrieving sensor measurements in real time and storing them in an exploitable format. It can also include collecting data from different sources and systems and homogenizing their storage format. Sometimes it involves scraping, in other words the automatic extraction of web content. The data engineer is curious and resourceful: a pipeline specialist!
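To make the idea of homogenizing storage formats concrete, here is a minimal Python sketch using only the standard library. It assumes two hypothetical sources, one delivering CSV and one delivering JSON, and maps both onto a single record layout written as one JSON object per line. The field names (sensor, time, value, id, ts, reading) and the source labels are invented for illustration and do not come from any real system.

```python
import csv
import io
import json


def from_csv(text, source):
    """Read rows with the hypothetical columns 'sensor', 'time', 'value'."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "source": source,
            "sensor_id": row["sensor"],
            "timestamp": row["time"],
            "value": float(row["value"]),
        }


def from_json(text, source):
    """Read a JSON array of objects with the hypothetical keys 'id', 'ts', 'reading'."""
    for item in json.loads(text):
        yield {
            "source": source,
            "sensor_id": item["id"],
            "timestamp": item["ts"],
            "value": float(item["reading"]),
        }


def to_json_lines(records):
    """Serialize all records in one common format: one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    csv_text = "sensor,time,value\nA1,2024-01-01T00:00:00Z,21.5\n"
    json_text = '[{"id": "B7", "ts": "2024-01-01T00:00:05Z", "reading": 3}]'
    records = list(from_csv(csv_text, "plant_a")) + list(from_json(json_text, "plant_b"))
    print(to_json_lines(records))
```

Whatever the source looked like, everything downstream now deals with a single record shape. In a real project the target would more likely be a database or a columnar file format, but the principle is the same.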
The “cleaning” job consists of adjusting the data format so that it can be read by the data scientists, handling missing data so that it does not disrupt subsequent processing, and standardizing the databases (column names, for example) so they can be shared with customers. The next step is to bring the data to a place where the data scientist can query it. Companies often have very heterogeneous information systems (IS). The data is spread over multiple applications, in different formats of varying complexity. A big part of a data engineer’s job is to explore the customer’s IS to find out how to retrieve the data before cleaning it to make it workable. It’s a thankless job, since some deliverables are CSV files rather than attractive websites, but it is technically very interesting and fundamental.
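The cleaning steps described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a reference implementation: the list of tokens treated as missing, the naming rule for columns, and the sample columns are all assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical tokens treated as "missing"; real sources will use others.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def standardize_column(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")


def clean_rows(raw_csv_text):
    """Yield one dict per row, with standardized keys and None for missing values."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    for row in reader:
        cleaned = {}
        for key, value in row.items():
            if key is None:
                # Extra cells beyond the header row are ignored in this sketch.
                continue
            value = (value or "").strip()
            is_missing = value.lower() in MISSING_TOKENS
            cleaned[standardize_column(key)] = None if is_missing else value
        yield cleaned


if __name__ == "__main__":
    raw = "Sensor ID,Temp (C)\nA1, 21.5 \nA2,N/A\n"
    for row in clean_rows(raw):
        print(row)
```

Representing missing values explicitly as None, rather than leaving strings such as "N/A" in place, lets downstream processing decide how to treat them instead of failing on an unexpected value.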
This calls for a variety of solutions. It can involve programming, developing specific software or configuring vendor software. It is very much case by case. It is, in fact, a rather classic developer job, but with a Big Data skill set. The key technologies relate to the storage and processing of large volumes of data: Hadoop for long-term storage, Spark (programmed in Scala) for distributed computing and machine learning, Kafka for distributed message queuing (receiving and storing messages or events before they are consumed), and so on. Knowledge of cloud technologies is a valued asset, since it helps take advantage of these infrastructures when the customer rents this type of resource.
A Highly Valued Profile, and Very Difficult to Recruit
As will be clear by now, the data engineer works with many stakeholders: other data engineers, data scientists, data analysts and operational staff. A data engineer has team spirit and likes to share, pass on knowledge, and explain and justify their choices, their methods and the difficulties encountered. They must also deal with all the political aspects inevitably linked to data: simply because data engineers handle it, move it, anonymize it and so on, their actions can go as far as steering the company’s data governance.
Where are these rare gems found? That is the problem! The recruitment market for these indispensable profiles is even tighter than the one for data scientists, because there is not yet a data engineering training course that teaches how to handle large volumes of data, with use cases, in real time, and so on.
As a result, these key positions are often filled by people who already have a few years of experience, have a generalist background, and have worked with various tools: former developers, people from Business Intelligence, and sometimes data scientists. There are also people from other sciences such as theoretical physics, who have a solid background in mathematics, have often handled large databases, and have decided to delve into code.
Junior data engineers, meanwhile, come from computer science schools or general engineering programs, with a path halfway between software development and Big Data technologies. But Spark and Kafka are generally not taught, and Hadoop is only skimmed over. To become a data engineer, you have to want to learn on your own, and then keep a constant watch on these emerging technologies, which evolve very quickly.