The term “data migration” usually refers to the one-off processing of large volumes of data arranged in batches, with a more or less long delay between one processing run and the next (a new migration every 24 hours, for example).
While such data migration can be effective for managing large amounts of data, it is much less suited to data that arrives in smaller volumes as a continuous flow. One of the risks is launching a processing run on stored data that is already out of date.
Data streaming is the expression used to describe this continuous stream of data, smaller in terms of volume, which conventional data-processing methods handle only poorly.
What is Data Streaming?
One of the defining features of “streaming data” is that data sets are sent continuously rather than in spaced batches. In this sense, data streaming is particularly relevant for data whose beginning or end cannot be clearly identified. An example helps: a traffic light emits data continuously; its colored light signals are not meant to go out for good at any given moment.
Data streaming therefore applies well to data sources that send data in continuous streams, at moderate volume (of the order of one kilobyte) and at the very moment when data is generated. These sources include: telemetry operations from connected devices, log files generated by users during their various journeys on an application or a website, e-commerce transactions, or any information from social networks.
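To make this concrete, here is a minimal sketch of such a source: a Python generator that emits small telemetry records, one at the moment each reading is produced, with no defined end. The device name, field names, and value ranges are illustrative assumptions, not part of any real protocol.

```python
import json
import random
import time

def telemetry_stream(device_id):
    """Endlessly yield small telemetry records (well under a kilobyte),
    one at the moment each reading is generated. The fields below are
    hypothetical, chosen only to illustrate a continuous source."""
    while True:
        record = {
            "device": device_id,
            "ts": time.time(),
            "temperature": round(random.uniform(18.0, 25.0), 2),
        }
        yield json.dumps(record)  # each record is a few dozen bytes

# A consumer reads records as they arrive; the stream itself has no end,
# so here we deliberately stop after three records:
stream = telemetry_stream("sensor-42")
for _, record in zip(range(3), stream):
    print(record)
```

The point is the shape of the source, not the payload: each record is tiny, self-describing, and available immediately, which is exactly what batch-oriented processing handles poorly.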
Let’s complete these elements of definition with a metaphor: that of the river. From the shore, the eye does not try to find the beginning or the end of the stream. Rather, it pays attention to the flow, the way the river runs.
How to exploit this resource?
The use of data streaming is optimal when time is a key dimension of the data you want to analyze, or when you want to detect, figures in hand, the emergence of trends over time. Monitoring the duration of a web session is one example. More broadly, most data from the Internet of Things (IoT) is well suited to data streaming: traffic sensors, transaction or activity logs, and so on.
This streaming data is frequently used for real-time aggregation, sorting, or sampling operations. Data streaming allows data to be analyzed and monitored in real time, which makes it possible to generate information on a wide range of activities (counting, server activity, device geolocation, clicks on a website).
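One of the simplest of these real-time aggregations is a tumbling window: counting events per fixed time interval and emitting each count as soon as its window closes. The sketch below assumes time-ordered `(timestamp, payload)` events; the function name and window size are illustrative.

```python
def tumbling_counts(events, window_s=60):
    """Yield (window_start, count) each time a tumbling window closes.
    Events are (epoch_seconds, payload) tuples, assumed time-ordered."""
    current_window, count = None, 0
    for ts, _payload in events:
        window = int(ts // window_s) * window_s
        if current_window is None:
            current_window = window
        if window != current_window:
            yield current_window, count  # the previous window is complete
            current_window, count = window, 0
        count += 1
    if current_window is not None:
        yield current_window, count  # flush the last, still-open window

# Four clicks: two in the first minute, two in the second.
clicks = [(0.5, "a"), (10.0, "b"), (61.0, "c"), (62.0, "d")]
print(list(tumbling_counts(clicks)))  # [(0, 2), (60, 2)]
```

Because results are emitted as each window closes, downstream monitoring sees the count for a minute right after that minute ends, rather than waiting for a nightly batch.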
Let’s imagine a few scenarios:
- A bank monitors market developments and continually adjusts certain parameters of its client portfolios on the basis of predefined conditions (for example, sell whenever the value of a share crosses a certain threshold);
- An electrical device monitors the capacity of a network in real time and issues alerts when certain thresholds are exceeded;
- An e-commerce site generates high frequency clickstream analysis to detect abnormal behavior and activate an alert system if necessary.
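The first scenario above can be sketched in a few lines: watch a stream of price ticks and fire an alert the moment a value crosses a predefined threshold from below. The ticker symbol, prices, and threshold are made up for illustration.

```python
def threshold_alerts(ticks, threshold):
    """Scan a stream of (symbol, price) ticks and emit an alert
    the moment a price crosses the threshold from below."""
    last = {}
    for symbol, price in ticks:
        prev = last.get(symbol)
        if prev is not None and prev < threshold <= price:
            yield f"ALERT: {symbol} crossed {threshold} (at {price})"
        last[symbol] = price

ticks = [("ACME", 98.0), ("ACME", 99.5), ("ACME", 101.2), ("ACME", 100.4)]
for alert in threshold_alerts(ticks, threshold=100.0):
    print(alert)  # fires once, on the tick that crosses 100.0
```

Note that the condition compares each tick with the previous one, so staying above the threshold does not re-trigger the alert; only the crossing itself does, which is usually what such monitoring wants.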
While the exploitation of data streaming undeniably leads to concrete applications, the power of the method still runs up against some challenges frequently encountered when working with streaming data sources. Here are a few worth keeping in mind if you are embarking on a data streaming project:
- Ensure the scalability of your pipeline;
- Understand how long your data remains relevant;
- Incorporate fault tolerance in both the storage and processing layers.
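Fault tolerance in the processing layer often comes down to checkpointing: persisting how far into the stream you have processed, so a restart resumes where the previous run stopped instead of reprocessing or losing records. Here is a minimal, file-based sketch of that idea; the function names and checkpoint format are assumptions for illustration, not a production design (real systems typically delegate this to the streaming platform).

```python
import json
import os
import tempfile

def process_with_checkpoint(records, checkpoint_path, handle):
    """Process records in order, persisting the last processed offset
    after each record so a restart resumes where the crash occurred."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    for offset, record in enumerate(records):
        if offset < start:
            continue  # already processed before the restart
        handle(record)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset + 1}, f)

# Simulate a run, then a restart on a longer stream:
seen = []
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
process_with_checkpoint(["a", "b", "c"], path, seen.append)
process_with_checkpoint(["a", "b", "c", "d"], path, seen.append)  # resumes at "d"
print(seen)  # ['a', 'b', 'c', 'd']
```

Writing the checkpoint after each record gives at-least-once behavior: a crash between `handle` and the checkpoint write replays one record, which the storage layer must then tolerate.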