Speech Recognition: How Does It Work?
Like many people in recent years, you have probably witnessed the massive democratization of personal assistants. Seeing a friend or co-worker give strange orders or ask funny questions to a phone has never been more normal than it is today. As you may have noticed, we are in the era of cognitive technologies, and voice recognition is one of them.
Many people talk about it, but who really knows how it works?
To better understand this process, we must understand what it is composed of. In total, five technological bricks form the process of speech recognition.
The first step, which initiates the whole process, is commonly called the Wake-up Word (WUW). It is not necessarily a voice command; it can also be a button press or another interaction between the user and the machine. The main purpose of this step is to activate the speech recognition (the STT, explained later), or to “wake up” the system so that it starts recording. This matters all the more in today's context, where many people distrust technology for fear of seeing their privacy violated. Until the user performs the required action or pronounces the required words, the voice recognition stays in standby and records nothing.
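The standby behavior described above can be sketched as a small gate. This is only an illustration: the wake-up phrase, the `WakeGate` class, and the idea of receiving already-detected text are assumptions made for the example, not how any particular assistant is built.

```python
from dataclasses import dataclass, field

WAKE_WORD = "hello assistant"  # hypothetical wake-up phrase


@dataclass
class WakeGate:
    """Keeps the recorder in standby until a wake-up event occurs."""
    awake: bool = False
    recorded: list = field(default_factory=list)

    def on_button_press(self) -> None:
        # A button press is an alternative wake-up interaction.
        self.awake = True

    def on_audio(self, heard: str) -> None:
        # `heard` stands in for what a small local keyword spotter detected.
        if not self.awake:
            # Standby: nothing is stored, we only look for the wake-up phrase.
            if WAKE_WORD in heard.lower():
                self.awake = True
            return
        # Awake: record this utterance for the STT, then return to standby.
        self.recorded.append(heard)
        self.awake = False


gate = WakeGate()
gate.on_audio("what a nice day")    # ignored, the system is in standby
gate.on_audio("Hello assistant")    # wakes the system up
gate.on_audio("turn on the lamp")   # recorded
gate.on_audio("this is private")    # ignored again
print(gate.recorded)                # ['turn on the lamp']
```

Only the utterance that follows a wake-up event is kept; everything said while the gate is in standby is dropped.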
Once the system is active, it has to process what is said. To do this, the speech must first be recorded and digitized by the speech-to-text (STT) component: put simply, it must be recognized. At the end of this step, the voice has been converted into sound frequencies (as with music) that the system can interpret. To make these frequencies easier to interpret, several processing steps are applied:
- Normalization, which smooths out peaks and troughs across the different frequencies in order to harmonize them.
- Removal of background noise to improve the audio quality.
- Segmentation into phonemes (distinctive units of sound, whose duration is measured in thousandths of a second, that distinguish words from each other).
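The three processing steps listed above can be approximated in a few lines. The sketch below is deliberately simplified: it works on raw amplitudes rather than on a frequency spectrum, the noise removal is a plain threshold, and fixed 25 ms frames stand in for real phoneme segmentation. All function names and parameter values are assumptions chosen for the example.

```python
import math


def normalize(samples, peak=1.0):
    """Scale the signal so its largest absolute amplitude equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)
    return [s * peak / largest for s in samples]


def noise_gate(samples, threshold=0.05):
    """Crude noise removal: silence every sample quieter than `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def split_frames(samples, sample_rate, frame_ms=25):
    """Cut the signal into consecutive frames of `frame_ms` milliseconds."""
    size = max(1, sample_rate * frame_ms // 1000)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# Synthetic input: 100 ms of a 440 Hz tone sampled at 8 kHz, plus a small offset.
sample_rate = 8000
raw = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) + 0.01
       for n in range(800)]

clean = noise_gate(normalize(raw))
frames = split_frames(clean, sample_rate)
print(len(frames), len(frames[0]))  # 4 200
```

A real recognizer would usually compute spectral features on overlapping frames, but the order of operations (harmonize, clean, segment) is the same as in the list.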
The frequencies can then be analyzed by a previously trained neural network (deep learning): an algorithm capable of analyzing a large amount of information and building a “database” of associations between frequencies and words. Through statistical analysis in particular, this makes it possible to match a frequency to the most common word, which is therefore, in theory, the most likely to be correct.
For example, take the two sentences “a glass of water” and “a class of water”: the first one will be retained, because “glass” is used more often than “class” with “water”.
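The choice between the two sentences can be illustrated with a toy frequency table. The counts below are invented for the example; in a real system they would come from a model trained on a large corpus, and `pick_candidate` is a hypothetical helper, not part of any library.

```python
# Made-up co-occurrence counts, standing in for statistics that a trained
# model would have learned from a large amount of text.
COOCCURRENCE = {
    ("glass", "water"): 950,
    ("class", "water"): 12,
}


def pick_candidate(candidates, context_word):
    """Return the candidate seen most often together with `context_word`."""
    return max(candidates,
               key=lambda word: COOCCURRENCE.get((word, context_word), 0))


# The recognizer hesitates between two similar-sounding words.
best = pick_candidate(["class", "glass"], "water")
print(f"a {best} of water")  # a glass of water
```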
Once the voice recognition and the different processing steps are done, the data is sent directly to the NLP (Natural Language Processing) system. The main mission of this technology is to analyze the sentence and extract its meaning. To do this, it starts by splitting the sentence into individual words (this is called tokenization) and then associates tags with those words. These tags are in effect “labels” attached to each word in order to characterize it.
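A minimal sketch of this step: the sentence is first split into words, and each word then receives a label. The tiny `LEXICON` and its tag names are hypothetical; a real NLP system learns its labels from annotated data rather than from a hand-written table.

```python
import re

# Hypothetical mini-lexicon used only for this illustration.
LEXICON = {
    "turn": "VERB",
    "on": "PARTICLE",
    "the": "DETERMINER",
    "lamp": "NOUN",
    "in": "PREPOSITION",
    "kitchen": "NOUN",
}


def tokenize(sentence):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def tag(tokens):
    """Attach a label to each token ('UNKNOWN' if it is not in the lexicon)."""
    return [(token, LEXICON.get(token, "UNKNOWN")) for token in tokens]


tagged = tag(tokenize("Turn on the lamp in the kitchen."))
for word, label in tagged:
    print(f"{word:8} {label}")
```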
The importance of the NLP lies in its ability to translate textual elements (i.e., words and sentences) into standardized orders (always in the same format) that can then be interpreted by the Artificial Intelligence. The AI is the centerpiece that actually carries out the stated order. Artificial intelligences work in different ways, some more basic than others, but they typically rely on the following elements:
- The context (where is the system? Why? And with whom?)
- The information (the objects, known users, the current state of the objects, inventory, schedules, etc.)
- The external services (access to third-party APIs to, for example, order food, get train schedules, run an internet search, or listen to streaming music)
The idea is to group these different elements and make links between them in order to obtain results that are intended to be relevant and effective. Here is a (very basic) illustration of AI in the context of home automation:
- Context: at home, controlling connected objects, for the users
- Information: Lamp, Refrigerator, Shutters, Television (on), Heating (26 °C)
- External services: Weather, Wikipedia, SNCF
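The home automation illustration above can be turned into a small sketch that groups context, information, and external services, then executes a standardized order. Everything here is an assumption made for the example: the order format (`intent`, `target`, `value`), the `HomeAssistant` class, and the `weather_service` stub, which returns a fixed answer instead of calling a real API.

```python
from dataclasses import dataclass, field


def weather_service(city):
    # Stub for an external weather API (hypothetical, fixed answer).
    return f"Sunny in {city}"


@dataclass
class HomeAssistant:
    context: dict = field(default_factory=lambda: {
        "place": "home", "purpose": "control connected objects"})
    information: dict = field(default_factory=lambda: {
        "lamp": "off", "shutters": "closed", "television": "on", "heating": 26})
    services: dict = field(default_factory=lambda: {"weather": weather_service})

    def handle(self, order):
        """Execute a standardized order and return the feedback text."""
        intent, target = order["intent"], order["target"]
        if intent == "set_state" and target in self.information:
            # Update the known state of a connected object.
            self.information[target] = order["value"]
            return f"{target} is now {order['value']}"
        if intent == "query_service" and target in self.services:
            # Delegate the request to an external service.
            return self.services[target](order["value"])
        return "Sorry, I cannot do that."


assistant = HomeAssistant()
print(assistant.handle({"intent": "set_state", "target": "lamp", "value": "on"}))
print(assistant.handle({"intent": "query_service", "target": "weather", "value": "Paris"}))
```

Because every order arrives in the same format, the AI only has to look up the target in its information or its services, whatever sentence the user originally spoke.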
The TTS (Text To Speech) concludes the process. It corresponds to the AI's feedback, which can take the form of a sound, a voice, or a displayed text, for example. This feedback communicates information to the user and completes the human-machine interface.
Once the cycle is complete, a person can converse with the machine and give it orders. To summarize, the sentence is captured, then interpreted, and then executed as an action, after which the system gives feedback to the user (spoken or not).
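The full cycle can be summarized as a chain of five small functions. Each stage is a stub: the "audio" already carries its transcript, the NLP recognizes a single sentence pattern, and the TTS returns text instead of synthesizing sound. All names and the event format are assumptions for this illustration.

```python
def speech_to_text(audio):
    # Stub STT: pretend the audio has already been transcribed.
    return audio["transcript"]


def understand(text):
    # Stub NLP: map one known sentence pattern to a standardized order.
    words = text.lower().split()
    if words[:2] == ["turn", "on"] and len(words) > 2:
        return {"intent": "set_state", "target": words[-1], "value": "on"}
    return {"intent": "unknown"}


def act(order, devices):
    # Stub AI: apply the order to the known devices and produce feedback.
    if order["intent"] == "set_state" and order["target"] in devices:
        devices[order["target"]] = order["value"]
        return f"The {order['target']} is now {order['value']}."
    return "Sorry, I did not understand."


def text_to_speech(message):
    # Stub TTS: a real system would synthesize audio here.
    return f"[spoken] {message}"


def run_cycle(event, devices):
    if not event["wake"]:
        return None  # Standby: nothing is recorded or processed.
    text = speech_to_text(event["audio"])
    return text_to_speech(act(understand(text), devices))


devices = {"lamp": "off"}
event = {"wake": True, "audio": {"transcript": "Turn on the lamp"}}
print(run_cycle(event, devices))  # [spoken] The lamp is now on.
```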
As the most experienced among you will have understood, this article explains a complex technology in a very simple way. The idea here is not to make you an expert in this field but to give you a sense of how it works and how its parts fit together.