Text mining is far from a new concept, but it is experiencing a new boom. With the emergence of Big Data, storing digital data is no longer a problem and the number of available data sources keeps growing, which makes the analysis of textual data a crucial issue.
What is text mining?
Text mining brings together the data management and data mining techniques used to process one particular kind of data: text. By textual data we mean, for example, documents, answers to open-ended questions in a questionnaire, text fields in a business application where customer advisers enter in real time the information customers give them, emails, posts on social networks, articles, reports, and so on.
One of the central aspects of text mining is transforming this textual data, which has no structure other than that of the language used, into data that conventional data mining algorithms can exploit. In practice, this means turning raw text into a data table that the analysts responsible for making sense of it can work with. The next step is to apply the statistical methods best suited to the problem at hand.
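As a rough illustration of what "turning raw text into a data table" can mean, the following sketch builds a small lexical table (one row per document, one column per word) using only the Python standard library. The sample texts and the names `documents`, `tokenize` and `lexical_table` are invented for this example; real tools apply far more elaborate preprocessing.

```python
import re
from collections import Counter

# Hypothetical sample texts, invented for illustration.
documents = [
    "The delivery was late and the package was damaged.",
    "Late delivery again, very disappointed.",
    "Great service and fast delivery.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Count the occurrences of each word in each document.
counts = [Counter(tokenize(doc)) for doc in documents]

# The column headers are the sorted set of all words seen in the corpus.
vocabulary = sorted(set().union(*counts))

# Lexical table: one row per document, one column per word.
lexical_table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in lexical_table:
    print(row)
```

Each row can then be fed to a conventional data mining algorithm, since the text has become a plain numeric table.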
How to choose good statistical software for textual data processing?
Most data mining tools can process textual data. They help build the data tables to be analyzed, such as lexical tables or contingency tables, and they can also represent the indicators specific to textual data graphically, using data visualization.
The differences between these IT tools lie in their answers to these functional questions:
1. Is it a statistical tool capable of structuring textual data?
The tool must be able to quickly and intelligently build the necessary data tables. The interfaces used to clean the text, i.e., to select the words of interest and remove all the function words, or stop words (“there is”, “a”, “an”, “in”, etc.), must be intuitive, relevant and effective.
The lemmatization phase is very tedious. Its purpose is to reduce the vocabulary by grouping synonyms, grouping conjugated verbs under the same root, and deleting articles, linking words, etc. It must be easy to carry out with automatic grouping algorithms that take into account both syntactic similarity (words with the same root) and semantic similarity (words belonging to the same lexical field). The ability to make manual, business-specific adjustments remains essential, however, and must not be neglected.
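The cleaning and grouping steps described above can be sketched as follows. This is a deliberately crude illustration, not a real lemmatizer: the stop-word list, the suffix-stripping rule in `crude_stem` and the business dictionary `BUSINESS_GROUPS` are all assumptions made up for the example.

```python
import re

# Hypothetical stop-word list; real tools ship much larger ones.
STOP_WORDS = {"there", "is", "a", "an", "in", "the", "and", "of", "to"}

# Hypothetical business adjustments: terms an analyst decides to merge.
BUSINESS_GROUPS = {"parcel": "package", "shipment": "delivery"}

# Suffixes removed by the crude stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Tokenize, drop stop words, stem, then apply business groupings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    stems = [crude_stem(t) for t in kept]
    return [BUSINESS_GROUPS.get(s, s) for s in stems]


print(normalize("There is a parcel delivered in the rain"))
print(normalize("Delivering parcels"))
```

Suffix stripping only captures words that share a root; grouping by meaning (synonyms, lexical fields) needs a dictionary or a semantic model, which is why the manual mapping is kept as a separate step here.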
2. Is text mining software capable of handling specific phrases or groups of words?
The solution must make it possible to treat specific expressions or groups of words as entities in their own right. For example, the two words “Data” and “Science”, when placed side by side, should be considered a single entity: “Data Science”.
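One simple way to treat such expressions as single entities is to merge known word pairs after tokenization. The sketch below assumes a hand-written list of two-word expressions (`PHRASES`, hypothetical) and handles only pairs; a real tool would typically also detect frequent expressions automatically.

```python
import re

# Hypothetical list of two-word expressions to keep as single entities.
PHRASES = [("data", "science"), ("text", "mining"), ("machine", "learning")]


def merge_phrases(tokens, phrases=PHRASES):
    """Replace each known pair of adjacent tokens by one joined token."""
    phrase_set = set(phrases)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrase_set:
            merged.append("_".join(pair))
            i += 2  # skip both words of the expression
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "Data science relies on data and on text mining"
tokens = re.findall(r"[a-z]+", text.lower())
print(merge_phrases(tokens))
# ['data_science', 'relies', 'on', 'data', 'and', 'on', 'text_mining']
```

Note that the isolated word "data" stays a separate token, so the table distinguishes it from the expression "data_science".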
3. Can the text mining solution identify the contextual elements of a word?
It is important to be able to easily observe the contextual elements of a word. For example, the question “what other words are used by customers with the word dissatisfied?” should be easily addressed by a text mining tool. It is then possible to specifically identify the different reasons for customer dissatisfaction.
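A minimal way to answer this kind of question is to count which words appear in the same comments as the target word. The comments, the stop-word list and the function name `context_words` below are invented for the example; the sketch counts co-occurrence at comment level only, ignoring word distance.

```python
import re
from collections import Counter

# Hypothetical stop-word list and customer comments.
STOP_WORDS = {"i", "am", "with", "the", "was", "very"}

comments = [
    "I am dissatisfied with the delivery delay",
    "Very dissatisfied, the delivery was late",
    "Dissatisfied with the price",
    "Happy with the delivery",
]


def context_words(texts, target, stop_words=STOP_WORDS):
    """Count, per word, the number of texts where it appears with `target`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        if target in tokens:
            # Use a set so each word counts at most once per text.
            counts.update({t for t in tokens if t != target and t not in stop_words})
    return counts


for word, n in context_words(comments, "dissatisfied").most_common():
    print(word, n)
```

In this toy data, "delivery" co-occurs with "dissatisfied" in two comments, which points to delivery as a likely reason for dissatisfaction.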
4. Does it offer a range of suitable statistical analyses?
Specific statistical analyses, whether descriptive or predictive, facilitate decision-making. For example, multidimensional exploratory analysis and clustering methods can bring out, globally or locally, the essential information and the main concepts underlying a text, while machine learning algorithms can automatically organize texts according to their content.
5. What are the common applications of text mining?
Text mining makes it possible to analyze the emails sent to a company and to answer, as a first step, the question “What are the main reasons for contact?” Predictive models can then be built to automatically classify incoming emails into the contact categories that were identified. This automation makes it possible to route each request quickly to the right department or the right person, and consequently to increase customer satisfaction.
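To make the idea of automatically classifying incoming emails concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. Naive Bayes is one possible choice among many, not a method named by this article, and the labeled emails and category names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled emails; real models need far more examples.
training_emails = [
    ("My invoice is wrong, please correct the amount", "billing"),
    ("I was charged twice on my invoice", "billing"),
    ("My package has not arrived, where is my delivery", "delivery"),
    ("The delivery is late and the package is missing", "delivery"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Collect word counts per category and the number of emails per category."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training_emails)
print(classify("Where is my package?", model))
```

The predicted category could then be used to route the email to the matching department.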
A more recent challenge is detecting sensitive data contained in the free-text fields of business applications (a CRM, for example). Sensitive data relates to the racial origin, political opinions, religious beliefs, sexual orientation, health, etc. of customers, employees, partners and other individuals.
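A very basic first pass at this kind of detection is to flag free-text notes that contain terms from per-category keyword lists. The sketch below is only that: the categories, the tiny term lists in `SENSITIVE_TERMS` and the function name `flag_sensitive` are assumptions for illustration, and keyword matching alone will miss paraphrases and produce false positives.

```python
import re

# Hypothetical, deliberately tiny keyword lists per sensitive category.
SENSITIVE_TERMS = {
    "health": {"diabetes", "cancer", "depression", "pregnant"},
    "religion": {"church", "mosque", "synagogue"},
}


def flag_sensitive(note):
    """Return the sensitive terms found in a note, grouped by category."""
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    return {
        category: sorted(tokens & terms)
        for category, terms in SENSITIVE_TERMS.items()
        if tokens & terms
    }


print(flag_sensitive("Customer mentioned his diabetes treatment, call back Monday"))
print(flag_sensitive("Call back Monday about the invoice"))
```

Notes that come back with a non-empty result can be queued for human review rather than being modified automatically.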
Many other issues can be addressed with text mining, such as evaluating marketing campaigns by analyzing the responses to specific marketing actions (social networks, contact forms), managing customer relationships, and loyalty in particular, through the study of satisfaction questionnaires, or optimizing web content for search engine optimization (SEO).