How to Analyze Social Network Data – 5 Steps to Follow
Millions of us feed social networks with our personal data. The following five steps offer some advice on turning these volumes of data into knowledge and identifying trends to power digital applications.
1. Access data from major social networks
The data of the major social networks (Twitter, Facebook and Instagram) can be exploited through the programming interfaces (APIs) made available to developers. Beyond the technical aspects, recent privacy regulations have made this landscape more complicated.
- Twitter gives access to several versions of its API. The first, public and free, can be used in two modes: “search” and “stream”. Search mode performs a non-exhaustive search of tweets published in the last seven days, within a limit of 450 requests per 15-minute window. Stream mode retrieves tweets in real time. There is also a premium version that gives access to all tweets since 2006, and an enterprise version that adds dedicated technical support. With the free version, you can query by account or hashtag using access tokens and retrieve all the information attached to a tweet: content, author, date, description, geolocation (if present) and the count of each type of interaction.
- Facebook exposes its social graph, which references users and their interactions. Queries also use a token, which is obtained once the Facebook account has been verified. Although we have mined Facebook data successfully several times, recent changes to the terms of use have tightened control over account identity and removed features such as search by theme. As a result, the only way to collect data from pages about a topic is to target those pages specifically, after obtaining Facebook’s approval for the application. It is then possible to retrieve all the information from public pages and their posts, with the exception of personal information.
- The Instagram API has been almost inaccessible in recent years. The obstacle is an Instagram policy requiring that the developed application be of proven use to the platform's users. However, we found an official URL that gives access to the top 50 posts associated with a tag (keyword), though without user identification or follower counts.
Although it is still possible to obtain data, access is increasingly limited, in line with recent policy choices (GDPR).
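To make the request limit mentioned above concrete, here is a minimal sketch of a client-side throttle that keeps a collector under 450 requests per 15-minute window. The class and method names are hypothetical, and it assumes a sliding window, which is a conservative approximation of however the platform actually counts requests. It does not call any real API.

```python
import time
from collections import deque


class WindowRateLimiter:
    """Allow at most `max_requests` calls per sliding window (hypothetical helper)."""

    def __init__(self, max_requests=450, window_seconds=15 * 60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable, so the limiter can be tested without waiting
        self.timestamps = deque()  # send times of the requests still inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self.clock()
        # Forget requests that have already left the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            return 0.0
        # The oldest request must leave the window before a new one fits.
        return self.window_seconds - (now - self.timestamps[0])

    def record(self):
        """Register that a request has just been sent."""
        self.timestamps.append(self.clock())


def throttled_call(limiter, send_request):
    """Sleep if needed, then run `send_request` (any zero-argument callable)."""
    delay = limiter.wait_time()
    if delay > 0:
        time.sleep(delay)
    limiter.record()
    return send_request()
```

In practice, `send_request` would wrap whatever HTTP call your collector makes; the limiter itself only tracks timestamps.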
2. Filter and update the data
Once data access is set up, you must restrict the scope to the topic you want to analyze, and therefore apply a filter. Ensuring that the content you analyze is relevant actually takes a lot of work: you have to find a way, within the constraints of the APIs, to filter the data effectively at collection time.
Twitter and Instagram essentially work with keywords (hashtags) and users. You should therefore plan time for a preliminary analysis of the data to identify the themes and populations of interest. Because Facebook does not allow queries by subject, requests there require a priori knowledge of the pages of interest. Compared with Twitter and Instagram, a biased selection of pages can therefore lead to an incomplete trend analysis.
One of the advantages of social networks is that data can be filtered by the number of interactions on a given theme, so it is important to have up-to-date interaction counts. When posts are recorded through the APIs, they are retrieved at a given moment, with the number of “likes”, “shares”, etc. they had at the time of collection. The Twitter API's “stream” mode retrieves tweets in real time, but the interaction counts must then be updated continuously. A way around this problem in “search” mode is to retrieve only posts that are at least a certain number of days old (a threshold chosen by studying the average lifetime of a post on each network), so that most of the interactions a post will receive have already occurred at the time of collection.
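The age-based workaround can be sketched as a simple filter over already-collected posts. This is an illustration under assumptions: the field names `created_at`, `likes` and `shares` are hypothetical and do not correspond to any particular API response, and the thresholds are placeholders to be set from your own study of post lifetimes.

```python
from datetime import datetime, timedelta, timezone


def settled_posts(posts, min_age_days, min_interactions=0, now=None):
    """Keep posts old enough for their interaction counts to have stabilised.

    `posts` is a list of dicts with hypothetical keys:
    'created_at' (timezone-aware datetime), 'likes' and 'shares' (ints).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    kept = []
    for post in posts:
        if post["created_at"] > cutoff:
            continue  # too recent: its counts are probably still changing
        total = post["likes"] + post["shares"]
        if total >= min_interactions:
            kept.append(post)
    return kept
```

With the seven-day search window described earlier, `min_age_days` would have to stay below seven for the Twitter search mode, otherwise no post could ever qualify.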
3. Be aware of the populations analyzed
When using data from social networks, we must always be aware of the population we are considering (unless we specifically target the social media population). Relying on social networks for an analysis introduces a fairly strong bias, which must be taken into account when this type of data is used for a study that targets the general population. The audience varies according to the network and the subjects studied: one generally finds a young population, particularly on Instagram and a little less so on Facebook and Twitter. It may therefore be more difficult to get an exhaustive view of the behaviours of some age groups. Despite their wealth of data, social networks are not always a relevant source.
4. Analyze the data
Social network data is essentially of two types: text and images. Text can rarely be analyzed directly without being cleaned first, because users generally pay little attention to the way they write. Regular expressions make it possible to identify typical structures (URLs, brands, accounts, noun phrases, emojis, etc.) and to keep or exclude them as needed. However, if advanced grammatical analysis is planned, the sentence structure of the text must be preserved as much as possible so that the grammatical function of each word can be identified correctly. Punctuation that delimits sentences is therefore important in this case. Cleaning the text of social network posts is thus usually a substantial piece of work, and a prerequisite for a relevant analysis.
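As an illustration of this clean-up step, here is a minimal sketch using Python's `re` module. The patterns are deliberately rough assumptions (the emoji ranges in particular are not exhaustive), and the function name and return format are hypothetical. URLs and emojis are removed, while mentions and hashtags lose only their marker so that the sentence structure and punctuation survive for later grammatical analysis.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage only: common pictograph and symbol ranges.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(text):
    """Extract typical structures from a post and return a cleaned text."""
    urls = URL_RE.findall(text)
    without_urls = URL_RE.sub(" ", text)
    # Extract after URL removal so '#' or '@' inside a URL is not miscounted.
    mentions = MENTION_RE.findall(without_urls)
    hashtags = HASHTAG_RE.findall(without_urls)

    cleaned = EMOJI_RE.sub(" ", without_urls)
    # Keep the word but drop the '@' or '#' marker to preserve the sentence.
    cleaned = MENTION_RE.sub(lambda m: m.group()[1:], cleaned)
    cleaned = HASHTAG_RE.sub(lambda m: m.group()[1:], cleaned)
    cleaned = WHITESPACE_RE.sub(" ", cleaned).strip()

    return {"text": cleaned, "urls": urls, "mentions": mentions, "hashtags": hashtags}
```

Whether to keep, strip or drop each structure depends on the analysis: a trend count may only need the `hashtags` list, whereas a grammatical parser needs the cleaned `text` with its punctuation intact.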
5. Ensure the viability of the analysis
Building an application that takes social network data as its source implies a strong reliance on the APIs that are made available. If one day Facebook, Instagram or Twitter decides to restrict or completely close access, the application can no longer be updated, or at least not without further development to adapt to the changes. For example, the change to the Facebook API that removed searches by subject may require new development to replace the obsolete functionality and introduce per-page collection on a specific topic. It is also possible that the social networks used today will no longer be the preferred source of the content sought in the future. In the context of an application put into production,
Exploiting the ocean of data on the opinions and habits of the population that circulates within social networks represents a real opportunity: it is a voluminous source of free and easily accessible data. Although the representativeness and viability of applications based on social networks must currently be checked, these networks remain a preferred source for trend studies. Indeed, since 15-34 year-olds are the group most interested in social media, it is worthwhile to continue exploiting their data, which will increasingly represent the behaviour and opinions of the general population.