Introduction to NLP: Sentiment analysis and Wordclouds

Juan Ruiz Arnal



I think one of the most interesting areas in data analysis is Natural Language Processing (NLP). In recent years this discipline has grown exponentially, and it is now a huge field with many problems we can attempt to solve, such as text classification, translation, or text generation.

In this post, I will show one of the simplest ways to approach text processing. I'm going to focus on a particular kind of text classification: sentiment analysis. It is a 3-class classification based on the tone of the text, which can be "positive", "neutral", or "negative". Through an example you will learn two simple techniques for getting insights from texts: the Natural Language Toolkit (NLTK) and WordCloud.

The Data Set

We will use Google Play Store data, available on Kaggle. I chose this data set because it already contains its own sentiment labels, so we can compare our predictions against them. You can also combine sentiment analysis with other features that I will not use here, such as rating, and check whether the relations you would expect actually appear.

The data set is composed of two CSV files. One contains mostly numerical data, such as the number of installs, the rating, and the size, but also some non-numerical data like category or type. The other contains up to 100 text reviews per app; this second file is the one we are going to analyse.

First of all, I import the basic libraries we need. I will import the rest of the libraries just before using them.

# allows us to load files directly from a .zip archive
import zipfile
import pandas as pd

# as we are working with long texts, widen the column display
# so the reviews are shown in full
pd.set_option('display.max_colwidth', None)  # use -1 in older pandas versions

import matplotlib.pyplot as plt

1. Sentiment Analysis

For this purpose, we will use the Natural Language Toolkit (NLTK), more specifically a tool named VADER, which analyses a given text and returns a dictionary with four keys. Three of them describe the fraction of weighted scores that fall into each category: 'neg', 'neu', and 'pos' for negative, neutral, and positive respectively.

The last key is called compound and, as its name suggests, it is a combination of the other three, normalised to the range [-1, 1].

To use VADER we first need to download extra data for NLTK: several of its tools require a second download step to fetch the resources (often lexicons) they need to work correctly.

import nltk

Now we define an auxiliary function that we will use to keep the code clean and readable. It simply transforms the compound score into one of the labels 'Negative', 'Neutral', or 'Positive', depending on a threshold. Given that the score is 1 for pure positive sentiment, 0 for pure neutral, and -1 for pure negative, I chose the threshold to be 0.33, but feel free to change it and see how it affects the results.

# now, we import the relevant module from the NLTK library
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# initialize VADER once; it can be reused for every review
sid = SentimentIntensityAnalyzer()

def classify_compound(text, threshold=0.33):
    # polarity_scores returns a dictionary with negative, neutral,
    # positive, and compound scores for the input text
    scores = sid.polarity_scores(text)
    # keep only the compound score
    score = scores['compound']
    # translate the score into the corresponding label according to the threshold
    if score <= -threshold: return 'Negative'
    elif score >= threshold: return 'Positive'
    else: return 'Neutral'
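The thresholding step can be sanity-checked without running VADER at all, by feeding compound scores to it directly. A small standalone sketch (`label_from_compound` is a hypothetical helper that mirrors the decision rule above):

```python
def label_from_compound(score, threshold=0.33):
    # same decision rule as classify_compound, but on a raw compound score
    if score <= -threshold: return 'Negative'
    elif score >= threshold: return 'Positive'
    else: return 'Neutral'

print(label_from_compound(0.8))   # 'Positive'
print(label_from_compound(-0.5))  # 'Negative'
print(label_from_compound(0.1))   # 'Neutral'
```

Note that the boundaries are inclusive: a score of exactly 0.33 is classified as 'Positive'.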

Then we can finally load the data and start. As we are focusing the analysis on the text reviews, we will drop all rows without data in that field and keep only the features we need. We also create a new feature with the predicted sentiment, using the function above.

# open the zip file (path truncated here; point it at the downloaded Kaggle archive)
zip_file = zipfile.ZipFile(r'./')

# load the numerical data .csv
num_data = pd.read_csv('googleplaystore.csv')).drop_duplicates()

# load the text data .csv with the reviews and apply the column restrictions;
# also drop duplicates and any row with NaN in the column Translated_Review
valid_text_columns = ['App', 'Translated_Review', 'Sentiment']
text_data = pd.read_csv('googleplaystore_user_reviews.csv')).drop_duplicates().dropna(subset=['Translated_Review'])[valid_text_columns]

# create a new feature based on the compound score from VADER using our function "classify_compound"
text_data['compound_sentiment'] = text_data.Translated_Review.apply(classify_compound)

# merge both to have all features available at the same time
df = pd.merge(num_data, text_data, how='inner', on='App')

# visualize a random row to see all features together
df.sample(1)
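The effect of the inner merge is worth pausing on: only apps that appear in both tables survive. A toy example with illustrative data (not from the real files):

```python
import pandas as pd

apps = pd.DataFrame({'App': ['A', 'B'], 'Rating': [4.5, 3.9]})
reviews = pd.DataFrame({'App': ['A', 'A', 'C'],
                        'Translated_Review': ['Good', 'Bad', 'Ok']})

merged = pd.merge(apps, reviews, how='inner', on='App')
print(merged)
# only the two rows for 'A' remain: 'B' has no reviews and 'C' has no metadata
```

Note also that each app's metadata is repeated once per review, so the merged frame has one row per review, not per app.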

2. Quick Stats: Original vs VADER

For this part, we just need some basic pandas. Of course, you can extend this exploration as much as you want until you understand the data completely. But for now, we are going to focus on the differences between the two predictions: the original one and the one we made with VADER.

First, we count the number of predictions of each class for each metric. According to the results, we can guess that positive sentiment is easier to predict than the other two.

After that, we keep only the rows where the original sentiment and the compound sentiment differ.

pd.concat([text_data.Sentiment.value_counts(), text_data.compound_sentiment.value_counts()], axis=1, sort=True)

mask_differents = text_data.Sentiment != text_data.compound_sentiment
differences = text_data[mask_differents]
print('There are {:.2%} different values between Sentiment and Compound Sentiment'.format(differences.shape[0] / text_data.shape[0]))
print('Showing 5 random rows with different original sentiment and compound sentiment:')
differences.sample(5)
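The disagreement-rate computation is easy to verify on a toy frame where we can count the mismatches by eye (illustrative labels, not from the real data set):

```python
import pandas as pd

toy = pd.DataFrame({'Sentiment':          ['Positive', 'Negative', 'Neutral', 'Positive'],
                    'compound_sentiment': ['Positive', 'Neutral',  'Neutral', 'Negative']})

# element-wise comparison gives a boolean mask of disagreements
mask = toy.Sentiment != toy.compound_sentiment
rate = mask.sum() / len(toy)
print('{:.2%} disagreement'.format(rate))  # 50.00% on this toy data
```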

We can see above that neither prediction works perfectly, and we can't inspect every differing row by hand. If we had to bet on our predictions, we should probably try to improve them by tokenizing the words, using a more complex model like BERT, or maybe just removing the disagreeing rows.

At this point, you should check any feature that you think might help you understand the data set better. For example, you can check whether the predictions are biased by Category, using groupby and concatenating the results. Do you need a little push to move forward? Here is a quick way to start:

pd.concat([df.groupby(['Category', 'Sentiment']).count()[['App']].rename(columns={'App': 'Original'}),
           df.groupby(['Category', 'compound_sentiment']).count()[['App']].rename(columns={'App': 'Compound'})], axis=1)

3. Words Comparison

Using the WordCloud package we can generate an image showing the most representative (actually, the most common) words in a chosen set of reviews.

First, we define the words to be excluded. I use the stop-word set that already ships with WordCloud and add some extra words to it. After that, we generate an image with the 100 most repeated words that have at least 5 letters.
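Under the hood, a word cloud is essentially a word-frequency count with the stop words removed. A dependency-free sketch of that counting step, with a tiny illustrative stop-word list:

```python
from collections import Counter

stopwords = {'the', 'is', 'a', 'and', 'it', 'this'}
text = "the app is great and the game is great fun"

# keep only non-stop words, lower-cased
words = [w for w in text.lower().split() if w not in stopwords]
print(Counter(words).most_common(2))  # [('great', 2), ('app', 1)]
```

WordCloud does the same kind of counting internally and then scales each word's font size by its frequency.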

# import all necessary classes from the wordcloud library
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# get the default stop words from the wordcloud library
stopwords = set(STOPWORDS)

# add some extra words ad hoc for our purpose
app_words = ['app', 'apps', 'application', 'game']
stopwords.update(app_words)

# join all reviews into a single string
text = " ".join(review for review in text_data.Translated_Review)

# generate the image
wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=100, min_word_length=5).generate(text)

# visualize the image
# visualize the image
fig = plt.figure(figsize=(15, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Total Reviews Word Cloud')

To finish our experiment we plot the word clouds for all sentiments and both metrics. We also distinguish two extra cases: rows where the original sentiment and the compound sentiment agree (labelled EQ) and rows where they don't (DIF).

fig = plt.figure(figsize=(12, 10))
equal = text_data[~mask_differents]
wc_params = dict(stopwords=stopwords, background_color="white", max_words=50, min_word_length=5)
for i, sentiment in enumerate(['Positive', 'Neutral', 'Negative']):
    # each sentiment fills one row of the 4x4 grid
    row = i * 4
    # join the reviews for each of the four cases
    original_text = " ".join(equal[equal.Sentiment == sentiment].Translated_Review)
    compound_text = " ".join(equal[equal.compound_sentiment == sentiment].Translated_Review)
    original_text_dif = " ".join(differences[differences.Sentiment == sentiment].Translated_Review)
    compound_text_dif = " ".join(differences[differences.compound_sentiment == sentiment].Translated_Review)
    original_wc = WordCloud(**wc_params).generate(original_text)
    compound_wc = WordCloud(**wc_params).generate(compound_text)
    original_wc_dif = WordCloud(**wc_params).generate(original_text_dif)
    compound_wc_dif = WordCloud(**wc_params).generate(compound_text_dif)
    fig.add_subplot(4, 4, row + 1)
    plt.imshow(original_wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('EQ Original ' + sentiment)
    fig.add_subplot(4, 4, row + 2)
    plt.imshow(compound_wc, interpolation='bilinear')
    plt.axis('off')
    plt.title('EQ Compound ' + sentiment)

    fig.add_subplot(4, 4, row + 3)
    plt.imshow(original_wc_dif, interpolation='bilinear')
    plt.axis('off')
    plt.title('DIF Original ' + sentiment)
    fig.add_subplot(4, 4, row + 4)
    plt.imshow(compound_wc_dif, interpolation='bilinear')
    plt.axis('off')
    plt.title('DIF Compound ' + sentiment)

As we can see, each category contains pretty much the same words with almost identical relevance. The good news is that there are clear patterns for all the different sentiments. The bad news is that we don't know which words are driving the disagreements; we can guess they are among the less common ones.

Even though the words are very similar, we can see some differences in the importance patterns. But that's not crucial: all I wanted was to share some easy tools so you can take the first steps on your own and reach your own conclusions, so get into it and don't stop! Here are a couple of questions in case you want to go deeper with this data set:

  • Is there any threshold for which the original and compound sentiment predictions coincide?
  • According to our compound score, which apps have better reviews: the free ones or the paid ones?