Artificial Intelligence

TF-IDF: summarising news with python

T. Fuertes

02/03/2022


Every day, a flood of news stories helps us understand how the market may behave. As human beings, we cannot possibly read every piece of news or comment, so what if we used machine learning, in particular the tf-idf algorithm, to extract the most important information from all the texts written on the net?

This post is a starting point for using the well-known tf-idf algorithm to extract key information from the net. The purpose is to encourage you to take advantage of it, whether in investment strategies or in any other application.

TF-IDF

The term TF-IDF stands for Term Frequency – Inverse Document Frequency. The goal of this technique is to measure how important a word is within a document, starting from how many times it appears there.

The first part, TF, counts the number of times each word appears. On its own, however, this measure misbehaves for very common words such as the article “a”. Words of this kind are called stopwords: articles, pronouns, auxiliaries and so on. They allow you to build an understandable sentence, but they carry little meaning on their own. Since they are used everywhere, they end up with a high TF.

With that in mind, the technique corrects this distortion with the IDF term. This factor adjusts the TF according to how many documents in the collection use the word. In that sense, if a word appears in a lot of texts, its weight is penalised and its final score is lower.
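
In the textbook formulation (sklearn’s TfidfVectorizer applies a smoothed variant of the same idea), the score of a term t in a document d is

tf-idf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the number of times t appears in d, N is the total number of documents, and df(t) is the number of documents containing t. The logarithm shrinks towards zero for words that appear in almost every document, which is exactly the penalty described above.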

As an example of the result of tf-idf, let’s take these sentences: “I’m traveling to Paris this week”, “There will be so many journeys to Paris next week”, “Paris is supposed to receive a lot of tourists”. Assuming we’ve removed the stopwords, the term frequency of “Paris” in each sentence is 1 divided by the number of remaining words; at the same time, “Paris” stands out across the collection as the common topic, because it appears in all three sentences.
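
To see these numbers concretely, here is a minimal sketch that runs sklearn’s TfidfVectorizer over the three sentences (the built-in English stopword list stands in for the stopword removal described above):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I'm traveling to Paris this week",
    "There will be so many journeys to Paris next week",
    "Paris is supposed to receive a lot of tourists",
]

# stop_words='english' removes common function words before scoring
vectorizer = TfidfVectorizer(stop_words='english')
tf_idf_matrix = vectorizer.fit_transform(sentences)

# idf weight of each remaining term
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term}: {idf:.2f}")

Because “paris” appears in every sentence, it receives the lowest idf weight (the correction described above), while sentence-specific words such as “journeys” or “tourists” are weighted higher; it is precisely its presence in all three sentences that marks it out as the common topic of the collection.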

To sum up, this technique gives us a measure of how important a word is in a text.

Practical case

Let’s see a very simple case of using TF-IDF to identify the main topics over time.

Dataset

We take a set of news stories and opinion pieces in English from 2009 to 2020. Each record contains the date, the title, the body of the text and the stock it refers to. In this post we’ll only use the date and the title, because the goal is to extract the main topics.

Preprocessing

When you deal with texts, it is important to preprocess them to avoid feeding noise into your model; that is, the texts should be clean and in a consistent format. We therefore processed our original texts with the nltk library in python, in the following way:

  • All words in lower case.
  • Remove numbers and initials.
  • Delete stopwords.
  • Remove punctuation.
  • Lemmatisation. This means reducing each word to its lemma, the dictionary base form. The “pos” argument tells the lemmatiser whether to treat the word as an adjective, a verb or a noun.
  • Erase common words like “wall street”, “market”, “stock”, “share”, …
import re
import nltk
from nltk.stem import WordNetLemmatizer

# Requires the nltk data packages: nltk.download('stopwords') and nltk.download('wordnet')
stop_words = set(nltk.corpus.stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Keep letters only (this also drops numbers and punctuation), then lower-case and tokenise
processed_text = re.sub('[^a-zA-Z]', ' ', original_text)
processed_text = processed_text.lower()
processed_text = processed_text.split()

# Drop stopwords, then lemmatise each token as adjective, verb and noun in turn
processed_text = [word for word in processed_text if word not in stop_words]
processed_text = [lemmatizer.lemmatize(word, pos='a') for word in processed_text]
processed_text = [lemmatizer.lemmatize(word, pos='v') for word in processed_text]
processed_text = [lemmatizer.lemmatize(word, pos='n') for word in processed_text]
processed_text = ' '.join(processed_text)

# Erase domain-specific common words such as "stock"
processed_text = re.sub(r'\b(stock|market|share)\b', '', processed_text)
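
Wrapped up as a function, the steps above produce the list of cleaned titles that the model in the next section consumes (a sketch; preprocess and titles are illustrative names, not from the original post):

def preprocess(original_text):
    # Apply the cleaning steps above to a single title
    words = re.sub('[^a-zA-Z]', ' ', original_text).lower().split()
    words = [word for word in words if word not in stop_words]
    for pos in ('a', 'v', 'n'):  # adjective, verb, noun
        words = [lemmatizer.lemmatize(word, pos=pos) for word in words]
    return re.sub(r'\b(stock|market|share)\b', '', ' '.join(words))

# titles holds the raw headlines of the dataset (illustrative name)
preprocessed_texts = [preprocess(title) for title in titles]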

Model

The next step is to build the model that meets our purpose, using the sklearn library in python. As we want to extract the key information for every date, we use the TfidfVectorizer model. It lets us cap the vocabulary size, which we set to a maximum of 8 features.

from sklearn.feature_extraction.text import TfidfVectorizer

# preprocessed_texts is the list of cleaned titles built in the previous step
tf_idf_model = TfidfVectorizer(max_features=8)
processed_text_tf = tf_idf_model.fit_transform(preprocessed_texts)  # documents x terms matrix of tf-idf scores
tf_idf_values = tf_idf_model.idf_                    # idf weight of each selected term
tf_idf_names = tf_idf_model.get_feature_names_out()  # the 8 selected terms

Next, we apply this model day by day, over all the pieces of news in the dataset for each date. Notice that the model doesn’t have to be trained beforehand, because it is a deterministic algorithm: it only counts frequencies over the documents it is given. In this way we obtain a group of 8 words that represent the key information of the day, and each word carries a tf-idf value indicating how important it is on that day.
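
As a minimal sketch of that daily loop, assuming the dataset lives in a pandas DataFrame called news with a date column and the processed titles (both names are illustrative, not from the original dataset):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# news: pandas DataFrame with columns 'date' and 'processed_title' (illustrative names)
daily_topics = {}
for date, day_news in news.groupby('date'):
    model = TfidfVectorizer(max_features=8)
    model.fit_transform(day_news['processed_title'])
    # Keep the 8 key words of the day together with their idf weights
    daily_topics[date] = dict(zip(model.get_feature_names_out(), model.idf_))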

Results

Finally, the results are summarised in the following charts. They show the relevant words selected by the model, with the related pieces of news in boxes. For clarity, the first chart covers the first half of 2019, and the second chart the second half of 2019.

Relevant topics and related pieces of news from January 2019 to June 2019.
Relevant topics and related pieces of news from July 2019 to January 2020.

What else?

The model does seem to pick out the most important words from a great number of news items. The challenge, however, is to take advantage of this information and develop a strategy that performs well and makes money.
