Getting Your Text Data Ready for Your Natural Language Processing Journey

Tanmay Lata
Towards Data Science
6 min read · Nov 13, 2018


We are surrounded by language in all aspects of our lives. Language is the basis of our very existence, something that makes us who we are. It has enabled us to do things that would otherwise have been impossible: communicating ideas, telling epic stories like The Lord of the Rings, and even gaining a deeper insight into our inner selves.

As language is such an integral part of our lives and our society, we are naturally surrounded by a lot of text. Text is available to us in the form of books, news articles, Wikipedia articles, tweets and many other formats, and it comes from a wide range of sources.

This huge body of text is probably the largest source of data at our disposal. A lot of insight can be gained from the written word, and the information extracted from it can be put to use in a wide variety of applications. But all this data has one little flaw: text is the most unstructured form of data available to us. It is pure language, with no mathematical structure whatsoever. Sadly, our machine learning and deep learning algorithms work on numbers, not on text.

So, what do we do?

Simple! We need to clean this text and convert it into mathematical data (vectors) that we can feed to our hungry algorithms so that they can churn out some great insights for us. This is called Text Preprocessing.

Text preprocessing can be broadly classified into two major steps:

  1. Text Cleaning
  2. Text to Vector Conversion

Let’s dive into their details…

TEXT CLEANING

The text has to be as clean as possible before we convert it into vectors, otherwise we end up hogging memory (Natural Language Processing is a time- and memory-consuming process). Here are a few steps that can be followed to clean the data:

Removing the HTML tags: Most of the text data available is web scraped, so it almost always contains HTML tags (e.g. <br/>, <p>, <h1>, etc.). If we convert these into vectors, all they do is take up memory and increase processing time without providing any valuable information about the text whatsoever.
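
A minimal sketch of this step, assuming the raw review text is a plain Python string, can be a single regular expression:

```python
import re

def remove_html_tags(text):
    # Replace anything of the form <...> (e.g. <br/>, <p>, <h1>) with a space
    return re.sub(r"<[^>]+>", " ", text)

print(remove_html_tags("Great taste!<br/><br/>Would buy it again."))
```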

Removing Punctuation: Punctuation marks, like HTML tags, carry no useful information for our purposes and need to be cleaned out for the same reasons mentioned above.
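
Again, a simple regular-expression sketch will do:

```python
import re

def remove_punctuation(text):
    # Replace every character that is not a letter, digit or whitespace with a space
    return re.sub(r"[^\w\s]", " ", text)

print(remove_punctuation("Tasty, cheap & quick!!"))
```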

Removing Stopwords: Words like ‘this’, ‘there’, ‘that’, ‘is’, etc. do not provide very usable information and just create useless clutter in memory. Such words are called stopwords. It is okay to remove them, but it is advisable to be cautious while doing so, as words like ‘not’ are also considered stopwords (removing them can be dangerous for tasks like sentiment analysis).
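
A small sketch of this step, assuming NLTK’s English stopword list (downloaded once via nltk.download('stopwords')) and explicitly keeping ‘not’:

```python
from nltk.corpus import stopwords

# Keep 'not' so that negations survive (important for sentiment analysis)
stop_words = set(stopwords.words("english")) - {"not"}

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("this is not a tasty biscuit"))  # not tasty biscuit
```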

Stemming: Words like ‘tasteful’, ‘tastefully’, etc. are all variations of the word ‘tasty’. If we keep all of them in our text data, we end up creating a dimension for each of them even though they imply (more or less) the same thing. To avoid this, we can extract the root word and create a single dimension for it. The process of extracting the root word is called stemming, and the ‘Snowball Stemmer’ is one of the most advanced word stemmers out there.

A minimal sketch of this step, assuming NLTK’s SnowballStemmer, looks like this:
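
```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["tasty", "tasteful", "tastefully"]:
    print(word, "->", stemmer.stem(word))
```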

Surprisingly, the root word for tasty turns out to be tasti.

Convert everything to lower case: There is no point in having both ‘Biscuits’ and ‘biscuits’ in our data, so it’s better to convert everything to lower case. We also need to make sure that no alphanumeric tokens (words containing digits) remain; we keep only purely alphabetic words.

Putting all the above steps together, a minimal sketch of the final cleaning code (assuming NLTK is installed and the reviews are held in a plain list of strings; the names are illustrative) is shown below:
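
```python
import re

from nltk.corpus import stopwords        # requires: nltk.download("stopwords") once
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english")) - {"not"}   # keep negations

def clean_text(raw_review):
    text = re.sub(r"<[^>]+>", " ", raw_review)   # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # 2. drop punctuation and digits
    text = text.lower()                          # 3. convert to lower case
    words = [w for w in text.split()
             if w.isalpha() and w not in stop_words]    # 4. keep alphabetic non-stopwords
    return " ".join(stemmer.stem(w) for w in words)     # 5. stem every word

# Illustrative input; in the notebook the reviews come from the 'Text' column
reviews = ["This biscuit is not tasteful!<br/>Won't buy it again."]
cleaned_reviews = [clean_text(r) for r in reviews]
print(cleaned_reviews)
```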

TEXT TO VECTOR CONVERSION

Once we are done with cleaning our data, it is time to convert the cleaned text to vectors that our machine learning/deep learning algorithms can understand.

There are quite a few techniques available at our disposal to achieve this conversion. The simplest of them is Bag of Words.

Bag of Words — A Short Introduction

Bag of Words builds a dictionary of ‘d’ words (this is not a Python dictionary), where ‘d’ is the number of unique words in our text corpus. It then creates a ‘d’-dimensional vector (think of it as an array of length ‘d’) for each of our documents, where each dimension (cell) holds the number of times the corresponding word occurs in the document.

Something like this:

[Image: a Bag of Words count vector for the query “the dog is on the table”]

In the above example, the dictionary contains the words (are, cat, dog, is, on, table, the). The sentence in question (the query) has six words (the, dog, is, on, the, table). Each cell simply holds the number of times the corresponding dictionary word occurs in the query.
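
To make the counting concrete, here is a tiny hand-rolled sketch of that same example:

```python
from collections import Counter

dictionary = ["are", "cat", "dog", "is", "on", "table", "the"]
query = "the dog is on the table".split()

counts = Counter(query)                        # word -> occurrence count in the query
vector = [counts[word] for word in dictionary] # one dimension per dictionary word
print(vector)  # [0, 0, 1, 1, 1, 1, 2]
```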

In extremely high-dimensional vectors, the number of zeros will far exceed the number of non-zero values, since each vector has a dimension for every unique word in the corpus. The dimensionality can be of the order of thousands, tens of thousands or even more, but an individual document will not contain anywhere near that many unique words. Vectors in which the majority of elements are zero are called sparse vectors.

They look something like this:

[Image: a sparse vector in which most of the entries are zero]

When these sparse vectors are stacked on top of each other, we get a sparse matrix.

Something like this:

[Image: a sparse matrix in which most of the entries are zero]

In this diagram it is clearly visible that most of the values are zero. Such a sparse matrix represents our entire text corpus as an n×d matrix, where ‘n’ is the number of documents in the corpus and ‘d’ is the number of unique words in it (the dimensionality of the individual vectors).

A minimal sketch of this step, using scikit-learn’s CountVectorizer on the cleaned reviews (the variable names are illustrative), is shown below:
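
```python
from sklearn.feature_extraction.text import CountVectorizer

# cleaned_reviews is the list of preprocessed review strings from the cleaning step above
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(cleaned_reviews)

print(type(bow_matrix))   # a SciPy sparse matrix
print(bow_matrix.shape)   # (number of documents, number of unique words)
```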

The output obtained is a sparse matrix of shape (525814, 70780).

Thus for each of our documents (525814 in total, our ‘n’) we get a 70780-dimensional (‘d’) vector! That is huge!

I hope this number makes it clear why we need to clean the data before converting it into vectors: had we not cleaned it, the dimensionality would have been far higher than it already is.

The text has been converted into vectors and we are now ready to build our ML/DL models!

More on the Bag of Words approach can be found here.

Other, more complex text-to-vector conversion techniques include:

  1. TF-IDF
  2. Word2Vec

The complete Jupyter Notebook for this project can be found here.

The preprocessing steps shown here were executed on the Amazon Fine Food Reviews dataset available on Kaggle. The ‘Text’ feature on which all the steps were performed contains the reviews (treated as documents) for different food products.

That’s it for now! Thanks folks for reading this far!

BBYE!!
