The definitive guide to Text Preprocessing for Deep Learning
Recently, I started on an NLP competition on Kaggle called the Quora Question Insincerity challenge. It is a text classification challenge, and as the problem became clearer through working on the competition and going through the invaluable kernels put up by the Kaggle experts, I thought of sharing the knowledge.
Since we have a large amount of material to cover, I am splitting this post into a series of posts. The first post, i.e. this one, covers preprocessing techniques that work with Deep Learning models, and we will also talk about increasing embeddings coverage. In the second post, I will try to take you through some basic conventional models like TFIDF and SVM that have been used in text classification and try to assess their performance to create a baseline. We will delve deeper into Deep Learning models in the third post, which will focus on different architectures for solving the text classification problem. In the fourth post in the series, we will try various other models which we were not able to use in this competition, like ULMFit transfer learning approaches.
As a side note: if you want to know more about NLP, I would like to recommend this awesome course on Natural Language Processing in the Advanced machine learning specialization. You can start for free with the 7-day Free Trial. This course covers a wide range of tasks in Natural Language Processing from basic to advanced: sentiment analysis, summarization, dialogue state tracking, to name a few.
It might take me a little time to write the whole series. Till then you can take a look at my other posts: What Kagglers are using for Text Classification, which talks about various deep learning models in use in NLP and how to switch from Keras to Pytorch.
So let me start by explaining a little more about the text classification problem. Text classification is a common task in natural language processing that maps a piece of text of arbitrary length to one of a set of categories. How could you use that?
- To find the sentiment of a review.
- Find toxic comments on a platform like Facebook
- Find Insincere questions on Quora. A current ongoing competition on kaggle
- Find fake reviews on websites
- Will a text advert get clicked or not?
Now each of these problems has something in common. From a Machine Learning perspective, these are essentially the same problem with just the target labels changing and nothing else. With that said, the addition of business knowledge can help make these models more robust, and that is what we want to incorporate while preprocessing the data for text classification. While the preprocessing pipeline I am focusing on in this post is mainly centered around Deep Learning, most of it is also applicable to conventional machine learning models.
But let me first go through the flow of a deep learning pipeline for text data before going through all the steps to get a higher level perspective about the whole process.
We normally start with cleaning up the text data and performing basic EDA. Here we try to improve our data quality by cleaning up the data. We also try to improve the coverage of our word2vec embeddings by reducing the number of OOV (Out-of-Vocabulary) words. These first two steps normally don’t have a strict order between them, and I generally go back and forth between them. Next, we create a representation for text that can be fed into a deep learning model. We then start with creating our models and training them. Finally, we evaluate the models using appropriate metrics and get approval from the respective stakeholders to deploy our models. Don’t worry if these terms don’t make much sense now. I will try to explain them through the course of this article.
Here at this juncture, let us take a little detour to talk about word embeddings. We will have to think about them while preprocessing data for our Deep Learning models.
A Primer on word2vec embeddings:
We need to have a way to represent words in a vocab. One way to do that could be to use One hot encoding of word vectors but that is not really a good choice. One of the major reasons is that the one-hot word vectors cannot accurately express the similarity between different words, such as the cosine similarity.
Given the structure of one hot encoded vectors, the similarity is always going to come as 0 between different words. Another reason is that as the size of vocabulary increases these one hot encoded vectors become very large.
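To make the similarity point concrete, here is a minimal dependency-free sketch. The three-word vocabulary and the helper names `one_hot` and `cosine_similarity` are my own for illustration; any two distinct one-hot vectors behave the same way.

```python
import math

def one_hot(index, vocab_size):
    """Return a one-hot vector (as a list) with 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vocabulary: each word gets its own axis, so vector length == vocab size.
vocab = ["good", "great", "terrible"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Different words share no non-zero position, so the dot product is zero.
sim_good_great = cosine_similarity(vectors["good"], vectors["great"])
sim_good_good = cosine_similarity(vectors["good"], vectors["good"])
```

Here "good" is exactly as far from "great" as it is from "terrible", which is the limitation dense embeddings are meant to remove.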
Word2Vec overcomes the above difficulties by providing us with a fixed length vector representation of words and by capturing the similarity and analogy relationships between different words.
Word2vec vectors of words are learned in such a way that they allow us to learn different analogies. It enables us to do algebraic manipulations on words which were not possible before. For example: What is king - man + woman? It comes out to be queen.
Word2Vec vectors also help us to find out the similarity between words. If we try to find similar words to “good”, we will find awesome, great etc. It is this property of word2vec that makes it invaluable for text classification. Now our deep learning network understands that “good” and “great” are essentially words with similar meaning.
In very simple terms, word2vec creates vectors for words, so we have a d-dimensional vector for every word (and for common bigrams too) in a dictionary. We normally use pretrained word vectors which are provided to us by others after training on large corpora of text like Wikipedia, Twitter etc. The most commonly used pretrained word vectors are GloVe and fastText with 300-dimensional word vectors. We are going to use GloVe in this post.
Basic Preprocessing Techniques for text data:
In most cases, we observe that text data is not entirely clean. Data coming from different sources has different characteristics, and that makes text preprocessing one of the most important steps in the classification pipeline. For example, text data from Twitter is totally different from text data on Quora or on some news/blogging platform, and thus needs to be treated differently. Helpfully, the techniques I am going to talk about in this post are generic enough for most kinds of data you might encounter in the jungles of NLP.
a) Cleaning Special Characters and Removing Punctuations:
Our preprocessing pipeline depends a lot on the word2vec embeddings we are going to use for our classification task. In principle, our preprocessing should match the preprocessing that was used before training the word embedding. Since most of the embeddings don’t provide vector values for punctuation and other special characters, the first thing you want to do is get rid of the special characters in your text data. We keep a list of the special characters that were present in the Quora Question data and use the replace function to get rid of them.
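A minimal sketch of this replace-based approach. The `PUNCTS` list here is a short illustrative subset I picked, not the exact list for the Quora data, and `clean_text` is just a name I chose. I replace each character with a space so that neighbouring words do not get glued together.

```python
# Illustrative subset only; a real list for a dataset is usually much longer.
PUNCTS = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', '$', '&',
          '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•', '~', '@', '£']

def clean_text(text):
    """Replace every listed special character with a space."""
    text = str(text)
    for punct in PUNCTS:
        if punct in text:
            text = text.replace(punct, ' ')
    # Collapse the runs of spaces left behind by the replacements.
    return ' '.join(text.split())

cleaned = clean_text("Why is this (really) so hard?!")
```

The apostrophe is deliberately left out of the list in this sketch so that contractions like "don't" survive until they are expanded in a later step.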
This could also have been done with the help of a simple regex. But I normally like the above way of doing things as it helps to understand the sort of characters we are removing from our data.
b) Cleaning Numbers:
Why do we want to replace numbers with #s? Because most embeddings have preprocessed their text like this.
Small Python trick: we use an if statement in the code for this step to check beforehand whether a text contains any number at all. One cheap check is faster than running several re.sub commands, and most of our text doesn’t contain numbers.
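Here is a sketch of what this can look like. The exact placeholder widths (`##` up to `#####`) are an assumption on my part about how the embedding vocabulary encodes numbers, so check a few tokens in your own embedding file before relying on them.

```python
import re

def clean_numbers(text):
    """Replace runs of digits with '#' placeholders of matching width (capped at 5)."""
    # Cheap pre-check: skip all substitutions when there is no digit at all.
    if re.search(r'\d', text):
        # Longest patterns first, so shorter ones only see what is left over.
        text = re.sub(r'[0-9]{5,}', '#####', text)
        text = re.sub(r'[0-9]{4}', '####', text)
        text = re.sub(r'[0-9]{3}', '###', text)
        text = re.sub(r'[0-9]{2}', '##', text)
    return text

example = clean_numbers("I bought 123 shares in 1999")
```

Single digits are left untouched in this sketch.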
c) Removing Misspells:
It always helps to find the misspells in the data. As misspelled words are not present in the word2vec vocabulary, we should replace them with their correct spellings to get better embedding coverage. The code for this step is an adaptation of Peter Norvig’s spell checker. It uses the word2vec ordering of words to approximate word probabilities, as Google word2vec apparently orders words in decreasing order of frequency in the training corpus. You can use this to find some of the misspelled words in the data you have.
Once we are through with finding the misspelled words, the next step is to replace them using a misspell mapping and regex functions.
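The idea can be sketched as follows. This is a simplified take on Norvig's approach and makes two assumptions: it only looks at candidates one edit away, and `words_by_frequency` is a toy list standing in for the embedding vocabulary in its stored (most frequent first) order.

```python
import string

def build_rank(words_by_frequency):
    """Map each word to its position; a lower rank means a more frequent word."""
    return {word: i for i, word in enumerate(words_by_frequency)}

def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word, rank):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in rank:
        return word
    candidates = [w for w in edits1(word) if w in rank]
    if not candidates:
        return word
    return min(candidates, key=rank.get)

# Toy stand-in for a frequency-ordered embedding vocabulary.
rank = build_rank(["the", "good", "great", "question"])
fixed = [correction(w, rank) for w in ["teh", "goood", "qestion", "xyzzy"]]
```

Running `correction` over the out-of-vocabulary words of a corpus gives a candidate list of misspells to review by hand before building a mapping.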
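A minimal sketch of the replacement step. The entries in `MISSPELL_MAP` are hypothetical examples of mine; in practice the mapping comes from inspecting your own out-of-vocabulary words.

```python
import re

# Hypothetical mapping for illustration only.
MISSPELL_MAP = {
    "recieve": "receive",
    "definately": "definitely",
    "qoura": "quora",
}

def build_replacer(mapping):
    """Compile one regex for all keys and return a function that applies it."""
    # \b on both sides so that only whole words are replaced.
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(key) for key in mapping) + r')\b'
    )

    def replace(text):
        return pattern.sub(lambda match: mapping[match.group(0)], text)

    return replace

replace_misspells = build_replacer(MISSPELL_MAP)
corrected = replace_misspells("i definately recieve answers on qoura")
```

Compiling a single alternation once means each text is scanned one time, instead of once per entry in the mapping.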
d) Removing Contractions:
Contractions are shortened forms that we write with an apostrophe. Examples of contractions are words like “ain’t” or “aren’t”. Since we want to standardize our text, it makes sense to expand these contractions. We do this using a contraction mapping and regex functions.
Apart from the above techniques, there are other text preprocessing techniques like Stemming, Lemmatization and Stopword Removal. Since these techniques are generally not used along with Deep Learning NLP models, we won’t talk about them.
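A short sketch of contraction expansion. The `CONTRACTION_MAP` entries are a tiny illustrative subset, and normalizing the curly apostrophe first is my own addition, since user-generated text tends to mix both forms.

```python
import re

# Tiny illustrative subset; a real mapping has a hundred or more entries.
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "won't": "will not",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand known contractions; the expansion is always lowercase."""
    # Treat the typographic apostrophe the same as the plain one.
    text = text.replace("’", "'")
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTION_MAP[match.group(0).lower()], text
    )

expanded = expand_contractions("They aren't here and I can’t wait")
```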
Representation: Sequence Creation
One of the things that has made Deep Learning the go-to choice for NLP is the fact that we don’t really have to hand-engineer features from the text data. The deep learning algorithms take a sequence of text as input and learn the structure of the text, much like a human does. Since machines cannot understand words, they expect their data in numerical form. So we would like to represent our text data as a series of numbers. To understand how this is done, we need to understand a little about the Keras Tokenizer. One can use any other tokenizer too, but the Keras tokenizer seems like a good choice to me.
In simple words, a tokenizer is a utility function to split a sentence into words. keras.preprocessing.text.Tokenizer tokenizes(splits) the texts into tokens(words) while keeping only the most occurring words in the text corpus.
The num_words parameter keeps only a prespecified number of the most frequent words in the text. This is helpful as we don’t want our models to pick up a lot of noise by considering words that occur very infrequently. In real-world data, most of the words we leave out using the num_words param are normally misspells. The tokenizer also filters some unwanted tokens by default and converts the text into lowercase.
The tokenizer once fitted to the data also keeps an index of words(dictionary of words which we can use to assign a unique number to a word) which can be accessed by tokenizer.word_index. The words in the indexed dictionary are ranked in order of frequencies.
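So the whole code to use the tokenizer comes down to three steps: create a Tokenizer with num_words, fit it on the training text with fit_on_texts, and convert both datasets with texts_to_sequences. Here train_X and test_X are lists of documents in the corpus.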
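To make the behaviour concrete, here is a dependency-free sketch of roughly what happens under the hood. It is not the Keras implementation: `split_words`, `fit_word_index` and the standalone `texts_to_sequences` are simplified stand-ins I wrote for illustration, and the toy `train_X` and `test_X` are made up.

```python
from collections import Counter

# Characters stripped before splitting (mirrors the Keras default filter set).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_STRIP = str.maketrans({ch: ' ' for ch in FILTERS})

def split_words(text):
    """Lowercase, drop filtered characters, split on whitespace."""
    return text.lower().translate(_STRIP).split()

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays free."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Turn each text into a list of word ids, dropping unknown or rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in split_words(text) if w in word_index]
        if num_words is not None:
            # Only ids below num_words survive, i.e. the top num_words - 1 words.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

train_X = ["the cat sat", "the dog sat", "the cat ran"]
test_X = ["the bird sat"]

word_index = fit_word_index(train_X)
train_seq = texts_to_sequences(train_X, word_index, num_words=4)
test_seq = texts_to_sequences(test_X, word_index, num_words=4)
```

Note that the index is built from the training text only, so a word seen only in the test set ("bird" here) is simply dropped.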
b) Pad Sequence:
Normally our model expects that each sequence (each training example) will be of the same length (same number of words/tokens). We can control this using the maxlen parameter of pad_sequences: shorter sequences are padded and longer ones are truncated.
Now our train data contains a list of lists of numbers. Each list has the same length. And we also have the word_index, which is a dictionary of the most frequently occurring words in the text corpus.
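A pure-Python sketch of the padding step. The function name `pad_to_length` is mine; it follows what I understand to be the Keras defaults, padding with zeros at the front and, for sequences that are too long, keeping the last `maxlen` tokens.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be a positive integer")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

padded = pad_to_length([[1, 2, 3], [4], [5, 6, 7, 8, 9]], maxlen=4)
```

Id 0 is never assigned to a real word by the tokenizer, which is why it is safe to use as the padding value.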
As I said, I will be using GloVe embeddings to explain the enrichment. GloVe pretrained vectors are trained on large corpora such as Wikipedia or Common Crawl, depending on the release. (You can download them here). That means some of the words that are present in your data might not be present in the embeddings. How could we deal with that? Let’s first load the GloVe embeddings.
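A minimal loading sketch, assuming the usual text format of one token per line followed by its space-separated vector values. I use plain Python lists to keep it dependency-free; in practice you would likely convert each vector to a float32 NumPy array. The names `parse_glove_lines` and `load_glove` are mine, and the path is left for you to fill in.

```python
def parse_glove_lines(lines, dim=300):
    """Build {word: vector} from lines of the form 'word v1 v2 ... v_dim'."""
    embedding_index = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Take the vector from the right, so a token that itself contains
        # spaces would still parse correctly.
        word = ' '.join(parts[:-dim])
        embedding_index[word] = [float(value) for value in parts[-dim:]]
    return embedding_index

def load_glove(path, dim=300):
    """Read a GloVe text file from `path` (wherever you downloaded it)."""
    with open(path, encoding='utf-8') as handle:
        return parse_glove_lines(handle, dim=dim)

# Tiny in-memory example with 3-dimensional vectors instead of 300.
sample_lines = ["good 0.1 0.2 0.3\n", "Word 1 2 3\n", "\n"]
glove_embedding_index = parse_glove_lines(sample_lines, dim=3)
```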
Be sure to use the path of the folder where you downloaded these GloVe vectors. What does this glove_embedding_index contain? It is just a dictionary in which the key is the word and the value is the word vector, a np.array of length 300. This dictionary is large: the cased 300-dimensional Common Crawl release (glove.840B.300d), for example, has roughly 2.2 million entries. Since we only want the embeddings of words that are in our word_index, we will create a matrix which contains just the required embeddings.
A plain lookup works fine, but is there a way that we can use the preprocessing in GloVe to our advantage? Yes. When preprocessing was done for GloVe, the creators didn’t convert the words to lowercase. That means that it contains multiple variations of a word like ‘USA’, ‘usa’ and ‘Usa’. That also means that in some cases, while a word like ‘Word’ is present, its lowercase analog ‘word’ is not. We can get around this by trying several case variants of a word when we look it up.
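A sketch of building that matrix. Two assumptions here: rows for words without a pretrained vector are left as zeros (random initialization is another common choice), and I use lists of lists to stay dependency-free where you would normally use a NumPy array. `create_embedding_matrix` is a name I picked.

```python
def create_embedding_matrix(word_index, embedding_index, dim, max_features=None):
    """Row i holds the vector of the word whose tokenizer id is i."""
    n_rows = len(word_index) + 1  # +1 because id 0 is reserved for padding
    if max_features is not None:
        n_rows = min(n_rows, max_features)
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, i in word_index.items():
        if i >= n_rows:
            continue  # word is rarer than the num_words cut-off
        vector = embedding_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix

# Toy inputs: "zzz" has no pretrained vector and keeps its all-zero row.
toy_word_index = {"good": 1, "zzz": 2, "great": 3}
toy_embeddings = {"good": [1.0, 2.0], "great": [3.0, 4.0]}
embedding_matrix = create_embedding_matrix(toy_word_index, toy_embeddings, dim=2)
```

This matrix is what you would pass as the initial weights of the embedding layer.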
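One way to sketch this fallback: try the word as-is and then a few case variants, returning the first vector found. The order of the variants and the helper names are my own choices, not a fixed recipe.

```python
def lookup_vector(word, embedding_index):
    """Return the vector for `word`, falling back to other casings."""
    for candidate in (word, word.lower(), word.upper(), word.capitalize()):
        vector = embedding_index.get(candidate)
        if vector is not None:
            return vector
    return None

def coverage(words, embedding_index):
    """Fraction of `words` for which some case variant has a vector."""
    if not words:
        return 0.0
    found = sum(1 for w in words if lookup_vector(w, embedding_index) is not None)
    return found / len(words)

# Toy cased vocabulary: 'usa' and 'word' are missing in lowercase.
toy_embeddings = {"USA": [1.0], "Word": [2.0], "good": [3.0]}
vocab_coverage = coverage(["usa", "word", "good", "zzz"], toy_embeddings)
```

Comparing coverage with and without the fallback is a quick way to see how much this trick buys you on your own vocabulary.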
The above was just an example of how we can use our knowledge of an embedding to get better coverage. Sometimes depending on the problem, one might also derive value by adding extra information to the embeddings using some domain knowledge and NLP skills. For example, we can add external knowledge to the embeddings themselves by adding polarity and subjectivity of a word from the TextBlob package in Python.
We can get the polarity and subjectivity of any word using TextBlob. Pretty neat. So let us try to add this extra information to our embeddings.
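A sketch of how the two extra values can be appended to a word vector. To keep it runnable without TextBlob installed, the sentiment lookup is passed in as a function; `toy_sentiment` and its scores are made up for illustration, and the commented lines show how TextBlob could be plugged in instead.

```python
def add_sentiment_features(vector, word, sentiment_fn):
    """Return `vector` extended with the word's polarity and subjectivity."""
    polarity, subjectivity = sentiment_fn(word)
    return list(vector) + [polarity, subjectivity]

# With TextBlob installed, the real scores could be used like this:
#   from textblob import TextBlob
#   sentiment_fn = lambda word: tuple(TextBlob(word).sentiment)

# Hypothetical stand-in lexicon; these numbers are invented for the example.
TOY_SENTIMENT = {"good": (0.7, 0.6), "awful": (-1.0, 1.0)}

def toy_sentiment(word):
    """Look up (polarity, subjectivity), defaulting to neutral."""
    return TOY_SENTIMENT.get(word, (0.0, 0.0))

enriched = add_sentiment_features([0.1, 0.2], "good", toy_sentiment)
```

With 300-dimensional GloVe vectors, every row of the embedding matrix would then have 302 columns, so the embedding layer size has to be adjusted to match.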
Engineering embeddings is an essential part of getting better performance from the Deep Learning models at a later stage. Generally, I revisit this part of the code multiple times over the course of a project while trying to improve my models even further. You can show a lot of creativity here to improve coverage over your word_index and to include extra features in your embedding.
More Engineered Features
One can always add sentence specific features like sentence length, number of unique words etc. as another input layer to give extra information to the Deep Neural Network. For example, I created these extra features as part of a feature engineering pipeline for Quora Insincerity Classification Challenge.
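A sketch of such sentence-level features. The particular set below (lengths, unique-word counts and two ratios) and the name `sentence_features` are my illustrative picks rather than a fixed recipe.

```python
def sentence_features(text):
    """Compute a few simple statistics for one piece of text."""
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    n_unique = len({word.lower() for word in words})
    n_caps = sum(1 for ch in text if ch.isupper())
    return {
        "total_length": n_chars,
        "num_words": n_words,
        "num_unique_words": n_unique,
        # Ratios are guarded against empty input.
        "caps_vs_length": n_caps / n_chars if n_chars else 0.0,
        "words_vs_unique": n_unique / n_words if n_words else 0.0,
    }

features = sentence_features("Why is the sky Blue blue")
```

Compute these on the raw text before lowercasing, and scale them before feeding them to the network alongside the sequence input.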
NLP is still a very interesting problem in the Deep Learning space, and thus I would encourage you to do a lot of experimentation to see what works and what doesn’t. I have tried to provide a holistic view of the preprocessing steps for a Deep Learning neural network for any NLP problem. But that doesn’t mean it is definitive. If you want to learn more about NLP, here is an awesome course. If you think we can add something to the flow, do mention it in the comments.
Endnotes and References:
This post is the result of the efforts of a lot of excellent Kagglers, and I will try to reference them in this section. If I leave out someone, do understand that it was not my intention to do so.
- How to: Preprocessing when using embeddings
- Improve your Score with some Text Preprocessing
- Pytorch baseline
- Pytorch starter
Originally published at mlwhiz.com on January 17, 2019.
NLP Learning Series Part 1: Text Preprocessing Methods for Deep Learning was originally published in Towards Data Science on Medium.