NLP with Topic Modelling

Understanding why we use Topic Modelling with easy implementation

There’s too much data today and there will be too much data in the future by a factor of a thousand. — Ben Hoffman

Why Topic Modelling?

Topic modelling is mainly used to reduce the high dimensionality of the feature space in Natural Language Processing models, allowing us to grasp and use only the main topics.

What is high dimensionality?

In brief, high dimensionality refers to situations where many data points lie close to the boundary of our sample space. This is a problem because predictions are much more difficult near the edges of the training sample: one must extrapolate from neighboring sample points rather than interpolate between them.

Extrapolation (red) vs. interpolation (blue)

For a given training dataset, if a data point lies beyond or close to “the edge”, i.e., the endpoints of the range spanned by most of the sample, predicting at that point requires extrapolation; points well inside the range can be handled by interpolation.

Why is extrapolation a problem?

Extrapolation is generally considered a concern because we are much less certain about the shape of the relationship outside the range of our data, e.g. 👇

Unclear predictions for points that require extrapolation

The Curse of Dimensionality

Generally, data points become sparse as the dimensionality of our sample space increases, which increases the number of points that require extrapolation and also leads to overfitting.

Higher Dimensionality of Text-Based Data

To understand why text-based data is considered high-dimensional, let’s consider how we would featurize it.

Example text: “Hey there, welcome to this Medium article”

A common approach is to compute the sentence’s bag of words, i.e., the list of distinct words and their frequencies for a given document.

So the bag of words here can be considered as:

hey:1

there:1

welcome:1

to:1

this:1

medium:1

article:1

To make sense of the given data, we need to consider each unique word as a feature, with its respective value being the number of times the word occurs.
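As a quick sketch of this featurization using Python’s standard library:

```python
from collections import Counter

text = "Hey there, Welcome to this medium article"

# Lowercase, drop punctuation, split into tokens, then count occurrences.
tokens = text.lower().replace(",", "").split()
bag_of_words = Counter(tokens)
print(bag_of_words)
# Counter({'hey': 1, 'there': 1, 'welcome': 1, 'to': 1, 'this': 1, 'medium': 1, 'article': 1})
```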

Now, the given text will be part of a larger corpus (a group of sentences/paragraphs), which may in turn belong to another super-group, and so on, which brings about the problem of high dimensionality.

In comes Topic Modelling!

Topic modelling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about.

We basically want to reduce our corpus from about 100k features to around 10–20 topics or fewer.

The Implementation

Let us first make some example documents and compile them.
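The original code gists aren’t embedded here, so below is a minimal sketch of this step; the documents themselves are made-up examples:

```python
# Hypothetical example documents; any small set of short texts works here.
doc_1 = "I spend my weekends watching movies and reviewing films online."
doc_2 = "My doctor says long commutes increase stress and blood pressure."
doc_3 = "The new electric car drives smoothly and is great for road trips."
doc_4 = "Watching too many movies late at night is bad for my health."
doc_5 = "Regular exercise and a good diet keep stress and blood pressure low."

# Compile all documents into a single corpus (a list of strings).
documents = [doc_1, doc_2, doc_3, doc_4, doc_5]
```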

Now, we can preprocess the data.

We will remove the stop words (words like a, an, the), which have no inherent meaning.

We will also lemmatize the words (the process of grouping together the inflected forms of a word so that they can be analyzed as a single item). E.g., good, better, and best basically mean the same thing, i.e., they point towards something being positive, so we can group them together.
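Here is one way to sketch this preprocessing with NLTK, assuming the `documents` list from above (the stop-word list and lemmatizer choices are illustrative):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(doc):
    # Lowercase, strip punctuation, drop stop words, then lemmatize the rest.
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [word for word in no_punct.split() if word not in stop_words]
    return [lemmatizer.lemmatize(word) for word in tokens]

cleaned_docs = [clean(doc) for doc in documents]
```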

Now, we can prepare a Document Term Matrix, which converts our text corpus into a matrix representation that better fits the model we’ll create in the next step.
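With gensim, one way to build this (assuming the `cleaned_docs` from the previous step) looks like:

```python
from gensim import corpora

# Map every unique token in the corpus to an integer id.
dictionary = corpora.Dictionary(cleaned_docs)

# Represent each document as a bag of words: a list of (token_id, count) pairs.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleaned_docs]
```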

We can now create and train our LDA (Latent Dirichlet Allocation) model, a generative statistical model that allows us to group our data into topics.
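A minimal sketch of training the model with gensim; the number of topics and passes are illustrative choices, not the article’s exact settings:

```python
from gensim.models import LdaModel

# Train an LDA model that groups the corpus into 3 topics.
lda_model = LdaModel(
    corpus=doc_term_matrix,
    id2word=dictionary,
    num_topics=3,     # illustrative; tune for your corpus
    passes=50,        # full passes over the corpus during training
    random_state=42,  # fixed seed so the topics are reproducible
)
```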

Now, we print the output.
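For example, using gensim’s topic printer:

```python
# Each topic is printed as a weighted mix of its most probable terms.
for topic in lda_model.print_topics(num_topics=3, num_words=4):
    print(topic)
```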

The result:

Here, the corpus has been split into 3 main topics. Each line is a topic, showing its individual topic terms and their weights.

These topics can now be used to train custom models with relative ease and less computation.

Github Link: https://github.com/NeelGhoshal/Topic_modelling

Feel free to contact me with anything at my social handles:

LinkedIn

Instagram

Github