Understanding Language Modeling for Dummies

Language modeling is the task of assigning a probability to a given sequence of words or characters. Complex? Let's try a simpler definition.

Language modeling is the ability to predict the next word or character. Characters make up a word: “man” is a word that has 3 characters, ‘m’, ‘a’, ‘n’. It cuts across various applications such as speech recognition, optical character recognition, text generation, handwriting recognition, machine translation, and spelling correction.

A commonly found application is our phone keyboard, which suggests text to us while we type.

“I am going to the …”

The sentence could be completed using any of the following words: market, hospital, farm.

“I am going to the market”, “I am going to the hospital”, “I am going to the farm”.

Predicting the next word is what language modeling is all about in natural language processing (NLP): we use machine learning to predict (guess) the word. How can we achieve this in NLP? Common approaches are:

  1. n-gram Language Models
  2. Fixed-window Neural Language Models
  3. Vanilla Recurrent Neural Network (RNN)
  4. Long Short-Term Memory (LSTM)
  5. Gated Recurrent Unit (GRU)

n-gram Language Model

An n-gram is a chunk of n consecutive words; n is the number of words we want to consider for our prediction (guess).

E.g. if the sentence is “When my sister comes back, I am going to the …”, then the last n-gram of the sentence is:

For a 1-gram, we consider “the”

For a 2-gram, we consider “to the”

For a 3-gram, we consider “going to the”
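To make the idea concrete, here is a minimal sketch that pulls the last n words out of a sentence. The helper name `last_ngram` is made up for this illustration, and splitting on spaces is a simplifying assumption (real tokenizers also handle punctuation).

```python
def last_ngram(text, n):
    """Return the last n words of text as a tuple (n must be at least 1)."""
    words = text.split()          # naive whitespace tokenization
    return tuple(words[-n:])

context = "When my sister comes back, I am going to the"
for n in (1, 2, 3):
    print(n, last_ngram(context, n))
```

Running it prints the same three chunks listed above: “the”, “to the” and “going to the”.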

How do we predict using n-grams? We look in our corpus (the body of text we are considering as our data) for all the n-grams that start with our context, e.g. “am going to the market”, “am going to the hospital”, “am going to the farm”. We count the number of times each n-gram occurs in the corpus, turn the counts into probabilities, and take the mode (the highest-occurring continuation). Mathematically,

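$$P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)}) \approx \frac{\text{count}(x^{(t-n+2)}, \dots, x^{(t)}, x^{(t+1)})}{\text{count}(x^{(t-n+2)}, \dots, x^{(t)})}$$

The numerator counts the full n-gram and the denominator counts the (n-1)-word context that precedes the word being predicted. (Reference: Stanford CS224n)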

Equation!! Freaking out!! Relax! Let's go back to our example.

Given n = 5, we consider the 4 words that precede the word we are predicting, which is “am going to the …”. To predict, we compute the probability using the formula stated above.

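$$P(w \mid \text{am going to the}) = \frac{\text{count}(\text{am going to the } w)}{\text{count}(\text{am going to the})}$$

where w could be market, hospital or farm.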

Suppose the number of times “am going to the …” occurs is 10. This count is the denominator for every candidate word.

If the number of times “am going to the market” occurs is 5, the probability is 5/10 = 0.5.

If the number of times “am going to the hospital” occurs is 3, the probability is 3/10 = 0.3.

If the number of times “am going to the farm” occurs is 2, the probability is 2/10 = 0.2.

Our model will predict “am going to the market” because it has the highest frequency, and therefore the highest probability.
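The counting above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy corpus are invented here, and the corpus is built so that it reproduces the counts 5, 3 and 2 from the example.

```python
from collections import Counter, defaultdict

def train_ngram_counts(sentences, n):
    """For every (n-1)-word context, count which word follows it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            next_word = words[i + n - 1]
            counts[context][next_word] += 1
    return counts

def next_word_probabilities(counts, context):
    """P(word | context) = count(context + word) / count(context)."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    if total == 0:
        return {}                      # context never seen in the corpus
    return {word: c / total for word, c in followers.items()}

# Hypothetical toy corpus chosen to match the counts in the example.
corpus = (["I am going to the market"] * 5
          + ["I am going to the hospital"] * 3
          + ["I am going to the farm"] * 2)

counts = train_ngram_counts(corpus, n=5)
probs = next_word_probabilities(counts, ["am", "going", "to", "the"])
prediction = max(probs, key=probs.get)   # the mode: most probable next word
print(probs)
print(prediction)
```

With this corpus the model assigns 0.5 to “market”, 0.3 to “hospital” and 0.2 to “farm”, so it predicts “market”.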

Frailties of this prediction model:

  1. Sparsity problem: what if the correct word is “salon”, not “market”, and “am going to the salon” never occurred in the data (corpus)? Its count is zero, so the model gives it a probability of zero.
  2. Storage problem: you need to store the count of every n-gram seen in a large corpus, which becomes expensive as n or the corpus grows.

Solution: Deep Learning Neural Network

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but to predict the next word of a sentence the previous words are required, so the network needs to remember them. The layers of our network are the input layer, the hidden layers, and the output layer.

Reference: Stanford CS224n

input sequence

The input sequence has to be converted into a numerical format for the model; this is called text representation. It is done via word embedding. A word embedding is a learned representation for text where words that have similar meanings have similar representations.

This starts with one-hot encoding of the input text: each word in the input sequence is encoded on its own. One-hot encoding converts each word (categorical data) to a numerical format that the model can understand, namely a vector of zeros with a single 1 at the word's index in the vocabulary. The words “I am going to the” can be represented, for example, by the indices [1 42 12 7 3], one distinct index per word. The encoded words are then trained so that each word is represented as an n-dimensional vector of floating-point values that conveys information based on text similarity.
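Here is a small sketch of that pipeline, from word to index to one-hot vector to embedding. The vocabulary, the embedding size and the random values are all assumptions made for illustration; in a real model the embedding values are learned during training rather than drawn at random.

```python
import random

vocab = ["I", "am", "going", "to", "the", "market"]   # hypothetical tiny vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# Embedding matrix: one row of floats per word (random stand-ins for learned values).
random.seed(0)
embedding_dim = 4
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def embed(word):
    """Multiplying a one-hot vector by the matrix simply selects one row."""
    return embeddings[word_to_index[word]]

sentence = "I am going to the".split()
indices = [word_to_index[w] for w in sentence]
print(indices)            # [0, 1, 2, 3, 4]
print(one_hot("going"))   # [0, 0, 1, 0, 0, 0]
print(embed("going"))     # a vector of 4 floats
```

The one-hot vector grows with the vocabulary, while the embedding keeps a fixed, small size, which is one reason embeddings are preferred as model input.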

hidden states

Each hidden state is calculated by applying the sigmoid function to a weighted sum of the input of the current step, the hidden state of the previous step, and a bias, as represented below. The sigmoid function is an activation function that defines the output of a node from a given input; it squashes any value into the range 0 to 1. Computation for step t can (in theory) use information from many steps back.

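$$h^{(t)} = \sigma\left(W_h h^{(t-1)} + W_e e^{(t)} + b_1\right)$$

where $e^{(t)}$ is the embedding of the input word at step t, $h^{(t-1)}$ is the hidden state of the previous step, $W_h$ and $W_e$ are weight matrices, and $b_1$ is the bias. (Reference: Stanford CS224n)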


The output is processed through softmax. Softmax is an activation function that normalizes the output scores into probabilities that sum to 1. The equation for the output distribution can be represented as:

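$$\hat{y}^{(t)} = \text{softmax}\left(U h^{(t)} + b_2\right) \in \mathbb{R}^{|V|}$$

where $h^{(t)}$ is the hidden state at step t, $U$ is the output weight matrix, $b_2$ is the output bias, and $|V|$ is the size of the vocabulary. (Reference: Stanford CS224n)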

Model size doesn't increase for longer inputs. The same weights are applied at every timestep in a single recurrent layer, so there is symmetry in how inputs are processed.
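The hidden state and softmax steps described above can be sketched in plain Python. Treat this as a rough illustration under stated assumptions: the names `W_e`, `W_h`, `b1`, `U` and `b2`, the layer sizes, and the random weights and inputs are all made up here, and nothing is trained.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(e_t, h_prev, W_e, W_h, b1, U, b2):
    """One timestep: new hidden state, then a distribution over the vocabulary."""
    pre = [a + b + c for a, b, c in zip(matvec(W_e, e_t), matvec(W_h, h_prev), b1)]
    h_t = [sigmoid(z) for z in pre]
    y_hat = softmax([s + b for s, b in zip(matvec(U, h_t), b2)])
    return h_t, y_hat

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
embed_dim, hidden_dim, vocab_size = 4, 3, 6    # hypothetical sizes
W_e = rand_matrix(hidden_dim, embed_dim)
W_h = rand_matrix(hidden_dim, hidden_dim)
U = rand_matrix(vocab_size, hidden_dim)
b1, b2 = [0.0] * hidden_dim, [0.0] * vocab_size

# Five random stand-ins for the embeddings of a 5-word input.
inputs = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(5)]

h = [0.0] * hidden_dim                         # initial hidden state
for e_t in inputs:
    h, y_hat = rnn_step(e_t, h, W_e, W_h, b1, U, b2)   # same weights at every step

print(y_hat)        # probabilities for each of the 6 vocabulary words
print(sum(y_hat))   # sums to 1 (up to floating-point rounding)
```

Notice that the loop could run over 5 words or 500 without adding a single weight, which is the point about model size.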

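To train the RNN for language modeling, a corpus of text, which is a sequence of words, is fed into the RNN. The RNN computes the output distribution for every step t, i.e. it predicts the probability distribution of every word given the words so far. The loss function on step t is the cross-entropy between the predicted probability distribution $\hat{y}^{(t)}$ and the true next word $y^{(t)}$ (one-hot for $x^{(t+1)}$):

$$J^{(t)}(\theta) = CE(y^{(t)}, \hat{y}^{(t)}) = -\sum_{w \in V} y_w^{(t)} \log \hat{y}_w^{(t)} = -\log \hat{y}^{(t)}_{x^{(t+1)}}$$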

Averaging the loss over every step t gives the overall loss for the entire training set. We then compute the loss and its gradients, and update the weights. The loss measures how far the predicted output is from the true output. Cross-entropy loss is used because it measures the performance of a classification model whose output is a probability value between 0 and 1. The total loss can be given as:

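$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = \frac{1}{T} \sum_{t=1}^{T} -\log \hat{y}^{(t)}_{x^{(t+1)}}$$

where T is the number of steps and $J^{(t)}(\theta)$ is the cross-entropy loss at step t. (Reference: Stanford CS224n)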
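As a quick sketch of how this loss is computed: because the true next word is one-hot, the cross-entropy at a step is just minus the log of the probability the model gave to the true word. The function names and the predicted distributions below are hypothetical, picked only to show the arithmetic.

```python
import math

def cross_entropy_step(y_hat, true_index):
    """Loss at one step: minus the log probability given to the true next word."""
    return -math.log(y_hat[true_index])

def average_loss(predictions, true_indices):
    """Overall loss: the mean of the per-step losses."""
    losses = [cross_entropy_step(p, t) for p, t in zip(predictions, true_indices)]
    return sum(losses) / len(losses)

# Hypothetical predicted distributions over a 3-word vocabulary for 2 steps.
predictions = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]
true_indices = [0, 1]          # index of the true next word at each step

print(cross_entropy_step(predictions[0], 0))
print(average_loss(predictions, true_indices))
```

The more probability the model puts on the true word, the closer the loss gets to 0; a confident wrong guess is punished with a large loss.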

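Drawbacks of RNN

Slow computation: the output has to be computed one time step after another, because each step depends on the previous one, and this can be time-consuming.

Vanishing gradient: the gradient is computed using backpropagation, which applies the mathematical chain rule (the derivative of a function of a function). This multiplies many factors together, one per time step, and when those factors are small the gradient diminishes as we go further back through the sequence.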
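A toy calculation shows how fast the gradient can shrink. This sketch assumes a deliberately simplified RNN with a single hidden unit, a recurrent weight `w`, and the same pre-activation `z` at every step (both hypothetical), so that the chain rule contributes the same factor `w * sigmoid'(z)` at each step. The derivative of the sigmoid is never larger than 0.25.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # largest value is 0.25, reached at x = 0

def gradient_through_time(steps, w=1.0, z=0.0):
    """Product of the per-step chain-rule factors w * sigmoid'(z) over `steps` steps."""
    factor = w * sigmoid_derivative(z)
    return factor ** steps

for steps in (1, 5, 10, 20):
    print(steps, gradient_through_time(steps))
```

Even in this best case for the sigmoid, the gradient that reaches a word 20 steps back is vanishingly small, so the model struggles to learn long-range dependencies.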

The drawbacks are not limited to those mentioned above; there are others, such as the exploding gradient. Other modern neural networks have been designed to address these problems, such as LSTM, GRU, and the Transformer, among others.

Thanks for reading!!
