NLP - SVM to Transformer-04(Language Models)


Hello AI Enthusiasts!!! This is the 5th post in our series on Natural Language Processing. In this post, we will learn about language modelling. Put simply, language modelling is the task of predicting the most likely next word or character given a sequence of words or characters. Let's check the definition from the book "Deep Learning with Python" by Francois Chollet:

Any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure.

It means that, given a corpus (e.g. a Wikipedia dump) and a sequence (e.g. "Natural language"), the model estimates the probability of each candidate next word, e.g. "Processing": 0.5, "Engine": 0.3, "Speaker": 0.2, etc.
We can achieve this in two ways:

  • Classical Statistics and probability-based approach i.e. N-gram model
  • Deep Learning-based sequence model


A. N-gram model

Let's say we have to predict the next word for "Transfer learning has been recently shown to drastically increase the <...>"
What we want is,
p(w | transfer learning has been recently shown to drastically increase the)
which can be estimated as,
Count of "transfer learning has been recently shown to drastically increase the w" / Count of "transfer learning has been recently shown to drastically increase the"
It is clear from the above ratio that even with a very large corpus we will rarely find many repetitions of such a long sequence, so we will hardly ever see more than one candidate next word.
To solve this, we assume that each word depends only on the previous "N" words. "N" is a parameter we can change: the larger it is, the further back into the context the model looks.
In the above example, let's assume N = 3; then the next word depends only on "drastically increase the". A good prediction would be "performance", but with only 3 words to condition on, the model may predict "stamina" too. The result also depends on the underlying corpus.
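
The assumption above can be written as a worked equation. This is only a sketch in conventional notation: w_1 ... w_k stands for the context words, C(.) for a count over the corpus, and these symbols are introduced here for illustration rather than taken from the post.

```latex
P(w \mid w_1, \dots, w_k)
  \;\approx\; P(w \mid w_{k-N+1}, \dots, w_k)
  \;=\; \frac{C(w_{k-N+1} \dots w_k \, w)}{C(w_{k-N+1} \dots w_k)}

% With N = 3 and the running example:
P(w \mid \text{drastically increase the})
  \;=\; \frac{C(\text{drastically increase the } w)}{C(\text{drastically increase the})}
```

The shorter the context, the more often it occurs in the corpus, so the counts in the ratio become usable.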

Let's see the formal definition from Wikipedia.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The n-grams typically are collected from a text or speech corpus. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on.

For example, for our small corpus "transfer learning has been recently shown to drastically increase the performance":

1-gram = ["transfer", "learning", "has", "been", "recently", "shown", "to", ...]
2-gram = ["transfer learning", "learning has", "has been", "been recently", ...]
3-gram = ["transfer learning has", "learning has been", "has been recently", ...]
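
The lists above can be produced with a few lines of Python. This is a minimal sketch; the helper name `ngrams` is made up for this illustration and is not part of the code used later in the post.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "transfer learning has been recently shown to drastically increase the performance"
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:4])
print(trigrams[:3])
```

A sentence of 11 words yields 11 unigrams, 10 bigrams and 9 trigrams, since each extra word in the window removes one possible starting position.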

So, now the steps are very simple

  • Decide on an N for the n-gram. Remember, the smaller the "n", the less context the prediction has. It will only depend on the last few words.
  • Count each unique n-gram and the words that follow it
  • Predict the next word as the word with the highest count (as the next word) for the n-gram
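
The three steps can be sketched on a toy corpus as follows. The corpus and the helper names (`build_counts`, `predict_next`) are made up for this illustration; the code for the real dataset comes later and is organised differently.

```python
from collections import Counter, defaultdict

def build_counts(tokens, n):
    """Map each n-gram (as a tuple) to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        counts[context][tokens[i + n]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequent follower of `context`, or None if it is unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus, made up for this illustration
toy_tokens = "the cat sat on the mat and the cat sat on the sofa and the cat ran".split()
counts = build_counts(toy_tokens, n=2)

print(predict_next(counts, ["the", "cat"]))
print(predict_next(counts, ["purple", "cat"]))
```

Here "the cat" is followed by "sat" twice and "ran" once, so "sat" wins. The second call returns None because that n-gram never occurs, which is exactly the unseen-key situation that needs special handling.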

What if the n-gram combination is new in the test data?
There is a good discussion of this in the excellent NLP book "Speech and Language Processing". We will follow a very simple approach, i.e. making a random prediction in such cases.

Let's code

We will use a news corpus dataset. It has small chunks of news description from different publishing houses.

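import re
import pandas as pd

# Load the dataset
dataframe = pd.read_csv("/content/ag_news_csv/train.csv", names=['publisher', 'description'])

# Replace all non-alphabetical chars with a space, lowercase everything,
# collapse multiple spaces into one, then split into tokens
regex = re.compile('[^a-zA-Z]')
# Assumption: the text to model lives in the 'description' column
data = dataframe['description'].apply(lambda x: regex.sub(' ', x.lower()))
data = data.apply(lambda x: re.sub(' +', ' ', x).split())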
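
To see what this cleaning does to a single piece of text, here is a small sketch on a made-up sentence. The function name `clean_and_tokenize` and the sample string are assumptions for illustration only.

```python
import re

def clean_and_tokenize(text):
    """Lowercase, replace non-letters with spaces, collapse spaces, split into words."""
    text = re.sub('[^a-zA-Z]', ' ', text.lower())
    return re.sub(' +', ' ', text).split()

sample = "Oil prices rose 3.5% on Tuesday, AP reports."
tokens = clean_and_tokenize(sample)
print(tokens)
```

Note that digits and punctuation disappear entirely, so "3.5%" leaves no token behind. That keeps the vocabulary small, at the cost of losing numbers.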

All other parts of the code are quite trivial and self-explanatory.

n_gram_data = {}  # Shared dict - format will be {k1:[w1,w2,w1,w3], k2:[...],...}

def make_gram(list_token, n=4):
    for i in range(len(list_token)-n):  #1
        key = '_'.join(list_token[i:i+n])  #2
        val = list_token[i+n]

        if key not in n_gram_data:  #3
            n_gram_data[key] = [val]
        else:
            n_gram_data[key].append(val)

corpus = []
for id, rows in data.items():  #4
    corpus.extend(rows)  # Keep a flat copy of all tokens for the random fallback
    make_gram(rows, 3)
corpus_len = len(corpus)

#1 - Loop over the list of all the tokens
#2 - Create a key for each run of n tokens, i.e. for each n-gram
#3 - Insert a new entry if the key is missing, otherwise append to the existing list
#4 - Create the n-gram dict for the news dataset with n=3. Corpus is a copy of all tokens

import numpy as np

string = 'it will not'  # Seed n-gram
input = '_'.join(string.split())  # Convert it to a key

length = 50  # Number of words to predict
best_of = 2
print(string, end=' ')  # Print each word without a new line
while length > 0:
    vals = n_gram_data.get(input, 'NA')  # This is a list

    # Smoothing #1
    if vals == 'NA':
        vals = [corpus[np.random.randint(0, len(corpus))]]
    # Smoothing Ends

    prob_dict = {elem: vals.count(elem)/len(vals) for elem in vals}  #2

    pred = sorted(prob_dict, key=prob_dict.get, reverse=True)[:best_of]  #3
    next_word = pred[np.random.randint(0, len(pred))]  #4

    print(next_word, end=' ')
    input = input.split('_') + [next_word]
    input = '_'.join(input[1:])
    length -= 1


#1 - If a key is not available, pick the next word at random
#2 - Calculate the probability of each candidate next word for the current key
#3 - Sort the candidates by probability (the dict values, not the keys), highest first
#4 - We don't always want to pick the word with the highest probability, so we pick one at random from the top K (see the "best_of" parameter). This adds novelty to the generated text.
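
The top-K idea from the notes can be looked at in isolation. The following is a sketch with a made-up list of followers; `sample_top_k` is a hypothetical helper, not a function from the code above.

```python
import random
from collections import Counter

def sample_top_k(followers, k, rng=random):
    """Pick uniformly at random among the k most frequent followers."""
    counts = Counter(followers)
    total = sum(counts.values())
    # most_common sorts by count, highest first
    top = counts.most_common(k)
    probs = {word: count / total for word, count in top}
    choice = rng.choice([word for word, _ in top])
    return choice, probs

followers = ["be", "be", "be", "have", "have", "go"]  # made-up followers of one n-gram
random.seed(0)
word, probs = sample_top_k(followers, k=2)
print(word, probs)
```

With k=2 the rare follower "go" can never be chosen, while "be" and "have" are equally likely to be picked even though "be" is more frequent. Sampling in proportion to the probabilities would be another reasonable design.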

it will not impose fuel surcharges on domestic and international air fares will rise by and respectively because of the trademarked keyterms that companies bid for in the adwords keyword advertising program lt p gt ottawa reuters nortel networks corp investors predicted tuesday the telecom equipment giant the subject of how miguel angel

Although the output lacks coherence in terms of sentence meaning, it is decent considering the simplicity of the method used.

B. Deep Learning-based sequence model

Two key limitations of the previous approach are,

  • The need to fix "n" at the beginning
  • The need for smoothing

With a Deep Learning model,

  • We can input a sequence of variable length
  • We will get an output in every scenario

We will build our Deep Learning model with chars as tokens, unlike the words used in the previous section. With chars, we will have only 27 tokens, i.e. 26 letters and one for space. We avoid an embedding layer here to keep the explanation focused on a single idea.
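
To make the 27-token vocabulary concrete, here is a plain-Python sketch of one-hot encoding over it. The names `vocab`, `one_hot` and `decode` are illustrative assumptions; the actual code in this post uses scikit-learn's OneHotEncoder instead.

```python
import string

# 27 tokens: space plus the 26 lowercase letters
vocab = [' '] + list(string.ascii_lowercase)
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(text):
    """Encode each char of `text` as a 27-long vector with a single 1."""
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[char_to_index[ch]] = 1
        vectors.append(vec)
    return vectors

def decode(vectors):
    """Invert one_hot by reading off the position of the 1 in each vector."""
    return ''.join(vocab[vec.index(1)] for vec in vectors)

encoded = one_hot("ab c")
print(len(vocab), len(encoded), len(encoded[0]))
print(decode(encoded))
```

Each char becomes a vector of length 27, so a sequence of 10 chars is a 10 x 27 matrix. That matrix is what the recurrent model consumes in place of an embedding.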

It's a simple recurrent classification model, i.e. it predicts the next char.

Let's code

import numpy as np
from sklearn.preprocessing import OneHotEncoder

corpus = []
for id, row in data.items():  #1
    for elem in row:
        corpus.append(' ')
        corpus.extend(list(elem))  # extend adds each char; append would add the whole list

corpus = np.array(corpus)
corpus = corpus.reshape(-1, 1)

ohe = OneHotEncoder()

corpus = ohe.fit_transform(corpus).toarray()  #2
corpus = corpus.astype('float16')

#1 - Created the corpus of tokens (chars here)
#2 - One-hot encoded all the chars

x = []; y = []; seq_len = 10  #1
total_seq_len = 500000

for i, char in enumerate(corpus[:total_seq_len]):  #2
    x.append(corpus[i:i+seq_len])  #3
    y.append(corpus[i+seq_len])  # The char right after the window is the target

x = np.array(x)
y = np.array(y)

#1 - Defining the sequence length (seq_len) and the number of sequences to use (total_seq_len)
#2 - Using only the first total_seq_len positions of the corpus
#3 - Defining x as a sequence of seq_len chars and y as the next char after those seq_len chars
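
The windowing is easier to see on a short string than on one-hot arrays. This is a minimal sketch with a hypothetical helper `make_windows` working on raw chars; the real code does the same slicing on the encoded corpus.

```python
def make_windows(text, seq_len):
    """Return (inputs, targets): each input is seq_len chars, each target the char after it."""
    inputs, targets = [], []
    for i in range(len(text) - seq_len):
        inputs.append(text[i:i + seq_len])
        targets.append(text[i + seq_len])
    return inputs, targets

inputs, targets = make_windows("the company", seq_len=4)
for window, target in zip(inputs, targets):
    print(repr(window), '->', repr(target))
```

Stopping at `len(text) - seq_len` guarantees that every window has a char after it to serve as the target.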

def data_gen(batch_size):  #1
    while True:
        for i in range(x.shape[0] // batch_size):  #2
            yield x[i*batch_size:i*batch_size+batch_size], y[i*batch_size:i*batch_size+batch_size]

#1 - Defined a generator that yields batches of size batch_size
#2 - Looped over x and yielded x, y slices of length batch_size
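
The behaviour of such an endless batch generator can be checked on plain lists. This is a sketch with stand-in data; `batch_gen` is a hypothetical variant that takes the data as arguments instead of reading the global x and y.

```python
from itertools import islice

def batch_gen(xs, ys, batch_size):
    """Yield (x_batch, y_batch) forever, restarting from the beginning after each pass."""
    n_batches = len(xs) // batch_size  # drop the final partial batch
    while True:
        for b in range(n_batches):
            start = b * batch_size
            yield xs[start:start + batch_size], ys[start:start + batch_size]

xs = list(range(10))           # stand-in inputs
ys = [v * v for v in xs]       # stand-in targets
batches = list(islice(batch_gen(xs, ys, batch_size=4), 3))
for xb, yb in batches:
    print(xb, yb)
```

The third batch is the first one again because the generator wraps around. Since it never stops on its own, whoever consumes it has to be told how many batches make up one epoch.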

import tensorflow as tf
from tensorflow import keras

dropout = 0.1; neurons = 128
batch_size = 128  # Assumed value; the post does not state the batch size
model = keras.models.Sequential([
    keras.Input(shape=(None, x.shape[2])),  # (timesteps, features); None allows variable-length sequences
    keras.layers.Bidirectional(keras.layers.LSTM(neurons, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(neurons)),  # Returns one vector per sequence, to match the shape of y
    keras.layers.Dense(250, activation='relu'),
    keras.layers.Dense(27, activation='softmax')
])  #1

optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

# Assumption: the generator defined above is what gets passed to model.fit
history = model.fit(data_gen(batch_size), epochs=75, steps_per_epoch=total_seq_len//batch_size)  #2

#1 - Defined a simple Recurrent Neural Network
#2 - Since we have defined a custom generator, steps_per_epoch is needed; otherwise the 1st epoch will never end

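string = 'the company'[:seq_len]  # Seed, truncated to seq_len chars
length = 125  # Number of chars to generate

print(string, end=' ')
while length > 0:
    # One-hot encode the current window and shape it as (1, timesteps, features)
    input = ohe.transform(np.array(list(string)).reshape(-1,1)).toarray().reshape(1,len(string),-1)

    vals = model.predict(input)
    argmax = np.argmax(vals)
    # Turn the softmax output into a one-hot vector so it can be decoded
    vals[:] = 0
    vals[0,argmax] = 1
    char = ohe.inverse_transform(vals)

    print(char[0][0], end='')
    string = string[1:]+char[0][0]  # Slide the window by one char
    length -= 1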


A simple loop to predict the next char in a running-window manner
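
The running-window loop does not depend on the network itself, so it can be tried with a stand-in predictor. In this sketch `generate` and `toy_predictor` are hypothetical names, and the toy predictor simply cycles through a fixed string instead of calling a trained model.

```python
def generate(seed, predict_next_char, n_chars):
    """Greedy generation: predict one char, slide the window by one, repeat."""
    window = seed
    output = seed
    for _ in range(n_chars):
        next_char = predict_next_char(window)
        output += next_char
        window = window[1:] + next_char  # drop the oldest char, keep the length fixed
    return output

# Hypothetical stand-in for the trained network: it just cycles through "abc "
def toy_predictor(window):
    cycle = "abc "
    return cycle[(cycle.index(window[-1]) + 1) % len(cycle)]

text = generate("abc", toy_predictor, n_chars=6)
print(text)
```

The window always keeps the length of the seed, which is why the seed in the real code is truncated to seq_len chars before the loop starts.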

the compan y s credit picture industry group said on wednesday after the web search engine has slashed the price range to beevised the b


Our sentences are not very coherent since we didn't use one homogeneous corpus, e.g. a book; it was more of a collection of news pieces. Still, the spelling of the words was very accurate. You can try:

  • The same exercise on a different Corpus
  • A Deep Learning-based model with word-level tokens

This was all for this post. In the next post of the NLP series, we will learn and code a Language Translator, e.g. German to English.