LSTM Based Poetry Generation Using NLP in Python


One of the major tasks in Conversational AI is Natural Language Generation (NLG), which refers to employing models to generate natural language. In this article, we will get hands-on with NLG by building an LSTM-based poetry generator.

Note: This article assumes familiarity with LSTMs. For an in-depth look at how LSTMs work, you are recommended to read this article.

Dataset

The dataset used for building the model has been obtained from Kaggle. It is a compilation of poems by numerous poets, provided as a single text file. We can easily use this data to generate training sequences and subsequently train an LSTM model. You can find the dataset here


Building the Text Generator

The text generator can be built in the following simple steps:

Step 1. Import Necessary Libraries

First, we need to import the necessary libraries. We are going to use TensorFlow with Keras to build the Bidirectional LSTM.

If any of the mentioned libraries is not installed, just install it by running pip install [package-name] in the terminal (for example, pip install wordcloud).

Python3




import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow.keras.utils as ku
from wordcloud import WordCloud
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers


Step 2. Loading the Dataset and Exploratory Data Analysis

Now, we'll load our dataset. Further, we need to perform some Exploratory Data Analysis so that we get to know our data better. As we are dealing with text data, a good way to do so is by generating a word cloud.

Python3




# Reading the text data file
with open('poem.txt', encoding="utf8") as f:
    data = f.read()
 
# EDA: Generating a WordCloud to
# visualize the text
wordcloud = WordCloud(max_font_size=50,
                      max_words=100,
                      background_color="black").generate(data)
 
# Plotting the WordCloud
plt.figure(figsize=(8, 4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig("WordCloud.png")
plt.show()


Output :

(The generated word cloud image, also saved as WordCloud.png)

Step 3. Creating the Corpus

Now, we have all our data present in one massive text file. However, feeding the model the entire file at once is not practical; the model learns from short next-word-prediction examples. Thus, we will split our text into individual lines, which we can then use to generate the training sequences for our model.

Python3




# Generating the corpus by
# splitting the text into lines
corpus = data.lower().split("\n")
print(corpus[:10])


Output : 

['stay, i said',
 'to the cut flowers.',
 'they bowed',
 'their heads lower.',
 'stay, i said to the spider,',
 'who fled.',
 'stay, leaf.',
 'it reddened,',
 'embarrassed for me and itself.',
 'stay, i said to my body.']

Step 4. Fitting the Tokenizer on the Corpus

In order to generate the training sequences later, we need to fit a Keras Tokenizer on the entire corpus so that it learns the vocabulary.

Python3




# Fitting the Tokenizer on the Corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
 
# Vocabulary count of the corpus
total_words = len(tokenizer.word_index)
 
print("Total Words:", total_words)


Output : 

Total Words: 3807
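To see what the Tokenizer has learned, we can peek at its word_index dictionary, which maps each word to an integer index (1 for the most frequent word). The entries shown in the comment below are illustrative; the exact words and order depend on the corpus:

Python3

# Inspecting the first few vocabulary entries; word_index maps each
# word to an integer index, ordered by frequency (1 = most frequent)
print(list(tokenizer.word_index.items())[:5])
# e.g. [('the', 1), ('and', 2), ('i', 3), ('to', 4), ('a', 5)]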

Step 5. Generating Embeddings/Vectorization

Now we will vectorize each line in our corpus. Since we cannot feed Machine/Deep Learning models with unstructured text, this is an imperative step: every word must be mapped to an integer index (the Embedding layer of the model will later turn these indices into dense vectors). Firstly, we convert each line into a sequence of token indices using the Tokenizer's texts_to_sequences() method and expand it into n-gram subsequences, so that every prefix of a line becomes a training example. Then we compute the length of the longest sequence; finally, we pad all sequences to that maximum length with leading zeros so as to ensure sequences of equal length.
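Before running the full loop below, here is a minimal, self-contained illustration of the n-gram expansion on a toy two-line corpus; the token indices it prints come from the toy Tokenizer and will not match the poem corpus:

Python3

# A toy illustration of the n-gram expansion used below
from tensorflow.keras.preprocessing.text import Tokenizer

toy_corpus = ["stay i said", "to the cut flowers"]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_corpus)

token_list = toy_tokenizer.texts_to_sequences(["stay i said"])[0]
# Every prefix of length >= 2 becomes one training sequence
ngrams = [token_list[:i+1] for i in range(1, len(token_list))]
print(ngrams)  # e.g. [[1, 2], [1, 2, 3]]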

Python3




# Converting each line into n-gram sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
 
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences,
                                         maxlen=max_sequence_len,
                                         padding='pre'))
# Splitting each sequence into predictors (all tokens but the last)
# and label (the last token), one-hot encoded over the vocabulary
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = ku.to_categorical(label, num_classes=total_words+1)


This is what our padded sequences look like:

array([[   0,    0,    0, …,    0,    0,  266],
       [   0,    0,    0, …,    0,  266,    3],
       [   0,    0,    0, …,    0,    0,    4],
       …,
       [   0,    0,    0, …,    8, 3807,   15],
       [   0,    0,    0, …, 3807,   15,    4],
       [   0,    0,    0, …,   15,    4,  203]], dtype=int32)
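It is worth sanity-checking the shapes before training: each padded sequence contributes all but its last token as the predictors and its last token, one-hot encoded, as the label. A quick check (the exact row count depends on the corpus):

Python3

# predictors: all tokens except the last of each padded sequence
# label: one-hot encoding of the last token over the vocabulary
print(predictors.shape)  # (num_sequences, max_sequence_len - 1)
print(label.shape)       # (num_sequences, total_words + 1)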

Step 6. Building the Bi-directional LSTM Model 

By now, we are done with all the pre-processing required to feed the text to our model. It's time to start building the model. Since this is a text-generation use case, we will create a Bi-directional LSTM model, which reads each sequence in both directions and thus captures context on both sides of every word.

Python3




# Building a Bi-Directional LSTM Model
model = Sequential()
model.add(Embedding(total_words+1, 100,
                    input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words+1, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
print(model.summary())


The summary of the model is as follows:

Model: "sequential"
_________________________________________________________________
 Layer (type)                  Output Shape             Param #   
=================================================================
 embedding (Embedding)         (None, 15, 100)          380800    
 bidirectional (Bidirectional) (None, 15, 300)          301200    
 dropout (Dropout)             (None, 15, 300)          0         
 lstm_1 (LSTM)                 (None, 100)              160400    
 dense (Dense)                 (None, 3807)             384507    
 dense_1 (Dense)               (None, 3808)             14500864  
=================================================================
Total params: 15,727,771
Trainable params: 15,727,771
Non-trainable params: 0
_________________________________________________________________
None
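As a quick sanity check of the summary, the Embedding layer holds (total_words + 1) × 100 = 3808 × 100 = 380,800 parameters, while the final softmax Dense layer alone contributes 3807 × 3808 weights + 3808 biases = 14,500,864 parameters. In other words, the bulk of the model's 15.7 million parameters sit in the output layer, as is typical when predicting over a large vocabulary.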

The model works on a next-word-prediction approach: given a seed text as input, it generates poetry by repeatedly predicting the subsequent word. This is why the output layer uses a softmax activation, which is standard for multi-class classification.
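Greedy argmax decoding (used in Step 8 below) always picks the single most likely word, which can make generated poetry repetitive. A common alternative is to sample the next word from the softmax distribution with a temperature parameter. The sketch below is our own addition, not part of the original pipeline; sample_next_index and temperature are names we introduce for illustration:

Python3

# A minimal sketch of temperature sampling as an alternative to argmax.
# temperature < 1 sharpens the distribution (safer word choices),
# temperature > 1 flattens it (more adventurous word choices).
import numpy as np

def sample_next_index(probs, temperature=1.0):
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-9) / temperature
    exp_logits = np.exp(logits)
    exp_logits /= exp_logits.sum()
    return np.random.choice(len(exp_logits), p=exp_logits)

# Usage inside the generation loop of Step 8, replacing np.argmax:
# probs = model.predict(token_list, verbose=0)[0]
# predicted = sample_next_index(probs, temperature=0.8)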

Step 7. Model Training

Having built the model architecture, we'll now train it on our pre-processed text. Here, we train the model for 150 epochs.

Python3




history = model.fit(predictors, label, epochs=150, verbose=1)


The last few training epochs are shown below:

Epoch 145/150
510/510 [==============================] - 132s 258ms/step - loss: 3.3349 - accuracy: 0.8555
Epoch 146/150
510/510 [==============================] - 130s 254ms/step - loss: 3.2653 - accuracy: 0.8561
Epoch 147/150
510/510 [==============================] - 129s 253ms/step - loss: 3.1789 - accuracy: 0.8696
Epoch 148/150
510/510 [==============================] - 127s 250ms/step - loss: 3.1063 - accuracy: 0.8727
Epoch 149/150
510/510 [==============================] - 128s 251ms/step - loss: 3.0314 - accuracy: 0.8787
Epoch 150/150

We see that an accuracy of about 87% has been reached, which is pretty decent.
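Since model.fit() returns a History object, we can also plot how the loss and accuracy evolved over the 150 epochs (matplotlib was already imported in Step 1):

Python3

# Plotting the training curves recorded by model.fit()
plt.figure(figsize=(8, 4))
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()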

It is recommended that you train the model on a GPU-enabled machine. If your system does not have a GPU, you can make use of Google Colab or Kaggle notebooks.

Step 8. Generating Text using the Built Model

In the final step, we will generate poetry using our model. As stated earlier, the model is based on a next-word prediction approach, so we need to provide it with some seed text.

Python3




seed_text = "The world"
next_words = 25
 
for _ in range(next_words):
    # Vectorize the current text and pad it like the training data
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list],
                               maxlen=max_sequence_len-1,
                               padding='pre')
 
    # Predict the index of the most likely next word
    predicted = np.argmax(model.predict(token_list,
                                        verbose=0), axis=-1)[0]
 
    # Map the predicted index back to its word
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
 
    seed_text += " " + output_word
 
print(seed_text)


Output :

The world seems bright and gay and laid them all from your lip and the liffey from the bar blackwater white and free scholar vicar laundry laurel

Finally, we have built a model from scratch that generates poetry given an input seed text. The model can be made to generate even better results by using a larger training dataset and tuning the hyperparameters.


