
Autocorrector Feature Using NLP In Python

Last Updated : 21 Dec, 2022

Autocorrect is a feature that predicts and corrects misspelled words, which makes tasks like writing paragraphs, reports, and articles easier. Today, many websites and social media platforms use this concept to make their web apps more user-friendly.

So, here we are using Machine Learning and NLP to build an autocorrect generator that suggests the correct spelling for an input word. We will be using the Python programming language for this.

Let’s move ahead with the project.

We will be using the NLTK library for the implementation of NLP-related tasks.

To import NLTK and download its data, use the commands below:

import nltk
nltk.download('all')

Then the first task is to import the text file we will be using to create the word list of correct words.

You can download the text file from this link.

Python3
# importing regular expression
import re

# words
w = []

# reading text file
with open('final.txt', 'r', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # a raw string avoids the invalid-escape warning for \w
    w = re.findall(r'\w+', file_name_data)

# vocabulary
main_set = set(w)
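
As a quick sanity check (a minimal sketch; it assumes final.txt was loaded as above, and the printed numbers will depend on your copy of the file), you can inspect the size of the word list and the vocabulary:

Python3
# total number of word tokens and distinct words in the corpus
print("Total words:", len(w))
print("Unique words:", len(main_set))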


Now we have to count the words and store their frequencies. For that, we will use a dictionary.

Python3
# Function to count the frequency
# of the words in the whole text file
def counting_words(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count
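
counting_words behaves like the standard library's collections.Counter; a quick check on a toy list (this comparison is an illustrative aside, not part of the original article):

Python3
from collections import Counter

sample = ['the', 'cat', 'the', 'hat']
print(counting_words(sample))                           # {'the': 2, 'cat': 1, 'hat': 1}
print(counting_words(sample) == dict(Counter(sample)))  # True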


Then, to calculate the probability of each word, the prob_cal function is used.

Python3
# Calculating the probability of each word
def prob_cal(word_count_dict):
    probs = {}
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs
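
Since each probability is a word's count divided by the total number of tokens, the returned values should sum to 1. A minimal check on a toy dictionary (the words and counts here are illustrative):

Python3
toy_counts = {'apple': 3, 'banana': 1}
toy_probs = prob_cal(toy_counts)
print(toy_probs)                # {'apple': 0.75, 'banana': 0.25}
print(sum(toy_probs.values()))  # 1.0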


The rest of the code is divided into 5 main parts, which cover the generation of all the different candidate words that are possible.

To do this, we can use : 

  1. Lemmatization
  2. Deletion of letter
  3. Switching Letter
  4. Replace Letter
  5. Insert new Letter

Let’s see the code implementation of each point.

To do lemmatization, we will be using the pattern module. You can install it using the command below:

pip install pattern

Then you can run the code below.

Python3
# LemmWord: extracting the root word
# i.e. the lemma, using the pattern module
from pattern.en import lemma


def LemmWord(word):
    # lemmatize every word in the input string
    return [lemma(wd) for wd in word.split()]
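
Note that on Python 3.7 and newer, the very first call into pattern.en often fails with a RuntimeError (or StopIteration) because of a known bug in the library. A commonly reported workaround, shown here as an assumption rather than part of the original article, is to trigger that first call inside a try/except:

Python3
# pattern.en quirk: the first call may fail on Python 3.7+,
# so trigger it once and ignore the error
try:
    lemma('walking')
except (RuntimeError, StopIteration):
    pass

print(LemmWord("walking talked"))  # expected: ['walk', 'talk']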


DeleteLetter : Function that removes one letter at a time from a given word.

Python3
# Deleting letters from the words
def DeleteLetter(word):
    delete_list = []
    split_list = []

    # splitting the word into every possible
    # (prefix, suffix) pair
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # dropping the first letter of each suffix,
    # i.e. leaving out one letter at a time
    for a, b in split_list:
        delete_list.append(a + b[1:])
    return delete_list
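
For example, deleting one letter at a time from "word" gives four candidates:

Python3
print(DeleteLetter("word"))
# ['ord', 'wrd', 'wod', 'wor']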


Switch_ : This function swaps two adjacent letters of the word.

Python3
# Switching two adjacent letters in a word
def Switch_(word):
    split_list = []
    switch_l = []

    # creating (prefix, suffix) pairs of the word
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # keeping the prefix (i.e. a) and swapping
    # the first two characters of the suffix (i.e. b)
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_list if len(b) >= 2]
    return switch_l
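
For example, swapping adjacent letters in "abc" yields two candidates:

Python3
print(Switch_("abc"))
# ['bac', 'acb']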


Replace_ : It replaces each letter of the word, one position at a time, with every letter of the alphabet.

Python3
def Replace_(word):
    split_l = []
    replace_list = []

    # Replacing the letter one-by-one from the list of alphs
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '')
                    for a, b in split_l if b for l in alphs]
    return replace_list
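
Each of the len(word) positions is tried with all 26 letters, so a 3-letter word produces 78 candidates (including a few copies of the original word itself):

Python3
candidates = Replace_("can")
print(len(candidates))      # 78 (3 positions x 26 letters)
print('van' in candidates)  # True
print('cat' in candidates)  # True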


insert_ : It inserts each letter of the alphabet (one-by-one) at every possible position in the word.

Python3
def insert_(word):
    split_l = []
    insert_list = []

    # Making pairs of the split words
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))

    # Storing new words in a list
    # But one new character at each location
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in alphs]
    return insert_list
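
There are len(word) + 1 insertion points, each tried with 26 letters, so a 2-letter word produces 78 candidates:

Python3
candidates = insert_("at")
print(len(candidates))      # 78 (3 positions x 26 letters)
print('cat' in candidates)  # True
print('ant' in candidates)  # True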


Now we have implemented all the steps. It’s time to merge all the candidate words produced by the edit functions above.

To implement that, we will use two different functions.

Python3
# Collecting all the words
# in a set (so that no word will repeat)
def colab_1(word, allow_switches=True):
    suggestions = set()
    suggestions.update(DeleteLetter(word))
    if allow_switches:
        suggestions.update(Switch_(word))
    suggestions.update(Replace_(word))
    suggestions.update(insert_(word))
    return suggestions


# collecting all the words that are two
# edits away by applying colab_1 twice
def colab_2(word, allow_switches=True):
    suggestions = set()
    edit_one = colab_1(word, allow_switches=allow_switches)
    for candidate in edit_one:
        if candidate:
            edit_two = colab_1(candidate, allow_switches=allow_switches)
            suggestions.update(edit_two)
    return suggestions
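
A quick look at how fast the candidate space grows (the exact counts depend on how many duplicates the set removes):

Python3
one_edit = colab_1("word")
two_edits = colab_2("word")
print(len(one_edit))   # a few hundred candidates one edit away
print(len(two_edits))  # tens of thousands of candidates two edits away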


Now, the main task is to extract the correct words among all the candidates. To do so, we will use the get_corrections function.

Python3
# Only storing those values which are in the vocab
def get_corrections(word, probs, vocab, n=2):
    # if the word itself is in the vocabulary it is its own
    # correction; otherwise fall back to one-edit candidates,
    # then to two-edit candidates
    suggested_words = list(
        ({word} & vocab)
        or colab_1(word).intersection(vocab)
        or colab_2(word).intersection(vocab))

    # ranking the candidates by probability, highest first
    best_suggestion = sorted([[s, probs[s]] for s in suggested_words],
                             key=lambda x: x[1], reverse=True)
    return best_suggestion[:n]
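
A small demonstration on a hypothetical toy vocabulary (the words and probabilities below are illustrative, not from the article's corpus):

Python3
toy_vocab = {'dare', 'dared', 'dead'}
toy_probs = {'dare': 0.5, 'dared': 0.3, 'dead': 0.2}
print(get_corrections('daed', toy_probs, toy_vocab, n=2))
# [['dared', 0.3], ['dead', 0.2]]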


Now the code is ready, and we can test it on any user input with the code below.

Let’s print the top 3 suggestions made by the autocorrector.

Python3
# Input
my_word = input("Enter any word:")

# Counting word frequencies over the full word list
# (not the set, which would make every count equal to 1)
word_count = counting_words(w)

# Calculating probability
probs = prob_cal(word_count)

# only storing correct words
tmp_corrections = get_corrections(my_word, probs, main_set, 3)
for word_prob in tmp_corrections:
    print(word_prob[0])


Output : 

Enter any word:daedd
dared
daned
died

Conclusion

So, we have implemented a basic autocorrector using the NLTK library and Python. As a next step, we could build a more advanced autocorrect system that uses a larger dataset and works more efficiently.

To enhance accuracy, we can also use transformers and more NLP-related techniques such as n-grams, TF-IDF, and so on.


