Python – Preprocessing of Tamil Text

Last Updated : 27 Sep, 2021

Preprocessing is a major part of Natural Language Processing (NLP). Clean data plays a major role in classifying text accurately, so preprocessing is the first step before any analysis or classification. Many Python libraries support preprocessing for English, but very few are available for Tamil. This article walks through a few preprocessing techniques for Tamil text.

The preprocessing techniques covered in this article are:

Punctuation removal
Tokenization
Stop words removal

Punctuation removal:

Python3

# Import Python's built-in string module
import string
# Print the built-in constant that lists the ASCII punctuation characters
print(string.punctuation)

Output:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

If the text contains any of the punctuation characters above, they are removed during preprocessing. This can be done with Python's string module.

Python3

# Function for removing punctuation
def punctuation_remove(text_data):
    # Keep only the characters that are not in string.punctuation
    cleaned = "".join([t for t in text_data if t not in string.punctuation])
    return cleaned

# Passing input to the function
punctuation_removed = punctuation_remove("வெற்றி *பெற வேண்டும், என்ற பதற்றம் ^இல்லாமல் _இருப்பது தான் 'வெற்றி பெறுவதற்கான சிறந்த வழி.")
print(punctuation_removed)

Output:

வெற்றி பெற வேண்டும் என்ற பதற்றம் இல்லாமல் இருப்பது தான் வெற்றி பெறுவதற்கான சிறந்த வழி

Explanation: All the punctuation characters in the given text are removed.
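
Note that string.punctuation lists only ASCII characters, so marks such as curly quotes (“ ”), dashes (–) or the ellipsis (…) pass through the function above unchanged. As a minimal sketch, and not part of the original example, such characters could be filtered by their Unicode general category instead. The name remove_unicode_punctuation is a hypothetical helper introduced here for illustration.

```python
import unicodedata

def remove_unicode_punctuation(text_data):
    # Drop characters whose Unicode general category is punctuation (P*)
    # or symbol (S*); Tamil letters (Lo) and vowel signs (Mc/Mn) are kept.
    return "".join(
        ch for ch in text_data
        if unicodedata.category(ch)[0] not in ("P", "S")
    )

# Hypothetical sample containing non-ASCII punctuation
sample = "“வெற்றி” – பெற வேண்டும்…"
print(remove_unicode_punctuation(sample))
```

Removing a free-standing dash leaves two adjacent spaces behind, so the result may still need its whitespace normalised before further processing.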

Tokenization:

Tokenization splits a sentence into individual words (tokens), which are then used in later steps such as classification. Here, Python's regular expression module (re) is used to convert the text into tokens.

Python3

# Importing Python's regular expression module
import re

# Function for tokenization
def tokenization(text_data):
    # Split the sentence into words wherever a space is found
    tokens_text = re.split(' ', text_data)
    return tokens_text

# Passing the punctuation-removed text as the parameter for tokenization
tokenized_text = tokenization(punctuation_removed)
print(tokenized_text)

Output:

['வெற்றி', 'பெற', 'வேண்டும்', 'என்ற', 'பதற்றம்', 'இல்லாமல்', 'இருப்பது', 'தான்', 'வெற்றி', 'பெறுவதற்கான', 'சிறந்த', 'வழி']

Explanation: All the words in a sentence are split into tokens.
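
Splitting on a single space works for the sample sentence, but it returns empty strings when the text contains consecutive spaces, and it does not split on tabs or newlines. Below is a minimal sketch of a more tolerant variant, assuming whitespace-delimited words are still the desired tokens. The name tokenize_whitespace is hypothetical and used only for illustration.

```python
import re

def tokenize_whitespace(text_data):
    # \S+ matches each maximal run of non-whitespace characters, so
    # repeated spaces, tabs and newlines never produce empty tokens.
    return re.findall(r"\S+", text_data)

# Hypothetical input with two spaces and a tab between words
text = "வெற்றி  பெற\tவேண்டும்"
print(re.split(' ', text))        # includes an empty string for the double space
print(tokenize_whitespace(text))
```

For plain whitespace splitting, text_data.split() with no arguments should give the same tokens without needing the re module.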

Stop words removal:

Stop words are words that occur very frequently in a language and usually contribute little to the meaning of a sentence. They can be removed using the NLTK package in Python. At the time of writing, NLTK ships stop word lists for many languages, such as English, French, German, Finnish and Italian, but not for Tamil. So, download the Tamil stop words from the given link (Github Link), name the file tamil, and place it in the following location on your system: "….\AppData\Roaming\nltk_data\corpora\stopwords"

After this step, NLTK can load the Tamil stop words as well.

Python3

# Importing the Natural Language Toolkit (NLTK) library
import nltk

# Loading all the Tamil stop words from the file 'tamil'
stopwords = nltk.corpus.stopwords.words('tamil')

# Function for removing stop words
def stopwords_remove(text_data):
    # Keep only the words that are not stop words
    removed = [s for s in text_data if s not in stopwords]
    return removed

# Passing tokenized text as parameter for removing stop words
stopwords_removed = stopwords_remove(tokenized_text)
print(stopwords_removed)

Output:

['வெற்றி', 'பெற', 'பதற்றம்', 'இல்லாமல்', 'இருப்பது', 'வெற்றி', 'பெறுவதற்கான', 'சிறந்த', 'வழி']

Explanation: The stop words ‘வேண்டும்’, ‘என்ற’ and ‘தான்’ are removed.
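
The three steps can also be chained without NLTK. The sketch below is an illustration under stated assumptions rather than the method used above. It assumes the stop words are stored in a UTF-8 text file with one word per line. The file name, the helper names load_stopwords and preprocess, and the three-word list are all hypothetical stand-ins (the three words are the ones removed in the example above).

```python
import string
import tempfile
from pathlib import Path

def load_stopwords(path):
    # Read one stop word per line; a set gives fast membership tests
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def preprocess(text_data, stop_words):
    # 1. Punctuation removal (ASCII punctuation only)
    cleaned = text_data.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization on runs of whitespace
    tokens = cleaned.split()
    # 3. Stop word removal
    return [t for t in tokens if t not in stop_words]

# Hypothetical stop word file, written to a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp:
    stop_file = Path(tmp) / "tamil_stopwords.txt"
    stop_file.write_text("வேண்டும்\nஎன்ற\nதான்\n", encoding="utf-8")
    stop_words = load_stopwords(stop_file)

result = preprocess("வெற்றி *பெற வேண்டும், என்ற பதற்றம் ^இல்லாமல்", stop_words)
print(result)
```

Here str.translate removes the same characters as the list comprehension used earlier, and keeping the stop words in a set keeps lookups fast when the list is long.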
