Python | Measure similarity between two sentences using cosine similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
Similarity = (A.B) / (||A|| * ||B||), where A and B are vectors, A.B is their dot product, and ||A|| is the length (Euclidean norm) of A.
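To make the formula concrete, here is a minimal sketch in plain Python that evaluates it for two small numeric vectors. The function name `cosine_similarity` and the sample vectors are illustrative assumptions and are not part of the program in this article.

```python
import math

def cosine_similarity(a, b):
    """Return (A.B) / (||A|| * ||B||) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample vectors chosen for easy arithmetic.
A = [1, 1, 0]
B = [1, 0, 0]
print(cosine_similarity(A, B))  # 1 / sqrt(2), roughly 0.7071
```

Vectors that point in the same direction score 1, and vectors with no overlap (a zero dot product) score 0.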
This program uses cosine similarity together with the nltk toolkit module, so nltk must be installed on your system before you run it. To install the nltk module, follow the steps below:
1. Open terminal(Linux).
2. sudo pip3 install nltk
3. python3
4. import nltk
5. nltk.download('all')
Functions used:
nltk.tokenize: It is used for tokenization. Tokenization is the process of dividing a large piece of text into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns them as a list.
nltk.corpus: In this program, it is used to get a list of stop words. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own, so it is removed before the sentences are compared.
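The two steps described above, tokenizing and then dropping stop words, can be sketched without nltk as follows. This is only an approximation: `str.split()` stands in for `word_tokenize`, and `SAMPLE_STOPWORDS` is a tiny hand-picked set, not nltk's actual English list. Both names, and the helper `content_words`, are hypothetical.

```python
# Stand-ins so the sketch runs without nltk or its downloaded data:
# str.split() approximates word_tokenize for simple sentences, and
# SAMPLE_STOPWORDS is a small hand-picked subset, not nltk's full list.
SAMPLE_STOPWORDS = {"the", "a", "an", "in", "is", "out"}

def content_words(sentence):
    """Split a sentence on whitespace and drop the sample stop words."""
    tokens = sentence.split()
    return [w for w in tokens if w not in SAMPLE_STOPWORDS]

print(content_words("Lights out is a horror movie"))
# ['Lights', 'horror', 'movie']
```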
Below is the Python implementation:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

X = "I love horror movies"
Y = "Lights out is a horror movie"

# Tokenize each sentence into a list of words
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# English stop words; the comparison below is case-sensitive
sw = stopwords.words('english')
l1 = []
l2 = []

# Remove stop words from each sentence
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

# Build a 0/1 vector for each sentence over the combined vocabulary
rvector = X_set.union(Y_set)
for w in rvector:
    l1.append(1 if w in X_set else 0)
    l2.append(1 if w in Y_set else 0)

# Dot product of the two vectors
c = 0
for i in range(len(rvector)):
    c += l1[i] * l2[i]

# Cosine formula; for a 0/1 vector, sum(l) equals its squared norm
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity:", cosine)
Output:
similarity: 0.2886751345948129
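Because every entry of l1 and l2 is 0 or 1, the dot product is simply the number of words the two sentences share, and each squared norm is the size of the corresponding word set. The sketch below uses that shortcut. The function name `set_cosine` is hypothetical, the handling of empty sets is an assumption, and the two sets are typed out by hand from the example rather than computed with nltk.

```python
def set_cosine(x_set, y_set):
    """Cosine similarity of the 0/1 vectors implied by two word sets."""
    if not x_set or not y_set:
        return 0.0  # assumption: an empty set has no similarity to anything
    shared = len(x_set & y_set)
    return shared / (len(x_set) * len(y_set)) ** 0.5

# Filtered word sets from the example. "I" survives filtering because the
# stop word check is case-sensitive and the list contains lowercase "i".
X_set = {"I", "love", "horror", "movies"}
Y_set = {"Lights", "horror", "movie"}
print("similarity:", set_cosine(X_set, Y_set))  # 1 / sqrt(12), about 0.2887
```

Only "horror" is shared, so the score is 1 / sqrt(4 * 3). Note that "movie" and "movies" count as different words in this approach.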
Last Updated: 11 Jan, 2023