How to Make an Email Extractor in Python?

Last Updated : 21 Mar, 2024

In this article, we will see how to extract all the valid email addresses from a text using Python and regular expressions.

  • A regular expression (shortened as regex or regexp, and sometimes called a rational expression) is a sequence of characters that defines a search pattern. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
  • It is a technique developed in theoretical computer science and formal language theory.
  • The re module provides full support for Perl-like regular expressions in Python. It offers a set of functions for searching a string for a match.
  • The re.findall() function defined in the re module takes a pattern and a string (plus an optional flags argument) and returns a list of all the non-overlapping matches found.

Syntax: re.findall(regex, string)

Parameters: 

  • regex is the regular expression, built from predefined symbols, that describes the pattern we are looking for.
  • string is the original string that will be searched.

After importing the re module, we call its findall() function to find all the substrings that match the regular expression passed as the first argument.
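Before moving on to email addresses, here is a minimal sketch of how re.findall() behaves on a simpler pattern. The sample string and the digit pattern are illustrative assumptions and are not part of the email extractor itself.

```python
import re

# Hypothetical sample string, used only to illustrate re.findall().
sample = "Order 66 shipped on day 12 of month 5."

# \d+ matches one or more consecutive digits.
numbers = re.findall(r"\d+", sample)

# findall returns every non-overlapping match, in order, as a list of strings.
print(numbers)  # ['66', '12', '5']
```

The result is always a list; if nothing matches, the list is empty.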

The regular expression used in the examples below can be divided into three parts:

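1. r"[A-Za-z0-9_%+-.]+"

This expression matches the part before the @ sign: a continuous sequence of one or more characters drawn from the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, and the special characters _, %, +, - and . (as written, +-. is read as a range from + to ., so a comma is accepted as well). The trailing ‘+’ is a quantifier meaning “one or more of the preceding characters”; it does not append one regex to another. The three raw strings are joined into a single pattern because Python concatenates adjacent string literals.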

2. r"@[A-Za-z0-9.-]+"

This expression matches the @ sign followed by a continuous sequence of one or more characters drawn from the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, and the special characters . and -. The trailing ‘+’ is a quantifier meaning “one or more of the preceding characters”.

3. r"\.[A-Za-z]{2,5}"

This expression matches a literal dot followed by a continuous sequence of uppercase letters A-Z or lowercase letters a-z whose length is between 2 and 5, both inclusive. It corresponds to the final part of the domain, such as .com or .in.
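One detail of the first part is easy to miss: inside a character class, a hyphen between two characters defines a range, so +-. covers every character from + to ., which includes the comma. The sketch below checks this and compares it with a variant that moves the hyphen to the end of the class so it is taken literally. The names ORIGINAL_LOCAL and STRICT_LOCAL are hypothetical, and the stricter variant is a suggestion, not the pattern the article uses.

```python
import re

# First part of the pattern exactly as written in the article.
ORIGINAL_LOCAL = r"[A-Za-z0-9_%+-.]+"

# Hypothetical variant: the hyphen is last in the class, so it is a literal "-".
STRICT_LOCAL = r"[A-Za-z0-9_%+.-]+"

sample = "a,b"

# fullmatch succeeds only if the whole string is made of allowed characters.
original_accepts_comma = re.fullmatch(ORIGINAL_LOCAL, sample) is not None
strict_accepts_comma = re.fullmatch(STRICT_LOCAL, sample) is not None

print(original_accepts_comma)  # True
print(strict_accepts_comma)    # False
```

For the sample text used in this article both variants find the same addresses; the difference only shows up when a comma sits directly next to the part before the @ sign.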

Example 1: Extract valid emails from a string

Python3




# Import the regex module
import re

# Raw text
text = ("Duis info@geeksforgeeks.com convallis. Parturient montes nascetur ridiculus mus "
        "geeksforgeeks@rocks.xyz mauris. Odio eu feugiat pre@rsos_tium.index nibh ipsum consequat love@gfg.in "
        "pretium aenean pharetra magna ac placerat. Vitae justo eget magna fermentum iaculis eu non.")

# Find all valid emails using the three-part regex
reg = re.findall(r"[A-Za-z0-9_%+-.]+"
                 r"@[A-Za-z0-9.-]+"
                 r"\.[A-Za-z]{2,5}", text)

# Print all the valid emails found
print(reg)


Output:

['info@geeksforgeeks.com', 'geeksforgeeks@rocks.xyz', 'love@gfg.in']
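If the same address appears more than once in a text, findall() returns it each time. The following sketch, which assumes the same pattern as Example 1, compiles the pattern once and removes duplicates while keeping first-seen order. The function name extract_unique_emails and the sample sentence are hypothetical.

```python
import re

# Same pattern as in Example 1, compiled once so it can be reused.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9_%+-.]+"
    r"@[A-Za-z0-9.-]+"
    r"\.[A-Za-z]{2,5}"
)


def extract_unique_emails(text):
    """Return the matched addresses in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))


# Hypothetical sample sentence with one repeated address.
sample = "Write to love@gfg.in or info@geeksforgeeks.com, then love@gfg.in again."
unique_emails = extract_unique_emails(sample)
print(unique_emails)  # ['love@gfg.in', 'info@geeksforgeeks.com']
```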

Example 2: Extract valid emails from a text file

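Using the open() function, we open the required file in “r” (read-only) mode. Each line is stripped of leading and trailing whitespace and then processed in the same way as in the first example, with the matches from every line collected into a single list. The output below assumes that sample.txt contains the same text as Example 1.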

Python3




# Import the regex module
import re

# Collect the matches from every line of the file
emails = []

with open('sample.txt', 'r') as file:
    for line in file:
        line = line.strip()

        # Find all valid emails on this line and keep them
        emails.extend(re.findall(r"[A-Za-z0-9_%+-.]+"
                                 r"@[A-Za-z0-9.-]+"
                                 r"\.[A-Za-z]{2,5}", line))

# Print all the valid emails found in the file
print(emails)


Output:

['info@geeksforgeeks.com', 'geeksforgeeks@rocks.xyz', 'love@gfg.in']
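Since sample.txt itself is not shown, the following sketch writes a small stand-in file to a temporary location and runs the same line-by-line extraction on it, so it can be tried without preparing any file first. The function name extract_emails_from_file and the file contents are assumptions made for illustration.

```python
import os
import re
import tempfile

# Same three-part pattern as in the examples, written as one string.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9_%+-.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,5}")


def extract_emails_from_file(path):
    """Collect every match from every line of the file at `path`."""
    emails = []
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            emails.extend(EMAIL_PATTERN.findall(line))
    return emails


# Write a small stand-in file so the sketch runs without sample.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Duis info@geeksforgeeks.com convallis.\n")
    tmp.write("No address on this line.\n")
    tmp.write("Consequat love@gfg.in pretium.\n")
    tmp_path = tmp.name

found = extract_emails_from_file(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(found)  # ['info@geeksforgeeks.com', 'love@gfg.in']
```

Lines without an address simply contribute nothing, because findall() returns an empty list for them.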

