Extract the HTML code of the given tag and its parent using BeautifulSoup
Last Updated :
16 Mar, 2021
In this article, we will discuss how to extract the HTML code of the given tag and its parent using BeautifulSoup.
Modules Needed
First, we need to install all these modules on our computer.
- BeautifulSoup: Our primary module. It parses HTML and XML documents and provides methods to search and navigate the parsed tree.
pip install bs4
- lxml: A parser library that BeautifulSoup can use to process webpages in Python.
pip install lxml
- requests: Makes sending HTTP requests straightforward. We use it to download the webpage.
pip install requests
Scraping A Sample Website
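- We import the BeautifulSoup and requests modules, download the page whose address is stored in URL with requests.get(), and parse the response with the lxml parser. Some websites reject requests that arrive without a browser-like user agent, so you may also need to pass a headers dictionary containing a User-Agent to requests.get(). Otherwise the target website can treat traffic from our program as spam and block it.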
Python3
from bs4 import BeautifulSoup
import requests

# URL must hold the address of the page to scrape; it is not defined in this snippet.
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, "lxml")
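The fetch step can be wrapped in a small helper that sends a User-Agent header. This is a minimal sketch, not the article's own code: the names `USER_AGENT`, `build_headers` and `fetch_soup` are hypothetical, and the user agent string is a made-up placeholder that you would replace with one describing your client.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholder; replace it with a string that identifies your client.
USER_AGENT = "my-scraper/0.1"


def build_headers(user_agent=USER_AGENT):
    """Return the request headers as a plain dictionary."""
    return {"User-Agent": user_agent}


def fetch_soup(url, user_agent=USER_AGENT, timeout=10):
    """Download url and return the parsed document."""
    response = requests.get(url, headers=build_headers(user_agent), timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "lxml")
```

Calling `fetch_soup(URL)` would then give the same kind of `soup` object used in the rest of the article.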
- To target an element, right-click it in the browser and choose Inspect. In the inspector window, look for an HTML attribute that distinguishes the element from all others on the page. Most of the time this is the element's id.
Here, we extract the HTML of the page title using its id.
Python3
title = soup.find("h1", attrs={"id": "firstHeading"})
print(title)
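The lookup by id can be tried without a network connection. The sketch below parses a made-up HTML string (an assumption standing in for a downloaded page) and uses Python's built-in "html.parser" so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
  <h1 id="firstHeading">Sample <i>Page</i></h1>
  <p id="intro">Hello</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the tag up by its id attribute.
title = soup.find("h1", attrs={"id": "firstHeading"})

# str() gives the tag's HTML, including its children.
title_html = str(title)
print(title_html)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("h1", attrs={"id": "no-such-id"})
print(missing is None)
```

Because `find()` returns `None` when no element matches, a check such as `if title is not None:` avoids an `AttributeError` on pages that lack the expected id.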
- To extract the text content of the tag, we can use the .get_text() method:
Python3
cont = title.get_text()
print(cont)
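As a small illustration of what .get_text() returns, the sketch below runs it on a made-up heading (an assumption, not the real page) that contains a nested tag and stray whitespace. The `separator` and `strip` arguments are optional parameters of `get_text()`.

```python
from bs4 import BeautifulSoup

# Made-up heading with a nested <i> tag and extra whitespace.
html = '<h1 id="firstHeading">  Sample <i>Page</i> </h1>'
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", attrs={"id": "firstHeading"})

# Without arguments, the text of all nested strings is joined as-is.
raw_text = title.get_text()

# With strip=True, each piece is stripped and then joined with the separator.
clean_text = title.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```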
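- To extract the HTML of an element's parent, let's take the example of a span with the id "Machine_learning_approaches".
We use the .parent attribute, which returns the enclosing tag; printing it shows the parent's HTML. Note that .parent is an attribute, not a method: writing .parent() calls the parent tag, which is a shortcut for find_all() and returns a list of the tags inside the parent instead of the parent itself.
Python3
# .parent is an attribute, so it is used without parentheses
parent = soup.find("span", attrs={"id": "Machine_learning_approaches"}).parent
print(parent)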
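The difference between reading `.parent` and calling `.parent()` can be seen on a small made-up document (an assumption standing in for the real page), again parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up HTML: the span sits inside an <h2>, which sits inside a <div>.
html = """
<div id="content">
  <h2><span id="Machine_learning_approaches">Approaches</span></h2>
  <p>Body text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", attrs={"id": "Machine_learning_approaches"})

# .parent is the tag that directly encloses the span.
parent = span.parent
parent_html = str(parent)
print(parent.name)
print(parent_html)

# Calling a tag is a shortcut for find_all(), so this returns a list of the
# tags found inside the parent rather than the parent itself.
descendants = span.parent()
print(len(descendants))
```

Here `parent` is the `<h2>` tag, while `descendants` is a list containing only the span.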
Below is the complete program:
Python3
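from bs4 import BeautifulSoup
import requests

# URL must hold the address of the page to scrape; it is not defined in this snippet.
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, "lxml")

# HTML of the tag with the id "firstHeading"
title = soup.find("h1", attrs={"id": "firstHeading"})
print(title)

# Text content of that tag
cont = title.get_text()
print(cont)

# HTML of the span's parent; .parent is an attribute, not a method
parent = soup.find("span", attrs={"id": "Machine_learning_approaches"}).parent
print(parent)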