Convert HTML source code to JSON Object using Python
Last Updated: 03 Mar, 2021
In this post, we will see how to convert HTML source code into a JSON object. JSON is easy to transfer and is supported by most modern programming languages. In JavaScript, for example, a JSON string can be parsed directly into an object, which can then be used to build the HTML for a web page.
We will use the xmltojson module in this post. Its parse function takes the HTML as input and returns the parsed result as a JSON string.
Syntax: xmltojson.parse(xml_input, xml_attribs=True, item_depth=0, item_callback)
Parameters:
- xml_input is the markup to parse; it can be either a file or a string.
- xml_attribs controls attributes: if True, they are included in the output; if False, they are ignored.
- item_depth is the depth of children at which the item_callback function is called.
- item_callback is the callback function called for each item found at item_depth.
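To make the XML-to-JSON mapping more concrete, the following is a minimal sketch written with the standard library only. It is not xmltojson's implementation; the helper names element_to_dict and xml_to_json are hypothetical, and the "#text" key for mixed content is an assumption of this sketch. It only illustrates the convention visible in the output later in this post: attributes become keys prefixed with "@", and repeated tags are collected into a list.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert one element to a plain value (hypothetical helper)."""
    # Attributes become keys prefixed with "@".
    node = {"@" + name: value for name, value in element.attrib.items()}

    for child in element:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag is collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value

    text = (element.text or "").strip()
    if not node:
        # A leaf without attributes collapses to its text (None when empty).
        return text or None
    if text:
        # Assumption of this sketch: leading text is stored under "#text".
        node["#text"] = text
    return node


def xml_to_json(xml_string):
    """Parse well-formed markup and return a JSON string."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)})


sample = (
    '<div><h1>Geeks For Geeks</h1>'
    '<input type="text" placeholder="Enter your name"/>'
    '<input type="button" value="submit"/></div>'
)
print(xml_to_json(sample))
```

The two input elements share a tag, so they end up in one list under the "input" key, each with its attributes stored as "@type", "@placeholder" and "@value".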
Environment Setup:
Install the required modules:
pip install xmltojson
pip install requests
Steps:
- Import the required modules.
Python3
import xmltojson
import json
import requests
- Fetch the HTML code and save it into a file.
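Python3
# Placeholder URL (not part of the original code); replace it with the page to convert.
url = "https://example.com"
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}
# Download the page and save its HTML source to a file.
html_response = requests.get(url=url, headers=headers)
with open("sample.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_response.text)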
- Use the parse function to convert this HTML into JSON. Open the HTML file and use the parse function of xmltojson module.
Python3
# Read the saved HTML and convert it to a JSON string.
with open("sample.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()
json_ = xmltojson.parse(html)
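One caveat: the module's name and its xml_input parameter suggest that it expects XML, so HTML that is not well-formed XML (for example, an unclosed br tag) will probably be rejected. The sketch below is an assumption about how one might check a page beforehand. It uses the standard library's ElementTree rather than xmltojson itself, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The self-closed <br/> is valid XML; the bare <br> is never closed.
print(is_well_formed("<p>Line one<br/>Line two</p>"))
print(is_well_formed("<p>Line one<br>Line two</p>"))
```

If the check fails, the page would need to be cleaned up into well-formed markup before it is passed to the parse function.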
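- The json_ variable contains a JSON string that we can print or save to a file. Because it is already serialized, we decode it with json.loads before passing it to json.dump; otherwise the string would be encoded a second time.
Python3
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(json.loads(json_), file)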
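The short sketch below shows what happens when a string that is already JSON is serialized again. The json_string value is a made-up stand-in for the string returned by the parse function.

```python
import json

# Stand-in for the JSON string returned by the parse function.
json_string = '{"title": "Document"}'

# Serializing the string again wraps it in quotes and escapes the inner ones.
double_encoded = json.dumps(json_string)
# Decoding that once gives back a str, not a dict.
decoded_once = json.loads(double_encoded)

# Decoding first and then serializing keeps the JSON clean.
clean = json.dumps(json.loads(json_string))

print(type(decoded_once).__name__)  # str
print(clean)                        # {"title": "Document"}
```

Writing the string to the file as-is with file.write would work equally well; going through json.loads has the side benefit of confirming that the string is valid JSON.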
Complete Code:
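Python3
import xmltojson
import json
import requests

# Placeholder URL (not part of the original code); replace it with the page to convert.
url = "https://example.com"
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}

# Download the page and save its HTML source to a file.
html_response = requests.get(url=url, headers=headers)
with open("sample.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_response.text)

# Read the saved HTML and convert it to a JSON string.
with open("sample.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()
json_ = xmltojson.parse(html)

# json_ is already a JSON string, so decode it before dumping to avoid double encoding.
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(json.loads(json_), file)
print(json_)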
Output:
{"html": {"@lang": "en", "head": {"title": "Document"}, "body": {"div": {"h1": "Geeks For Geeks", "p":
"Welcome to the world of programming geeks!", "input": [{"@type": "text", "@placeholder": "Enter your name"},
{"@type": "button", "@value": "submit"}]}}}}
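Once the JSON string is decoded, the page content can be read like any nested dictionary. The sketch below copies the output above into a string (the variable names are illustrative) and pulls out a few values. Attribute keys keep their "@" prefix.

```python
import json

# The output shown above, copied into a string for illustration.
json_string = (
    '{"html": {"@lang": "en", "head": {"title": "Document"}, '
    '"body": {"div": {"h1": "Geeks For Geeks", '
    '"p": "Welcome to the world of programming geeks!", '
    '"input": [{"@type": "text", "@placeholder": "Enter your name"}, '
    '{"@type": "button", "@value": "submit"}]}}}}'
)

data = json.loads(json_string)

# Element text is stored under the tag name.
title = data["html"]["head"]["title"]
# Attributes are stored under "@"-prefixed keys.
language = data["html"]["@lang"]
# The two <input> elements were collected into a list.
input_types = [field["@type"] for field in data["html"]["body"]["div"]["input"]]

print(title)
print(language)
print(input_types)
```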