Convert HTML source code to JSON Object using Python

Last Updated : 03 Mar, 2021

In this post, we will see how to convert HTML source code into a JSON object. JSON is easy to transfer and is supported by most modern programming languages. JavaScript, for example, can parse JSON directly into a JavaScript object and then use that data to generate the HTML for a web page.

We will use the xmltojson module in this post. Its parse function takes the HTML as input and returns the parsed JSON string.

Syntax: xmltojson.parse(xml_input, xml_attribs=True, item_depth=0, item_callback)

Parameters:

  • xml_input can be either a file or a string.
  • xml_attribs includes element attributes in the output when set to True and ignores them when set to False.
  • item_depth is the depth of elements at which the item_callback function is called.
  • item_callback is a callback function that is called for each item found at item_depth.
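
To make the attribute handling concrete, here is a minimal sketch of the kind of mapping such a conversion performs. It uses the standard library's xml.etree.ElementTree instead of xmltojson, so it is not the module's actual implementation. The helper names element_to_dict and markup_to_json and the include_attribs flag are hypothetical; the flag only loosely mirrors xml_attribs.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element, include_attribs=True):
    """Convert one element into a plain Python value (hypothetical helper)."""
    node = {}
    if include_attribs:
        # Attributes become keys prefixed with "@".
        for name, value in element.attrib.items():
            node["@" + name] = value
    for child in element:
        value = element_to_dict(child, include_attribs)
        if child.tag in node:
            # Repeated child tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if not node:
        # A leaf with no attributes or children collapses to its text.
        return text or None
    if text:
        node["#text"] = text
    return node


def markup_to_json(markup, include_attribs=True):
    """Parse well-formed markup and return a JSON string (hypothetical helper)."""
    root = ET.fromstring(markup)
    return json.dumps({root.tag: element_to_dict(root, include_attribs)})


sample = '<div><input type="text"/><input type="button"/></div>'
print(markup_to_json(sample))
print(markup_to_json(sample, include_attribs=False))
```

With attributes enabled, each input element becomes a dictionary keyed by "@type"; with them disabled, the same elements carry no data at all, which is why leaving attributes on is usually the safer default for HTML.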

Environment Setup:

Install the required modules:

pip install xmltojson
pip install requests

Steps:

  • Import the libraries

Python3




import xmltojson
import json
import requests


  • Fetch the HTML code and save it into a file.

Python3




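# Sample URL to fetch the html page
# Placeholder URL (assumption); replace it with the page you want to convert
url = "https://example.com"

# Headers to mimic the browser
# Adjacent string literals are joined, so no stray spaces end up in the header
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

# Get the page through get() method
html_response = requests.get(url=url, headers=headers)

# Save the page content as sample.html
# An explicit encoding avoids platform-dependent encoding errors
with open("sample.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_response.text)


  • Use the parse function to convert this HTML into JSON. Open the HTML file and use the parse function of xmltojson module.

Python3




# Read the file back with the same encoding it was written in
with open("sample.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()
    json_ = xmltojson.parse(html)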
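
Pages fetched from the web are often not well-formed XML (an unclosed tag such as <br> is enough), and a strict XML parser will reject them. Assuming xmltojson is strict in this way, a quick check with the standard library can flag problematic input before parsing. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>Line one<br/>Line two</p>"))  # True: <br/> is self-closed
print(is_well_formed("<p>Line one<br>Line two</p>"))   # False: <br> is never closed
```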


  • The json_ variable contains a JSON string that we can print or dump into a file.

Python3




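# json_ is already a JSON string, so load it first;
# dumping the string directly would encode it a second time
with open("data.json", "w") as file:
    json.dump(json.loads(json_), file, indent=4)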
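
Because json_ is already a JSON string, it matters how it is written to disk: passing the string itself to json.dump encodes it a second time, and the file then holds one quoted string instead of an object. The sketch below uses only the standard library and a hard-coded stand-in for json_ (an assumption) to show the difference.

```python
import json

# Hard-coded stand-in for the JSON string held in json_ (assumption)
json_ = '{"html": {"head": {"title": "Document"}}}'

written_directly = json_            # what file.write(json_) would store
dumped_again = json.dumps(json_)    # what json.dump(json_, file) would store

# Reading each version back shows what a consumer of data.json would get
print(type(json.loads(written_directly)).__name__)  # dict
print(type(json.loads(dumped_again)).__name__)      # str
```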


  • Print the output.

Python3




print(json_)


Complete Code:

Python3




import xmltojson
import json
import requests
  
  
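# Sample URL to fetch the html page
# Placeholder URL (assumption); replace it with the page you want to convert
url = "https://example.com"

# Headers to mimic the browser
# Adjacent string literals are joined, so no stray spaces end up in the header
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

# Get the page through get() method
html_response = requests.get(url=url, headers=headers)

# Save the page content as sample.html
with open("sample.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_response.text)

# Parse the saved HTML into a JSON string
with open("sample.html", "r", encoding="utf-8") as html_file:
    html = html_file.read()
    json_ = xmltojson.parse(html)

# json_ is already a JSON string, so load it before dumping
with open("data.json", "w") as file:
    json.dump(json.loads(json_), file, indent=4)

print(json_)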


Output:

{"html": {"@lang": "en", "head": {"title": "Document"}, "body": {"div": {"h1": "Geeks For Geeks", "p": "Welcome to the world of programming geeks!", "input": [{"@type": "text", "@placeholder": "Enter your name"}, {"@type": "button", "@value": "submit"}]}}}}
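
Once the result is a JSON string, it can be loaded into ordinary Python dictionaries and lists to pick out individual values. As a small sketch, the string below is the sample output hard-coded by hand; attribute keys carry the "@" prefix, and the two input elements arrive as a list.

```python
import json

# The sample output from above, hard-coded for illustration
output = (
    '{"html": {"@lang": "en", "head": {"title": "Document"}, '
    '"body": {"div": {"h1": "Geeks For Geeks", '
    '"p": "Welcome to the world of programming geeks!", '
    '"input": [{"@type": "text", "@placeholder": "Enter your name"}, '
    '{"@type": "button", "@value": "submit"}]}}}}'
)

data = json.loads(output)

# Element text is reached by following the tag names
print(data["html"]["head"]["title"])  # Document

# Repeated elements form a list; attributes use the "@" prefix
input_types = [field["@type"] for field in data["html"]["body"]["div"]["input"]]
print(input_types)  # ['text', 'button']
```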


