urllib.request provides functions and classes for retrieving data from the web. For example, we can use it to open a URL, read the response it returns, and send data to a URL.
Example of creating a request to retrieve data using urllib.request.urlopen():
import urllib.request

sample_url = "http://httpbin.org/xml"
response = urllib.request.urlopen(sample_url)
status_code = response.status  # check the status code
print(status_code)

# if no error, then read the response content
if status_code >= 200 and status_code < 300:
    print(response.getheaders())  # get the headers
    print(response.getheader("Content-length"))  # get the header value
    print(response.getheader("Content-Type"))  # get the content type
    # read the data from the URL
    data = response.read().decode("utf-8")
    print(data)
The printed result looks like this:
200
[('Date', 'Sat, 22 Jan 2022 20:31:43 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
522
application/xml
<?xml version='1.0' encoding='us-ascii'?>
<!-- A SAMPLE set of slides -->
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>
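The introduction also mentions sending (writing) data to a URL, which the example above does not cover. A minimal sketch of how that might look follows, assuming form-encoded data and httpbin's /post endpoint; the helper name build_post_request and the field names are hypothetical. The request is only built here, not sent, so the snippet runs without a network connection.

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Build (but do not send) a form-encoded POST request.

    build_post_request is a hypothetical helper, not part of urllib.
    """
    # urlencode produces a str; urlopen requires bytes for the request body
    body = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # supplying data makes urllib use POST instead of the default GET
    return urllib.request.Request(url, data=body, headers=headers)


# assumed endpoint and made-up field values, for illustration only
req = build_post_request("http://httpbin.org/post", {"name": "Ada", "lang": "python"})
print(req.get_method())
print(req.data)
```

Passing req to urllib.request.urlopen() would actually send it, and the response could then be read in the same way as in the example above.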
Parsing HTML
We can work with HTML data via the HTMLParser class in the html.parser module. For example:
# working with HTML data via the HTMLParser
from html.parser import HTMLParser

metacount = 0


# define a class that will handle various parts of an HTML file
# create a subclass of HTMLParser and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global metacount
        if tag == "meta":
            metacount += 1
        print("Encountered a start tag:", tag)
        pos = self.getpos()  # returns a tuple indicating line and character
        print("\tAt line: ", pos[0], " position ", pos[1])
        if len(attrs) > 0:
            print("\tAttributes:")
            for a in attrs:
                print("\t", a[0], "=", a[1])

    # function to handle the ending tag
    def handle_endtag(self, tag):
        print("Encountered an end tag:", tag)

    # function to handle character and text data (tag contents)
    def handle_data(self, data):
        if data.isspace():
            return
        print("Encountered some text data:", data)

    # function to handle the processing of HTML comments
    def handle_comment(self, data):
        print("Encountered comment:", data)


# create an instance of the parser
parser = MyHTMLParser()

# open the sample HTML file and read it; the with block closes the file afterwards
with open("samplehtml.html") as f:
    contents = f.read()  # read the entire file
    parser.feed(contents)

print(f"{metacount} meta tags encountered")
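The parser above keeps its meta tag count in a global variable and reads from samplehtml.html, which is not shown here. As a possible variation, the sketch below keeps the count on the parser instance and feeds it an in-memory string instead. The class name MetaCounter and the SAMPLE_HTML text are made up for illustration and are not the contents of the original sample file.

```python
from html.parser import HTMLParser


class MetaCounter(HTMLParser):
    """Counts <meta> start tags, keeping the tally on the instance."""

    def __init__(self):
        super().__init__()
        self.meta_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == "meta":
            self.meta_count += 1


# hypothetical stand-in for the contents of samplehtml.html
SAMPLE_HTML = """<html>
<head>
<meta charset="utf-8">
<meta name="description" content="demo page">
<title>Demo</title>
</head>
<body><p>Hello</p></body>
</html>"""

counter = MetaCounter()
counter.feed(SAMPLE_HTML)
counter.close()  # flush any buffered input
print(f"{counter.meta_count} meta tags encountered")
```

Storing the count on the instance means several parsers can run independently without sharing state.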
Using JSON
JSON stands for JavaScript Object Notation, a lightweight data-interchange format. json is a module in the Python standard library that provides functions for working with JSON data.
Example:
# working with JSON data
import urllib.request
import json

# use urllib to retrieve some sample JSON data
req = urllib.request.urlopen("http://httpbin.org/json")
data = req.read().decode('utf-8')
print(data)

# use the JSON module to parse the returned data
obj = json.loads(data)

# when the data is parsed, we can access it like any other object
print(obj["slideshow"]["author"])
for slide in obj["slideshow"]["slides"]:
    print(slide["title"])

# python objects can also be written out as JSON
objdata = {
    "name": "Joe Marini",
    "author": True,
    "titles": [
        "Learning Python",
        "Advanced Python",
        "Python Standard Library Essential Training"
    ]
}

with open("jsonoutput.json", "w") as fp:
    json.dump(objdata, fp, indent=4)
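The example above uses json.loads() on a string and json.dump() on a file. The string-based counterpart for writing, json.dumps(), can be handy for checking a round trip without touching the network or the disk. The sketch below assumes a shortened version of the dictionary above; the variable names record, text, and restored are illustrative.

```python
import json

# a shortened version of the dictionary from the example above
record = {
    "name": "Joe Marini",
    "author": True,
    "titles": ["Learning Python", "Advanced Python"],
}

# serialize to a JSON string; Python's True is written as JSON's true
text = json.dumps(record, indent=4)
print(text)

# parse the string back into Python objects
restored = json.loads(text)
print(restored == record)
```

The functions ending in "s" (dumps, loads) work with strings, while dump and load work with file objects.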