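quote() replaces special characters in a string with their percent-encoded (URL-encoded) equivalents. quote_plus() does the same, but encodes spaces as + instead of %20, which is the convention for query strings. For example,
import urllib.parse
sample_string = "Hello El Niño!"
print(urllib.parse.quote(sample_string))
print(urllib.parse.quote_plus(sample_string))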
The output is:
Hello%20El%20Ni%C3%B1o%21
Hello+El+Ni%C3%B1o%21
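As a small sketch of how the encoding above can be tuned and reversed: quote() accepts a safe argument naming characters to leave unescaped (it defaults to "/"), and unquote() and unquote_plus() decode the results. The path string here is a made-up value for illustration.

```python
from urllib.parse import quote, unquote, unquote_plus

# Hypothetical path used only for illustration
path = "docs/El Niño"

# By default quote() leaves "/" unescaped; safe="" escapes it as well.
print(quote(path))           # docs/El%20Ni%C3%B1o
print(quote(path, safe=""))  # docs%2FEl%20Ni%C3%B1o

# unquote() reverses quote(); unquote_plus() also turns "+" back into a space.
decoded = unquote("Hello%20El%20Ni%C3%B1o%21")
decoded_plus = unquote_plus("Hello+El+Ni%C3%B1o%21")
print(decoded)       # Hello El Niño!
print(decoded_plus)  # Hello El Niño!
```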
urlencode() converts a dictionary to a URL query string. For example,
sample_dict = {"name": "El Niño", "age": "12"}
print(urllib.parse.urlencode(sample_dict))
The output is:
name=El+Ni%C3%B1o&age=12
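The reverse direction may also be useful. The sketch below, which uses a made-up params dictionary, shows urlencode() with doseq=True so that a list value becomes repeated keys, and parse_qs() turning the query string back into a dictionary.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical parameters; "tags" holds more than one value
params = {"name": "El Niño", "tags": ["weather", "ocean"]}

# doseq=True emits one key=value pair per list element
query = urlencode(params, doseq=True)
print(query)  # name=El+Ni%C3%B1o&tags=weather&tags=ocean

# parse_qs() decodes the string; every value comes back as a list
parsed = parse_qs(query)
print(parsed)  # {'name': ['El Niño'], 'tags': ['weather', 'ocean']}
```

Note that parse_qs() always returns lists, even for keys that appear only once.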
Retrieving internet data
urllib.request provides functions and classes for opening URLs. For example, we can use it to open a URL, read the response headers and body, and send data to a URL.
Example of creating a request to retrieve data using urllib.request.urlopen():
import urllib.request
sample_url = "http://httpbin.org/xml"
response = urllib.request.urlopen(sample_url)
status_code = response.status  # check the status code
print(status_code)
# if no error, then read the response content
# (urlopen() itself raises urllib.error.HTTPError for 4xx and 5xx responses)
if 200 <= status_code < 300:
    print(response.getheaders())  # get all headers as (name, value) pairs
    print(response.getheader("Content-Length"))  # get a single header value
    print(response.getheader("Content-Type"))  # get the content type
    # read the data from the URL
    data = response.read().decode("utf-8")
    print(data)
The printed output looks like this (the Date header will differ):
200
[('Date', 'Sat, 22 Jan 2022 20:31:43 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
522
application/xml
<?xml version='1.0' encoding='us-ascii'?>
<!-- A SAMPLE set of slides -->
<slideshow
title="Sample Slide Show"
date="Date of publication"
author="Yours Truly"
>
<!-- TITLE SLIDE -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
<!-- OVERVIEW -->
<slide type="all">
<title>Overview</title>
<item>Why <em>WonderWidgets</em> are great</item>
<item/>
<item>Who <em>buys</em> WonderWidgets</item>
</slide>
</slideshow>
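The section above mentions sending data to a URL but only shows a read. A minimal sketch of preparing a POST with urllib.request.Request follows; the endpoint http://httpbin.org/post and the form fields are assumptions chosen for illustration. The request is only built and inspected here, not sent, so the snippet runs without a network connection.

```python
import urllib.parse
import urllib.request

# Assumed endpoint and form fields, for illustration only
post_url = "http://httpbin.org/post"
form = {"name": "El Niño", "age": "12"}

# The request body must be bytes, so encode the query string
body = urllib.parse.urlencode(form).encode("ascii")

# Supplying data (and method="POST") makes this a POST request
request = urllib.request.Request(post_url, data=body, method="POST")

print(request.get_method())  # POST
print(request.full_url)      # http://httpbin.org/post
print(request.data)          # b'name=El+Ni%C3%B1o&age=12'
```

Passing request to urllib.request.urlopen() would send it and return a response object like the one in the earlier example.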
Parsing HTML
We can work with HTML data via the HTMLParser class in the html.parser module. For example,
# working with HTML data via the HTMLParser
from html.parser import HTMLParser
metacount = 0
# define a class that will handle various parts of an HTML file
# create a subclass of HTMLParser and override the handler methods
class MyHTMLParser(HTMLParser):
    # function to handle an opening tag
    def handle_starttag(self, tag, attrs):
        global metacount
        if tag == "meta":
            metacount += 1
        print("Encountered a start tag:", tag)
        pos = self.getpos()  # returns a tuple indicating line and character
        print("\tAt line: ", pos[0], " position ", pos[1])
        if len(attrs) > 0:
            print("\tAttributes:")
            for a in attrs:
                print("\t", a[0], "=", a[1])
    # function to handle the ending tag
    def handle_endtag(self, tag):
        print("Encountered an end tag:", tag)
    # function to handle character and text data (tag contents)
    def handle_data(self, data):
        if data.isspace():
            return
        print("Encountered some text data:", data)
    # function to handle the processing of HTML comments
    def handle_comment(self, data):
        print("Encountered comment:", data)
# create an instance of the parser
parser = MyHTMLParser()
# open the sample HTML file, read it, and close it when done
with open("samplehtml.html") as f:
    contents = f.read()  # read the entire file
    parser.feed(contents)
print(f"{metacount} meta tags encountered")
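The example above keeps its count in a global variable and reads from a file. As an alternative sketch, the parser can store its results on the instance and be fed a string directly. The LinkCollector class and the inline sample_html document are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags and counts <meta> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_count += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical inline document standing in for samplehtml.html
sample_html = """<html><head>
<meta charset="utf-8">
<meta name="description" content="demo">
</head><body>
<a href="https://www.python.org/">Python</a>
<a name="top">no href here</a>
</body></html>"""

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.meta_count)  # 2
print(collector.links)       # ['https://www.python.org/']
```

Keeping state on the instance means several parsers can run side by side without sharing a counter.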
Using JSON
JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format. json is a module in the Python standard library that provides functions for encoding and decoding JSON data.
Example:
# working with JSON data
import urllib.request
import json
# use urllib to retrieve some sample JSON data
req = urllib.request.urlopen("http://httpbin.org/json")
data = req.read().decode('utf-8')
print(data)
# use the JSON module to parse the returned data
obj = json.loads(data)
# when the data is parsed, we can access it like any other object
print(obj["slideshow"]["author"])
for slide in obj["slideshow"]["slides"]:
    print(slide["title"])
# python objects can also be written out as JSON
objdata = {
    "name": "Joe Marini",
    "author": True,
    "titles": [
        "Learning Python", "Advanced Python", "Python Standard Library Essential Training"
    ]
}
with open("jsonoutput.json", "w") as fp:
    json.dump(objdata, fp, indent=4)
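The same conversion also works in memory with json.dumps() and json.loads(), which is handy when no file or network is involved. The sketch below uses a shortened record with a made-up "rating" key to show how Python values map to JSON, and how malformed input is reported.

```python
import json

# Shortened sample record; the "rating" key is hypothetical
record = {
    "name": "Joe Marini",
    "author": True,
    "titles": ["Learning Python"],
    "rating": None,
}

# Serialize to a string: True becomes true and None becomes null
text = json.dumps(record, indent=4)
print(text)

# Parsing the string gives back an equal Python object
restored = json.loads(text)
print(restored == record)  # True

# Malformed input raises json.JSONDecodeError
try:
    json.loads("{'name': 'single quotes are not valid JSON'}")
    parse_failed = False
except json.JSONDecodeError as err:
    parse_failed = True
    print("Invalid JSON:", err.msg)
```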