Internet functionality

Working with URLs

urllib is a package in the Python standard library that provides modules for working with URLs, such as urllib.parse and urllib.request.

Example of parsing a URL with urllib.parse:

import urllib.parse
sample_url = "https://www.google.com/search?q=python"
parsed_url = urllib.parse.urlparse(sample_url)
print(parsed_url)
print(parsed_url.scheme)
print(parsed_url.hostname)
print(parsed_url.path)

The output looks like this:

ParseResult(scheme='https', netloc='www.google.com', path='/search', params='', query='q=python', fragment='')
https
www.google.com
/search
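
The query string can be split further into key/value pairs with parse_qs(). A minimal sketch (the extra hl parameter is added here just for illustration):

import urllib.parse
parsed_url = urllib.parse.urlparse("https://www.google.com/search?q=python&hl=en")
query_params = urllib.parse.parse_qs(parsed_url.query)
print(query_params)  # {'q': ['python'], 'hl': ['en']}

Each value is a list, because a query string may repeat the same key.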

quote() replaces special characters in a string with their percent-encoded equivalents; quote_plus() does the same but also encodes spaces as plus signs. For example,

sample_string = "Hello El Niño!"
print(urllib.parse.quote(sample_string))
print(urllib.parse.quote_plus(sample_string))

The output looks like this:

Hello%20El%20Ni%C3%B1o%21
Hello+El+Ni%C3%B1o%21
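
unquote() and unquote_plus() reverse the process, decoding percent-encoded strings back into plain text. A small sketch reusing the encoded output above:

import urllib.parse
print(urllib.parse.unquote("Hello%20El%20Ni%C3%B1o%21"))   # Hello El Niño!
print(urllib.parse.unquote_plus("Hello+El+Ni%C3%B1o%21"))  # Hello El Niño!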

urlencode() converts a dictionary to a URL query string. For example,

sample_dict = {"name": "El Niño", "age": "12"}
print(urllib.parse.urlencode(sample_dict))

The output looks like this:

name=El+Ni%C3%B1o&age=12
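
This pairs well with urlunparse(), which assembles a complete URL from its six components (scheme, netloc, path, params, query, fragment). A sketch using a placeholder host, www.example.com, which is not part of the original example:

import urllib.parse
query = urllib.parse.urlencode({"q": "El Niño", "page": "2"})
full_url = urllib.parse.urlunparse(("https", "www.example.com", "/search", "", query, ""))
print(full_url)  # https://www.example.com/search?q=El+Ni%C3%B1o&page=2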

Retrieving internet data

urllib.request provides a number of functions for retrieving data from the web. For example, we can use it to open a URL, read its contents, and send data to it.

Example of retrieving data with urllib.request.urlopen():

import urllib.request
sample_url = "http://httpbin.org/xml"
response = urllib.request.urlopen(sample_url)
status_code = response.status # check the status code
print(status_code)
# if no error, then read the response content
if status_code >= 200 and status_code < 300:
    print(response.getheaders()) # get the headers
    print(response.getheader("Content-length")) # get the header value
    print(response.getheader("Content-Type")) # get the content type
    # read the data from the URL
    data = response.read().decode("utf-8")
    print(data)

The output looks like this:

200
[('Date', 'Sat, 22 Jan 2022 20:31:43 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
522
application/xml
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>
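
urlopen() can also send data to a URL: passing a bytes object as the data argument turns the request into a POST. A sketch against httpbin's echo endpoint (the exact response body depends on the service):

import urllib.parse
import urllib.request

# encode the form fields, then send them in the request body
post_data = urllib.parse.urlencode({"name": "El Niño", "age": "12"}).encode("utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=post_data)
print(response.status)
print(response.read().decode("utf-8"))  # httpbin echoes the submitted form data back as JSON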

Parsing HTML

We can work with HTML data via the HTMLParser class in the html.parser module. For example,

# working with HTML data via the HTMLParser
from html.parser import HTMLParser

metacount = 0

# define a class that will handle various parts of an HTML file
# create a subclass of HTMLParser and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global metacount
        if tag == "meta":
            metacount += 1

        print("Encountered a start tag:", tag)
        pos = self.getpos()  # returns a tuple indicating the line number and character offset
        print("\tAt line: ", pos[0], " position ", pos[1])

        if len(attrs) > 0:
            print("\tAttributes:")
            for a in attrs:
                print("\t", a[0], "=", a[1])

    # function to handle the ending tag
    def handle_endtag(self, tag):
        print("Encountered an end tag:", tag)

    # function to handle character and text data (tag contents)
    def handle_data(self, data):
        if data.isspace():
            return
        print("Encountered some text data:", data)

    # function to handle the processing of HTML comments
    def handle_comment(self, data):
        print("Encountered comment:", data)

# create an instance of the parser
parser = MyHTMLParser()
# open the sample HTML file and feed its contents to the parser
with open("samplehtml.html", "r") as f:
    contents = f.read()  # read the entire file
    parser.feed(contents)

print(f"{metacount} meta tags encountered")

Using JSON

JSON is an acronym for JavaScript Object Notation, a lightweight data-interchange format. The json module in the Python standard library provides a number of functions for working with JSON data.

Example:

# working with JSON data
import urllib.request
import json

# use urllib to retrieve some sample JSON data
req = urllib.request.urlopen("http://httpbin.org/json")
data = req.read().decode('utf-8')
print(data)

# use the JSON module to parse the returned data
obj = json.loads(data)

# when the data is parsed, we can access it like any other object
print(obj["slideshow"]["author"])
for slide in obj["slideshow"]["slides"]:
    print(slide["title"])

# python objects can also be written out as JSON
objdata = {
    "name": "Joe Marini",
    "author": True,
    "titles": [
        "Learning Python", "Advanced Python", "Python Standard Library Essential Training"
    ]
}

with open("jsonoutput.json", "w") as fp:
    json.dump(objdata, fp, indent=4)
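
Going the other way, json.load() reads a JSON file back into Python objects and json.dumps() serializes an object to a string. A small follow-on, assuming jsonoutput.json was just written by the code above:

# read the file back in and re-serialize it to a pretty-printed string
with open("jsonoutput.json", "r") as fp:
    loaded = json.load(fp)

print(loaded["titles"][0])  # Learning Python
print(json.dumps(loaded, indent=2))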
