Processing Text

This section discusses the string processing functions in Python.

Python text processing

Python text processing can be used for:

  1. Searching strings for specific information

  2. Validating patterns of characters, digits, etc.

  3. Transforming content between different formats

  4. Manipulating and extracting string content

  5. Preparing text for display to the user

Basic string operations

Import the string module:

import string

The string module provides common string utilities, including a set of predefined string constants:

string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

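In Visual Studio Code, press Ctrl + Shift + P to open the Command Palette, type Python, and select Run Python File in Terminal to run the code in the terminal.

If you print the constants above, you can see their values (string.printable also ends with the whitespace characters, and string.whitespace consists only of them, so they are not visible below):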

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
0123456789abcdefABCDEF
01234567
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Extract the ASCII letters from a string:

s = "The quick brown fox jumps over the lazy dog on the 1st of December"
result = "".join(c for c in s if c in string.ascii_letters)
print(result)

The output is:

ThequickbrownfoxjumpsoverthelazydogonthestofDecember

This keeps only the ASCII letters; digits, spaces, and all other characters are dropped.

Check whether a string contains only letters and digits, only letters, or only numeric characters, respectively:
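The same membership test works with the other constants. The following is a minimal sketch, not the only way to do this; the sample string is a hypothetical value chosen for illustration:

```python
import string

s = "The quick brown fox jumps over the lazy dog on the 1st of December"

# Keep only the ASCII digits found in s
digits_only = "".join(c for c in s if c in string.digits)
print(digits_only)

# Drop ASCII punctuation from a hypothetical sample string
sample = "Hello, world! (test)"
no_punct = "".join(c for c in sample if c not in string.punctuation)
print(no_punct)
```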

result = s.isalnum()
result = s.isalpha()
result = s.isnumeric()

Each call returns a bool. For example, if the first result is True, the string contains only letters and digits; if it is False, the string contains at least one other character. For the s defined above, all three calls return False because s contains spaces.

You can also use the all function to check, character by character, whether a string contains only letters and digits:

result = all(c.isalnum() for c in s)
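To see how the three checks differ, it can help to run them on a few inputs. This is a small sketch; the sample strings are assumptions chosen for illustration:

```python
# Hypothetical sample strings covering the common cases
samples = ["abc123", "abc", "123", "abc 123", ""]

# Map each sample to (isalnum, isalpha, isnumeric)
checks = {
    text: (text.isalnum(), text.isalpha(), text.isnumeric())
    for text in samples
}

for text, flags in checks.items():
    print(repr(text), flags)
```

Note that all three methods return False for an empty string and for a string that contains a space.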

Searching strings

The startswith and endswith methods check whether a string starts or ends with the specified substring, returning True if it does and False otherwise. For example:

s = "The quick brown fox jumps over the lazy dog"
result = s.startswith("The")
print(result)
result = s.endswith("on")
print(result)

The two outputs are: True and False.

The find and rfind methods check whether a string contains the specified substring. find searches from left to right and returns the index of the first match; rfind searches from right to left and returns the index of the last match. Both return -1 if there is no match. For example:

s = "The quick brown fox jumps over the lazy dog"
result = s.find("the")
print(result)
result = s.rfind("the")
print(result)

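The outputs are: 31 and 31. Both calls find the same match because the search is case-sensitive: the leading The does not count, so the lowercase the occurs only once.

Check whether a string contains a specified substring: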
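The difference between find and rfind only shows up when the substring occurs more than once. A short sketch, using a hypothetical sample string, plus one common way to search without regard to case (lowercasing first):

```python
text = "the cat and the hat"  # hypothetical sample with two matches

first = text.find("the")     # index of the leftmost match
last = text.rfind("the")     # index of the rightmost match
missing = text.find("dog")   # -1 because there is no match
print(first, last, missing)

# Case-insensitive search: lowercase the string before searching
s = "The quick brown fox jumps over the lazy dog"
first_any_case = s.lower().find("the")
print(first_any_case)
```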

s = "The quick brown fox jumps over the lazy dog"
result = "the" in s
print(result)

The output is: True.

The replace method replaces occurrences of the specified substring in a string, for example:

s = "The quick brown fox jumps over the lazy dog"
result = s.replace("the", "a")
print(result)

The output is: The quick brown fox jumps over a lazy dog (the leading The is unchanged because the match is case-sensitive).

The count method counts the non-overlapping occurrences of the specified substring in a string, for example:
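Two variations are sketched here under stated assumptions: limiting the number of replacements with the optional third argument of replace, and replacing regardless of case with re.sub from the standard re module (a different tool from str.replace, shown only as one possible approach):

```python
import re

s = "The quick brown fox jumps over the lazy dog"

# Replace every "the", ignoring case, using a regular expression
any_case = re.sub("the", "a", s, flags=re.IGNORECASE)
print(any_case)

# Replace at most two occurrences (hypothetical sample string)
limited = "one one one".replace("one", "two", 2)
print(limited)
```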

s = "The quick brown fox jumps over the lazy dog"
result = s.count("the")
print(result)

The output is: 1.
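As with find, count is case-sensitive. A minimal sketch of counting without regard to case, and of what non-overlapping means (the second sample string is hypothetical):

```python
s = "The quick brown fox jumps over the lazy dog"

# Lowercase first so that "The" and "the" are both counted
count_any_case = s.lower().count("the")
print(count_any_case)

# Matches do not overlap: "aaaa" contains "aa" twice, not three times
non_overlapping = "aaaa".count("aa")
print(non_overlapping)
```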

String manipulation

The upper and lower methods convert all letters in a string to uppercase and to lowercase, respectively, for example:

s = "The quick brown fox jumps over the lazy dog"
result = s.upper()
print(result)
result = s.lower()
print(result)

The outputs are: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG and the quick brown fox jumps over the lazy dog.

The split and join methods split and merge strings. By default, split separates on any run of whitespace. The string that join is called on is used as the separator. For example:

s = "The quick brown fox jumps over the lazy dog"
result = s.split()
print(result)
result = " ".join(s.split())
print(result)

The outputs are: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] and The quick brown fox jumps over the lazy dog

The ljust, rjust and center methods left-align, right-align or center a string within a given width, padding it with a fill character, for example:
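split also accepts an explicit separator and a maximum number of splits. The sketch below uses hypothetical sample strings to show both, along with join using a non-space separator:

```python
# Explicit separator: empty fields are preserved
record = "name,age,,city"
fields = record.split(",")
print(fields)

# Limit the number of splits with the second argument
key, value = "key=value=more".split("=", 1)
print(key, value)

# Default split collapses runs of whitespace and trims the ends
collapsed = "  a   b  ".split()
print(collapsed)

# Join with a different separator
joined = "-".join(fields)
print(joined)
```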

names = ["John", "Paul", "George", "Ringo"]
biggest_name = max(len(name) for name in names)
for name in names:
    print(name.ljust(biggest_name+2, "-") + ":")
for name in names:
    print(name.rjust(biggest_name+2, "-") + ":")
for name in names:
    print(name.center(biggest_name+2, "-") + ":")

The output is:

John----:
Paul----:
George--:
Ringo---:
----John:
----Paul:
--George:
---Ringo:
--John--:
--Paul--:
-George-:
-Ringo--:
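A typical use of these methods is lining up columns of text. The following is a small sketch with made-up rows (the product names and prices are assumptions for illustration); when no fill character is given, spaces are used:

```python
# Hypothetical (name, price) rows
rows = [("coffee", 4.99), ("tea", 3.5), ("hot chocolate", 12.0)]

# Width of the widest name, so the first column lines up
width = max(len(name) for name, _ in rows)

# Left-align names, right-align prices in a 6-character column
lines = [
    name.ljust(width) + " " + format(price, ".2f").rjust(6)
    for name, price in rows
]
for line in lines:
    print(line)
```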

We can use a translation table to replace characters in a string, for example:

s = "The quick brown fox jumps over the lazy dog"
result = s.translate(str.maketrans("aeiou", "12345"))
print(result)

The output is: Th2 q53ck br4wn f4x j5mps 4v2r th2 l1zy d4g
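str.maketrans has two other forms that may be useful: a third argument listing characters to delete, and a single dictionary argument mapping characters to replacements. A brief sketch (the second sample string and its mapping are hypothetical):

```python
s = "The quick brown fox jumps over the lazy dog"

# Third argument: characters to delete from the string
no_vowels = s.translate(str.maketrans("", "", "aeiou"))
print(no_vowels)

# Dictionary form: map single characters to replacement strings
table = str.maketrans({"o": "0", " ": "_"})
mapped = "lazy dog".translate(table)
print(mapped)
```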

String formatting

We can use template strings to format a string. First, import the Template class from the string module:

from string import Template

Use the Template class to create a template, for example:

t = Template("$x, glorious $x!")

Then use the substitute method to replace the variables in the template, for example:

result = t.substitute(x="world")
print(result)

The output is: world, glorious world!

You can also pass a dictionary to replace the variables in the template, for example:

args = {"x": "world"}
result = t.substitute(args)
print(result)

The output is: world, glorious world!, the same as above.
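If a variable in the template has no value, substitute raises a KeyError, while safe_substitute leaves the placeholder in place. A minimal sketch with a hypothetical two-variable template:

```python
from string import Template

t = Template("$greeting, $name!")  # hypothetical template

# safe_substitute keeps placeholders that have no value
partial = t.safe_substitute(greeting="Hello")
print(partial)

# substitute raises KeyError naming the missing variable
missing = None
try:
    t.substitute(greeting="Hello")
except KeyError as exc:
    missing = exc.args[0]
print(missing)
```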

The format method can also be used to format strings, for example:

foo = "foo"
bar = 123
result = "The {} number is {}".format(foo, bar)
print(result)
result = "The {1} number is {0}".format(foo, bar)
print(result)
result = "The {var1} number is {var2}".format(var1=foo, var2=bar)
print(result)
result = "The {var2:x} number is {var2:X}".format(**{"var1": foo, "var2": bar})
print(result)

The outputs are: The foo number is 123, The 123 number is foo, The foo number is 123 and The 7b number is 7B. The last call formats the number 123 in lowercase and uppercase hexadecimal.
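The part after the colon in a replacement field is a format specification, which also controls width, alignment and number formatting. A few illustrative cases, with sample values chosen as assumptions:

```python
# Right-align in a field of width 8
padded = "{:>8}".format("foo")
print(repr(padded))

# Center in a field of width 9
centered = "{:^9}".format("foo")
print(repr(centered))

# Thousands separator and two decimal places
money = "{:,.2f}".format(1234567.891)
print(money)

# Binary, zero-padded to 8 digits
binary = "{:08b}".format(123)
print(binary)
```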

String interpolation

This feature requires Python 3.6 or later. Import the datetime module (needed later to format a date):

import datetime

Format a string using an f-string, for example:

now = datetime.datetime.now()
product = "coffee"
price = 4.99
tax = 0.07
result = f"Today's date is {now.strftime('%Y-%m-%d')} and the product is {product} and the price is {price} with tax {price * tax:.2f}"
print(result)

The output looks like this (the date depends on when the code is run): Today's date is 2021-11-17 and the product is coffee and the price is 4.99 with tax 0.35.

For more string operations, refer to the Python documentation.
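f-strings accept the same format specifications as the format method, and date objects accept strftime codes directly after the colon. The sketch below uses a fixed date so that the result is reproducible; the column widths are arbitrary choices for illustration:

```python
import datetime

fixed = datetime.date(2021, 11, 17)  # fixed date, for a stable result
product = "coffee"
price = 4.99
tax = 0.07
total = price * (1 + tax)

# Date via strftime codes, left-aligned name, right-aligned total
line = f"{fixed:%Y-%m-%d} | {product:<10} | {total:>8.2f}"
print(line)
```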
