Python Unit 3

Object-oriented programming

Python supports, amongst others, the object-oriented programming paradigm. While hundreds of books have been written on this paradigm, it is actually quite easy to grasp: Think of a program as a collection of objects that communicate with each other. Each object has

  • attributes describing its state, and
  • methods that allow to change these attributes.

Before we can create an object, we first have to define a “blueprint” (class) for a certain type of object. Then, we may create one or more objects from this class via a process known as instantiation. Each of these objects will have the same set of attributes and methods, but the attributes will typically differ in their values.

Since all those concepts may sound quite abstract at first, here’s a comprehensive example. We will create a class that represents proteins. Our implementation remains simple, but you will have the opportunity to add functionality in the exercises.

All statements of this section (i.e., the class definitions) should be stored in simple_protein.py.

Attributes and the init method

A protein is characterized by its name and amino acid sequence. Hence, the protein class needs at least two attributes:

class Protein:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

The code above defines a new class:

  • keyword class
  • class name (here Protein)
  • colon
  • indented statements (suite, class body)

We also define an init method, which must be named __init__() and is called by Python whenever a new object is instantiated from the class. In the body of the init method, we initialize the instance variables self.name and self.sequence. These variables store the name and sequence of an object and thereby completely describe its state.

We create a new object ins_A of class Protein. To this end, we call the class name like a function and supply the arguments required by the init method. Once the object ins_A has been created, we may access its attributes ([object name].[attribute name]).

ins_A = Protein(name="insulin A chain", sequence="GIVEQCCTSICSLYQLENYCN")
print(ins_A.name)
print(ins_A.sequence)
print(type(ins_A))

The built-in function type() tells us the class from which ins_A has been derived.

Important

You may wonder what happened to the first parameter (self) of __init__(). Why did we only specify two arguments (which were obviously assigned to the parameters name und sequence) when we instantiated the insulin object? Each instance method must be defined with at least one parameter (commonly called self), which refers to the current instance. However, we must not pass an argument to this parameter when invoking the instance method – Python supplies the appropriate value “automatically”.

Methods

Let’s add a method to our class. (Simply insert the definition of mutate() into your file simple_protein.py. Mind the correct indentation level!)

class Protein:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    # only add the following lines to your file
    def mutate(self, pos, residue):
        self.sequence = self.sequence[:pos] + residue + self.sequence[pos+1:]

The method mutate() creates a point mutation by replacing the amino acid at position pos by residue.

ins_A = Protein("insulin A chain", "GIVEQCCTSICSLYQLENYCN")
print(ins_A.sequence)

ins_A.mutate(2, "W")
print(ins_A.sequence)

mutate() works as expected: It changed the third residue (i.e., valine with index 2) to tryptophane.

In the following sections, we will learn about classes that are available from the standard library.

Regular expressions

Introduction

The re module (from the standard library) implements classes and functions for handling regular expressions. As you might remember from the bash course, a regular expression (“regex”) is a string that defines a search pattern and thereby describes sets of other strings.

The re module is quite comprehensive. Here, we only present some of its functions; for further information, please consult the documentation and the article Regular Expression HOWTO.

findall() searches for all non-overlapping occurrences of a regular expression (first argument) in a string (second argument) and returns all matches as list.

from re import findall

findall(r".ython", "Python or rather Jython?")

It is recommended to define the regular expression by using a raw string (i.e., a string that includes a r immediately left of the opening quote). Within a raw string, the backslash is not interpreted as escape character, but as an actual backslash. You will see below why this approach is advantageous.

Syntax

Python’s syntax for regular expressions is similar to the bash syntax:

  • The dot . denotes an arbitrary character (see example above).

  • Character classes are assembled by surrounding a list of characters in brackets.

    findall(r"H[au]nd", "Hand Hund Hend")

    You may also specify a range of characters …

    findall(r"H[a-t]nne", "Hanne Henne Hunne")  # characters a to t

    … or exclude certain characters.

    findall(r"H[^e]nne", "Hanne Henne Hunne")   # all characters except e

    Several predefined character classes are available, such as \d (all decimal numbers) and \w (all alphanumeric characters, including the underscore). If you want to use these classes, raw strings are particularly useful:

    findall("\\d\\w", "this is 1a")  # many backslashes, right?
    findall(r"\d\w", "this is 1a")   # better!
  • Quantifiers following a character indicate how often this character may appear.

    target = "bt bat baat baaat baaaat baaaaaaaaat"
    findall(r"ba*t", target)     # zero or more times
    findall(r"ba+t", target)     # one or more times
    findall(r"ba?t", target)     # zero or once
    findall(r"ba{3}t", target)   # exactly three times
    findall(r"ba{2,}t", target)  # at least two times
    findall(r"ba{,3}t", target)  # up to three times
    findall(r"ba{2,4}t", target) # two to four times
  • Alternatives search for one of two different strings.

    findall(r"one|two", "one two three")
  • Anchors restrict matches to the start or end of a string:

    findall(r"^a..", "auf dem Haus")  # anchor at the start
    findall(r"a..$", "auf dem Haus")  # anchor at the end
  • Groups are created by parentheses surrounding one or more characters. Groups may be modified by a quantifier and allow to extract parts of a match:

    findall(r"animal: (\w+) (\w+)", "animal: Mus musculus (mouse)")
Example

In the code below, we extract all codes that start with the character 2 from a string. The resulting list contains three elements, each of which is a tuple with two strings representing the first and second matched group, respectively. Of course, we can interate over the list of matches.

codes = "1a3, 1b4, 2c5, 2d6, 2e7, 3f8, 3g9"
codes_starting_with_two = findall(r"(2.)(.)", codes)
codes_starting_with_two
#> [('2c', '5'), ('2d', '6'), ('2e', '7')]

for code in codes_starting_with_two:
    print(code[1])
#> 5
#> 6
#> 7

Pattern objects

Functions like findall() are useful for simple searches. For more complex cases, we prefer to create a pattern object from a regular expression and then use its methods.

from re import compile

p = compile(r"animal: (\w+) (\w+)")
type(p)
print(p.pattern)  # attribute: regex used for searching
print(p.groups)   # attribute: number of groups

The pattern object has a method search() that searches for matches in a string and returns a match object upon success.

m = p.search("animal: Mus musculus (mouse)")
type(m)

Let’s have a closer look at the match object:

m.groups()  # tuple with found groups
m[0]        # whole match
m[1]        # first group
m.span()    # start and end index of the whole match
m.start(2)  # start index of the second group
m.end(1)    # end index of the first group

Reading and writing files

How to read individual characters

By reading user input via input() or from command line arguments (sys.argv) and by displaying output on the console (print()), a program may interact with the user. However, as soon as a program has to read large datasets, it it will become infeasible to enter those data manually. Thus, Python is able to read files existing files and to create new files and write contents to them.

Please copy the file codon_table.csv from the folder /resources/python into the folder python3 in your home directory and have a look at its first five lines:

AAA,Lys,K,Lysine
AAC,Asn,N,Asparagine
AAG,Lys,K,Lysine
AAT,Asn,N,Asparagine
ACA,Thr,T,Threonine

This file apparently represents a codon table, where an amino acid is specified for each base triplet. We open the file by using the built-in function open(), which returns a file object:

f = open("codon_table.csv", "r")
f

The second argument of open() sets the file access mode – here, we only want to read the file. Python treats an opened file as a data stream, i.e., a (one-dimensional) sequence of data (in our case, characters) that can be accessed randomly.

Python may read from a file in the following manner: Immediately after a file has been opened, the current stream position points to the first character (index 0), as the method tell(), well, tells us:

f.tell()

We read the first 30 characters via read() and store them in the variable s:

s = f.read(30)
s

These 30 characters comprise the first line and part of the second line. After the reading operation, the stream position lies at the 31st character (index 30):

f.tell()

We may, for example, go back to index 15 and read ten characters, or jump forward to index 100 and read two characters:

f.seek(15)
f.read(10)
f.tell()

f.seek(100)
f.read(2)
f.tell()

How to read a file line by line

In many cases, we will not need random access to the file. Since text files generally are composed of lines, they are usually read line by line (method readline()):

f.seek(0)
l1 = f.readline()
l1

l2 = f.readline()
l2
f.tell()

The strings l1 and l2 now contain the first and second line of the file, respectively, including the end of line character \n. The stream position points to the first character in the third line. Note: Unfortunately, the end of a line is marked by different characters depending on the operation system. While Linux uses \n (line feed), Windows uses \r\n (carriage return plus line feed), and older versions of operating systems on Mac even used \r. Thus, you may run into problems if you process a file that has been created in Windows on Linux.

Fortunately, we don’t have to repeatedly call readline() to eventually read all files from: A file object may be iterated line by line, which facilitates reading within a for loop:

f.seek(0)
for line in f:
  print(line)

Even more simply, we may save all lines in a list via readlines():

f.seek(0)
all_lines = f.readlines()
all_lines

If we don’t need a file any longer, we must close it:

f.close()

How to write to a file

In order to be able to write to a file, we have to open it with the respective access mode:

f = open("some_lines.txt", "w")

Now the method write() allows us to write strings to the file:

f.write("First line\n")
f.write("Second line\n")

If we wish to write a collection of strings (e.g., a list of strings), the method writelines() allows us to do so. Attention: Python will write the list elements one after each other to the file, without inserting a new line character. Thus, if the list elements represent individual lines, we have to add an explicit newline character to the list elements:

lines = ["three\n", "four\n", "five\n"]
f.writelines(lines)

Once we are done, we have to close the file. This is especially important after we have written data to the file, since Python may postpone writing operations. close() ensures that any pending operations are finished.

f.close()

Finally, have a look at the created file (cat).

Context objects

Context objects make file access even simpler. Instead of opening, reading/writing, and closing, a with block is used:

with open("codon_table.csv", "r") as f:
    all_lines = f.readlines()

for line in all_lines:
    print(line)

Create a context object by writing - keyword with - function that returns a context object (here: open()) - keyword as - name for the context object (here: f) - colon - indented statements (suite)

Within the suite, the context object is defined (in our example, the open file is available for reading). As soon as the control flow exits the suite, Python makes sure that the context object is deinitialized (in the case of files, any pending writing operations are executed, and the file is closed).

Exercises

Store all files that you generate for unit 3 in the folder python3 in your home directory.

Exercise 3.1 (3 P)

You have successfully solved this exercise as soon as you have worked through this unit. In particular, the folder python3 must contain the following files, which you have created in the course of this unit:

  • codon_table.csv
  • simple_protein.py
  • some_lines.txt

Exercise 3.2 (4 P)

Create a class Protein with the following properties:

  • The init method requires the arguments name, uniprot_id, and sequence, which are stored in attributes of the same name.
  • The method get_length returns the number of amino acids in the protein.
  • The method contains has one parameter peptide and returns True if the protein sequence contains the given peptide sequence.
  • The method get_mw returns the molecular weight of the protein. The method has an optional parameter disulfides that specifies the number of disulfide bridges.

The molecular weight of a protein is given by the following formula:

sum of the masses of the amino acid residues
+ mass of a water molecule
- 2 * mass of hydrogen * number of disulfide bridges

Please use the following masses:

AA_MASS = dict(
    G=57.05132, A=71.0779, S=87.0773, P=97.11518, V=99.13106,
    T=101.10388, C=103.1429, L=113.15764, I=113.15764, N=114.10264,
    D=115.0874, Q=128.12922, K=128.17228, E=129.11398, M=131.19606,
    H=137.13928, F=147.17386, R=156.18568, Y=163.17326, W=186.2099
)

WATER_MASS = 18.01528
H_MASS = 1.00784

Store the class in exercise_3_2.py.

You may check that your program works correctly by using the following exemplary calls:

galanin = Protein("Galanin", "P22466", "GWTLNSAGYLLGPHAVGNHRSFSDKNGLTS")
print(type(galanin))
#> <class '__main__.Protein'>

print(galanin.get_length())
#> 30

print(galanin.contains("CGSHLV"))
#> False

print(galanin.get_mw())
#> 3157.4102999999996


insulin_B = Protein("Insulin B chain", "P01308", "FVNQHLCGSHLVEALYLVCGERGFFYTPKT")
print(insulin_B.get_length())
#> 30

print(insulin_B.contains("CGSHLV"))
#> True

print(insulin_B.get_mw(disulfides=1))
#> 3427.9056799999994
  • You may build upon the Protein class we defined above. The init method has to define three attributes.
  • You have also learned how to implement methods. Think about the number of parameters, required vs optional, and the return value – these decisions are the same as for function design.
  • In get_mw(), you likely will need a counter variable and have to access dict values.

Exercise 3.3 (3 P)

Implement a function read_masses which reads a table of atomic weights.

  • The function has a single parameter, which receives the name of the file containing the table.

  • A table in CSV format is provided (/resources/python/average_mass.csv), whose first five lines read

    H,1.008
    He,4.0026
    Li,6.94
    Be,9.0122
    B,10.81

    Each line contains two records separated by a comma: The element symbol and the average atomic mass. Therefore, you may process each line with a regular expression containing two groups.

  • The function should return a dict, where element symbols and masses are the key-value pairs.

Store the function in exercise_3_3.py.

You may check that your program works correctly by using the following exemplary calls:

m = read_masses("average_mass.csv")
print(m["N"])
#> 14.007

print(m["S"])
#> 32.06

print(2 * m["H"] + m["O"])
#> 18.015
  • You will have to open the file and read its contents.
  • The regex for parsing each line will contain two groups that capture the element and weight, respectively.
  • Combine your knowledge on counter variables and dicts.

Exercise 3.4 (3 P)

Implement a function calculate_mass, which calculates the mass of a chemical compound. The function is called with a string that contains the molecular formula (e.g., "C6 H12 O6" or "C Cl4") and returns the mass as a float.

Hints:

  • Extract individual element symbols and their counts via a regular expression containing two groups. The first group should match one uppercase letter followed by an optional lowercase letter. The second group should match any number of digits.

Please use the following masses:

MASSES = dict(
    H=1.008, He=4.0026, Li=6.94, Be=9.0122, B=10.81,
    C=12.011, N=14.007, O=15.999, F=18.998, Ne=20.18,
    Na=22.99, Mg=24.305, Al=26.982, Si=28.085, P=30.974,
    S=32.06, Cl=35.45, Ar=39.948, K=39.098, Ca=40.078
)

Store the function in exercise_3_4.py.

You may check that your program works correctly by using the following exemplary calls:

print(calculate_mass("C6 H12 O6"))
#> 180.156

print(calculate_mass("H2 O"))
#> 18.015

print(calculate_mass("C34 H46 Cl N3 O10"))
#> 692.203
  • For designing the regex to extract individual elements and their counts, think about how they are written: Each element starts with one uppercase letter, followed by one optional lowercase letter. The count has an arbitrary number of digits (i.e., it may also be missing, which indicates a count of 1). You will have to open the file and read its contents.
  • You will have to iterate over findall() results.