Python Unit 3
Object-oriented programming
Python supports, amongst others, the object-oriented programming paradigm. While hundreds of books have been written on this paradigm, it is actually quite easy to grasp: Think of a program as a collection of objects that communicate with each other. Each object has
- attributes describing its state, and
- methods that allow to change these attributes.
Before we can create an object, we first have to define a “blueprint” (class) for a certain type of object. Then, we may create one or more objects from this class via a process known as instantiation. Each of these objects will have the same set of attributes and methods, but the attributes will typically differ in their values.
Since all those concepts may sound quite abstract at first, here’s a comprehensive example. We will create a class that represents proteins. Our implementation remains simple, but you will have the opportunity to add functionality in the exercises.
All statements of this section (i.e., the class definitions) should be stored in simple_protein.py
.
Attributes and the init method
A protein is characterized by its name and amino acid sequence. Hence, the protein class needs at least two attributes:
class Protein:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence
The code above defines a new class:
- keyword
class
- class name (here
Protein
) - colon
- indented statements (suite, class body)
We also define an init method, which must be named __init__()
and is called by Python whenever a new object is instantiated from the class. In the body of the init method, we initialize the instance variables self.name
and self.sequence
. These variables store the name and sequence of an object and thereby completely describe its state.
We create a new object ins_A
of class Protein
. To this end, we call the class name like a function and supply the arguments required by the init method. Once the object ins_A
has been created, we may access its attributes ([object name].[attribute name]
).
= Protein(name="insulin A chain", sequence="GIVEQCCTSICSLYQLENYCN")
ins_A print(ins_A.name)
print(ins_A.sequence)
print(type(ins_A))
The built-in function type()
tells us the class from which ins_A
has been derived.
You may wonder what happened to the first parameter (self
) of __init__()
. Why did we only specify two arguments (which were obviously assigned to the parameters name
und sequence
) when we instantiated the insulin object? Each instance method must be defined with at least one parameter (commonly called self
), which refers to the current instance. However, we must not pass an argument to this parameter when invoking the instance method – Python supplies the appropriate value “automatically”.
Methods
Let’s add a method to our class. (Simply insert the definition of mutate()
into your file simple_protein.py
. Mind the correct indentation level!)
class Protein:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence
# only add the following lines to your file
def mutate(self, pos, residue):
self.sequence = self.sequence[:pos] + residue + self.sequence[pos+1:]
The method mutate()
creates a point mutation by replacing the amino acid at position pos
by residue
.
= Protein("insulin A chain", "GIVEQCCTSICSLYQLENYCN")
ins_A print(ins_A.sequence)
2, "W")
ins_A.mutate(print(ins_A.sequence)
mutate()
works as expected: It changed the third residue (i.e., valine with index 2) to tryptophane.
In the following sections, we will learn about classes that are available from the standard library.
Regular expressions
Introduction
The re
module (from the standard library) implements classes and functions for handling regular expressions. As you might remember from the bash course, a regular expression (“regex”) is a string that defines a search pattern and thereby describes sets of other strings.
The re
module is quite comprehensive. Here, we only present some of its functions; for further information, please consult the documentation and the article Regular Expression HOWTO.
findall()
searches for all non-overlapping occurrences of a regular expression (first argument) in a string (second argument) and returns all matches as list.
from re import findall
r".ython", "Python or rather Jython?") findall(
It is recommended to define the regular expression by using a raw string (i.e., a string that includes a r
immediately left of the opening quote). Within a raw string, the backslash is not interpreted as escape character, but as an actual backslash. You will see below why this approach is advantageous.
Syntax
Python’s syntax for regular expressions is similar to the bash syntax:
The dot
.
denotes an arbitrary character (see example above).Character classes are assembled by surrounding a list of characters in brackets.
r"H[au]nd", "Hand Hund Hend") findall(
You may also specify a range of characters …
r"H[a-t]nne", "Hanne Henne Hunne") # characters a to t findall(
… or exclude certain characters.
r"H[^e]nne", "Hanne Henne Hunne") # all characters except e findall(
Several predefined character classes are available, such as
\d
(all decimal numbers) and\w
(all alphanumeric characters, including the underscore). If you want to use these classes, raw strings are particularly useful:"\\d\\w", "this is 1a") # many backslashes, right? findall(r"\d\w", "this is 1a") # better! findall(
Quantifiers following a character indicate how often this character may appear.
= "bt bat baat baaat baaaat baaaaaaaaat" target r"ba*t", target) # zero or more times findall(r"ba+t", target) # one or more times findall(r"ba?t", target) # zero or once findall(r"ba{3}t", target) # exactly three times findall(r"ba{2,}t", target) # at least two times findall(r"ba{,3}t", target) # up to three times findall(r"ba{2,4}t", target) # two to four times findall(
Alternatives search for one of two different strings.
r"one|two", "one two three") findall(
Anchors restrict matches to the start or end of a string:
r"^a..", "auf dem Haus") # anchor at the start findall(r"a..$", "auf dem Haus") # anchor at the end findall(
Groups are created by parentheses surrounding one or more characters. Groups may be modified by a quantifier and allow to extract parts of a match:
r"animal: (\w+) (\w+)", "animal: Mus musculus (mouse)") findall(
In the code below, we extract all codes that start with the character 2
from a string. The resulting list contains three elements, each of which is a tuple with two strings representing the first and second matched group, respectively. Of course, we can interate over the list of matches.
= "1a3, 1b4, 2c5, 2d6, 2e7, 3f8, 3g9"
codes = findall(r"(2.)(.)", codes)
codes_starting_with_two
codes_starting_with_two#> [('2c', '5'), ('2d', '6'), ('2e', '7')]
for code in codes_starting_with_two:
print(code[1])
#> 5
#> 6
#> 7
Pattern objects
Functions like findall()
are useful for simple searches. For more complex cases, we prefer to create a pattern object from a regular expression and then use its methods.
from re import compile
= compile(r"animal: (\w+) (\w+)")
p type(p)
print(p.pattern) # attribute: regex used for searching
print(p.groups) # attribute: number of groups
The pattern object has a method search()
that searches for matches in a string and returns a match object upon success.
= p.search("animal: Mus musculus (mouse)")
m type(m)
Let’s have a closer look at the match object:
# tuple with found groups
m.groups() 0] # whole match
m[1] # first group
m[# start and end index of the whole match
m.span() 2) # start index of the second group
m.start(1) # end index of the first group m.end(
Reading and writing files
How to read individual characters
By reading user input via input()
or from command line arguments (sys.argv
) and by displaying output on the console (print()
), a program may interact with the user. However, as soon as a program has to read large datasets, it it will become infeasible to enter those data manually. Thus, Python is able to read files existing files and to create new files and write contents to them.
Please copy the file codon_table.csv
from the folder /resources/python
into the folder python3
in your home directory and have a look at its first five lines:
AAA,Lys,K,Lysine
AAC,Asn,N,Asparagine
AAG,Lys,K,Lysine
AAT,Asn,N,Asparagine ACA,Thr,T,Threonine
This file apparently represents a codon table, where an amino acid is specified for each base triplet. We open the file by using the built-in function open()
, which returns a file object:
= open("codon_table.csv", "r")
f f
The second argument of open()
sets the file access mode – here, we only want to read the file. Python treats an opened file as a data stream, i.e., a (one-dimensional) sequence of data (in our case, characters) that can be accessed randomly.
Python may read from a file in the following manner: Immediately after a file has been opened, the current stream position points to the first character (index 0), as the method tell()
, well, tells us:
f.tell()
We read the first 30 characters via read()
and store them in the variable s
:
= f.read(30)
s s
These 30 characters comprise the first line and part of the second line. After the reading operation, the stream position lies at the 31st character (index 30):
f.tell()
We may, for example, go back to index 15 and read ten characters, or jump forward to index 100 and read two characters:
15)
f.seek(10)
f.read(
f.tell()
100)
f.seek(2)
f.read( f.tell()
How to read a file line by line
In many cases, we will not need random access to the file. Since text files generally are composed of lines, they are usually read line by line (method readline()
):
0)
f.seek(= f.readline()
l1
l1
= f.readline()
l2
l2 f.tell()
The strings l1
and l2
now contain the first and second line of the file, respectively, including the end of line character \n
. The stream position points to the first character in the third line. Note: Unfortunately, the end of a line is marked by different characters depending on the operation system. While Linux uses \n
(line feed), Windows uses \r\n
(carriage return plus line feed), and older versions of operating systems on Mac even used \r
. Thus, you may run into problems if you process a file that has been created in Windows on Linux.
Fortunately, we don’t have to repeatedly call readline()
to eventually read all files from: A file object may be iterated line by line, which facilitates reading within a for
loop:
0)
f.seek(for line in f:
print(line)
Even more simply, we may save all lines in a list via readlines()
:
0)
f.seek(= f.readlines()
all_lines all_lines
If we don’t need a file any longer, we must close it:
f.close()
How to write to a file
In order to be able to write to a file, we have to open it with the respective access mode:
= open("some_lines.txt", "w") f
Now the method write()
allows us to write strings to the file:
"First line\n")
f.write("Second line\n") f.write(
If we wish to write a collection of strings (e.g., a list of strings), the method writelines()
allows us to do so. Attention: Python will write the list elements one after each other to the file, without inserting a new line character. Thus, if the list elements represent individual lines, we have to add an explicit newline character to the list elements:
= ["three\n", "four\n", "five\n"]
lines f.writelines(lines)
Once we are done, we have to close the file. This is especially important after we have written data to the file, since Python may postpone writing operations. close()
ensures that any pending operations are finished.
f.close()
Finally, have a look at the created file (cat
).
Context objects
Context objects make file access even simpler. Instead of opening, reading/writing, and closing, a with
block is used:
with open("codon_table.csv", "r") as f:
= f.readlines()
all_lines
for line in all_lines:
print(line)
Create a context object by writing - keyword with
- function that returns a context object (here: open()
) - keyword as
- name for the context object (here: f
) - colon - indented statements (suite)
Within the suite, the context object is defined (in our example, the open file is available for reading). As soon as the control flow exits the suite, Python makes sure that the context object is deinitialized (in the case of files, any pending writing operations are executed, and the file is closed).
Exercises
Store all files that you generate for unit 3 in the folder python3
in your home directory.
Exercise 3.1 (3 P)
You have successfully solved this exercise as soon as you have worked through this unit. In particular, the folder python3
must contain the following files, which you have created in the course of this unit:
codon_table.csv
simple_protein.py
some_lines.txt
Exercise 3.2 (4 P)
Create a class Protein
with the following properties:
- The init method requires the arguments
name
,uniprot_id
, andsequence
, which are stored in attributes of the same name. - The method
get_length
returns the number of amino acids in the protein. - The method
contains
has one parameterpeptide
and returnsTrue
if the protein sequence contains the givenpeptide
sequence. - The method
get_mw
returns the molecular weight of the protein. The method has an optional parameterdisulfides
that specifies the number of disulfide bridges.
The molecular weight of a protein is given by the following formula:
sum of the masses of the amino acid residues
+ mass of a water molecule
- 2 * mass of hydrogen * number of disulfide bridges
Please use the following masses:
= dict(
AA_MASS =57.05132, A=71.0779, S=87.0773, P=97.11518, V=99.13106,
G=101.10388, C=103.1429, L=113.15764, I=113.15764, N=114.10264,
T=115.0874, Q=128.12922, K=128.17228, E=129.11398, M=131.19606,
D=137.13928, F=147.17386, R=156.18568, Y=163.17326, W=186.2099
H
)
= 18.01528
WATER_MASS = 1.00784 H_MASS
Store the class in exercise_3_2.py
.
You may check that your program works correctly by using the following exemplary calls:
= Protein("Galanin", "P22466", "GWTLNSAGYLLGPHAVGNHRSFSDKNGLTS")
galanin print(type(galanin))
#> <class '__main__.Protein'>
print(galanin.get_length())
#> 30
print(galanin.contains("CGSHLV"))
#> False
print(galanin.get_mw())
#> 3157.4102999999996
= Protein("Insulin B chain", "P01308", "FVNQHLCGSHLVEALYLVCGERGFFYTPKT")
insulin_B print(insulin_B.get_length())
#> 30
print(insulin_B.contains("CGSHLV"))
#> True
print(insulin_B.get_mw(disulfides=1))
#> 3427.9056799999994
- You may build upon the
Protein
class we defined above. The init method has to define three attributes. - You have also learned how to implement methods. Think about the number of parameters, required vs optional, and the return value – these decisions are the same as for function design.
- In
get_mw()
, you likely will need a counter variable and have to access dict values.
Exercise 3.3 (3 P)
Implement a function read_masses
which reads a table of atomic weights.
The function has a single parameter, which receives the name of the file containing the table.
A table in CSV format is provided (
/resources/python/average_mass.csv
), whose first five lines readH,1.008 He,4.0026 Li,6.94 Be,9.0122 B,10.81
Each line contains two records separated by a comma: The element symbol and the average atomic mass. Therefore, you may process each line with a regular expression containing two groups.
The function should return a dict, where element symbols and masses are the key-value pairs.
Store the function in exercise_3_3.py
.
You may check that your program works correctly by using the following exemplary calls:
= read_masses("average_mass.csv")
m print(m["N"])
#> 14.007
print(m["S"])
#> 32.06
print(2 * m["H"] + m["O"])
#> 18.015
- You will have to open the file and read its contents.
- The regex for parsing each line will contain two groups that capture the element and weight, respectively.
- Combine your knowledge on counter variables and dicts.
Exercise 3.4 (3 P)
Implement a function calculate_mass
, which calculates the mass of a chemical compound. The function is called with a string that contains the molecular formula (e.g., "C6 H12 O6"
or "C Cl4"
) and returns the mass as a float.
Hints:
- Extract individual element symbols and their counts via a regular expression containing two groups. The first group should match one uppercase letter followed by an optional lowercase letter. The second group should match any number of digits.
Please use the following masses:
= dict(
MASSES =1.008, He=4.0026, Li=6.94, Be=9.0122, B=10.81,
H=12.011, N=14.007, O=15.999, F=18.998, Ne=20.18,
C=22.99, Mg=24.305, Al=26.982, Si=28.085, P=30.974,
Na=32.06, Cl=35.45, Ar=39.948, K=39.098, Ca=40.078
S )
Store the function in exercise_3_4.py
.
You may check that your program works correctly by using the following exemplary calls:
print(calculate_mass("C6 H12 O6"))
#> 180.156
print(calculate_mass("H2 O"))
#> 18.015
print(calculate_mass("C34 H46 Cl N3 O10"))
#> 692.203
- For designing the regex to extract individual elements and their counts, think about how they are written: Each element starts with one uppercase letter, followed by one optional lowercase letter. The count has an arbitrary number of digits (i.e., it may also be missing, which indicates a count of 1). You will have to open the file and read its contents.
- You will have to iterate over
findall()
results.