Python Unit 4
NumPy
The n-dimensional array is the central data structure of NumPy. It may represent vectors (one dimension), matrices (two dimensions), and tensors (three or more dimensions). In addition, NumPy provides a tremendous number of functions and methods for manipulating arrays and performing calculations with them. The implementation of NumPy is highly efficient, making it practical to work with arrays that contain millions of elements.
Usually, the NumPy module is imported into the namespace np:
import numpy as np
Create an array
There are various ways to create an array:
- from other data structures (lists, dicts, …)
- via a function (e.g., np.zeros())
- by reading data from a file
some_numbers = [1, 4, 2, 5, 7]
a = np.array(some_numbers)  # create array from list
a

b = np.arange(24)
b

c = np.zeros((3, 5))
c
np.arange() takes the same parameters as the range() function. np.zeros() creates an array containing zeros. The tuple (3, 5), which is passed as an argument to this function, specifies the shape of the array, i.e., the number of elements in each dimension (called axes in NumPy). Here, the array contains three rows and five columns.
We may query basic properties of the array via the following attributes:
c.ndim   # number of axes (dimensions)
c.size   # number of elements
c.shape  # shape
The shape of an array may be changed. For instance, a vector may be converted into a matrix, as long as the number of elements remains constant. Thus, we may convert the vector b
(24 elements) to a 6x4 matrix or a 3x8 matrix, but not to a 5x5 matrix:
b.reshape((6, 4))   # ok
b.reshape((3, -1))  # ok; -1 -> automatically determine the required size
b.reshape((5, 5))   # error
Calling an array method returns a new array – the original object remains unchanged.
b2 = b.reshape((6, 4))
b2  # new 6x4 array
b   # old 24x1 array
Indexing and slicing
Indexing and slicing return single elements or parts of an array. Remember that Python uses zero-based indexing, and that the start, stop, and step values of the slicing operator are optional.
d = np.arange(12).reshape((4, 3))
d

d[0, 0]    # element in the first row and first column
d[2, 1]    # element in the third row and second column

d[0]       # first row
d[-1]      # last row
d[1:, :2]  # from the second row (inclusive) to the third column (exclusive)
d[:, 0]    # all rows, first column
d[::2]     # every second row
Basic operations
Arrays facilitate vector arithmetic, e.g., element-wise addition and multiplication, as well as the scalar product:
x = np.array([0, 1, 2, 10, 100])
y = np.array([3, 8, 12, 5, 13])

x + y
x * y
np.dot(x, y)
Here is a selection of methods for numeric arrays:
x.min()   # minimum value
x.max()   # maximum value
x.sum()   # sum of all values
y.prod()  # product of all values
y.mean()  # mean of all values
Interestingly, calculations may involve arrays of different shapes. For example, the scalar value \(x = 10\) can be multiplied by the vector \(y = (4, 1, 7)\) by using a technique called broadcasting. The smaller array is “broadcast” across the larger array so that they have compatible shapes.
x = np.array(10)
y = np.array([4, 1, 7])
x * y  # x is extended to [10, 10, 10] before multiplication
Broadcasting also works in higher dimensions:
np.array([(1, 2), (3, 4), (5, 6)]) + np.array([10, 100])
Reading and writing
NumPy provides several ways to store the contents of an array in a file. save()
creates a binary file in NPY format, while savetxt()
creates a human-readable text file.
p = np.arange(10, 32, 3).reshape((4, 2))
p

np.save("my_array.npy", p)
np.savetxt("my_array.txt", p)
For each of these file formats, there are also reading functions:
"my_array.npy")
np.load("my_array.txt") np.loadtxt(
Pandas
Create a DataFrame
Pandas builds upon NumPy arrays and provides data structures for comprehensive data analyses: Series
(a one-dimensional array with an index) and DataFrame
(a two-dimensional tabular structure). Each row of a DataFrame represents an observation, while each of its columns describes a variable (property) of these observations.
Usually, the Pandas module is imported into the namespace pd:
import pandas as pd
Pandas provides several ways to create a DataFrame object, e.g., from nested lists or from dicts, whose keys become column names:
= [["a", 2, True],
data_list "b", 5, True],
["c", 8, False]]
[
pd.DataFrame(data_list)
= dict(A=["a", "b", "c"],
data_dict =[2, 5, 8],
B=[True, True, False])
C pd.DataFrame(data_dict)
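A Series, the one-dimensional counterpart, can be created in a similar way; a minimal example with arbitrary values and index labels:

s = pd.Series([2, 5, 8], index=["a", "b", "c"])  # values plus an explicit index
s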
Importantly, a DataFrame may be created by reading a file containing tabular data, such as a file in CSV format. Copy the file amino_acids.csv
from /resources/python
into a subfolder python4
in your home directory and look at its first five lines:
code,abbreviation,name,chem1,chem2,chem3,abundance,pi
A,Ala,alanine,nonpolar,methylene,aliphatic,8.84,6.0
C,Cys,cysteine,polar,sulfur,aliphatic,1.24,5.0
D,Asp,aspartic acid,negative,methylene,aliphatic,5.39,3.0
E,Glu,glutamic acid,negative,methylene,aliphatic,6.24,3.2
Obviously, this file collects chemical properties of the proteinogenic amino acids. Thus, each line describes an amino acid (= observation); each column contains data on one property (= variable), e.g., the single-letter code or the isoelectric point.
Read the contents of this file via pd.read_csv():
= pd.read_csv("amino_acids.csv")
df df
read_csv() has more than 50 optional parameters, which can adjust almost any aspect of the reading process. In this way, it supports the plethora of CSV dialects found in the wild, since the CSV format has never been formally standardized.
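For instance, a semicolon-separated file containing comment lines could be read as follows (the file name is hypothetical; sep and comment are actual read_csv parameters):

pd.read_csv("other_table.csv", sep=";", comment="#")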
DataFrame properties
Let’s investigate the DataFrame we just created:
df.head()  # these methods work
df.tail()  # similar to the bash commands

df.index    # (row) index
df.columns  # column index (names)
df.values   # values of the elements – a NumPy array

df.size   # number of elements
df.shape  # number of rows and columns
Calculate descriptive statistics for the numeric columns:
df.median()  # median
df.mean()    # mean
df.std()     # standard deviation
Calling a method of a DataFrame returns a new object; the original DataFrame remains unchanged.
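A brief illustration with the methods from above (df_top is just an example name):

df_top = df.head()  # df_top contains the first five rows
df                  # df itself still contains all rows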
Selecting data
Pandas provides several ways to select parts of a DataFrame, such as:
- select columns via indexing with a string

df["code"]                 # selecting a single column returns a Series
df[["name", "abundance"]]  # selecting several columns returns a DataFrame

- select rows by slicing with integers

df[:4]   # rows with index 0 to 3
df[2:6]  # rows with index 2 to 5

- select rows with Booleans

df[df["pi"] > 7]  # selects all rows with basic amino acids

The statement above consists of three parts:

1. Select column pi (isoelectric point):
   df["pi"]
2. Check each value in this column if it is greater than 7. (Note that the scalar value 7 is broadcast along the vector representing the column.)
   df["pi"] > 7
3. Select those rows in df where comparison (2) is true.
- label-based indexing via the .loc attribute

df.loc[2, "code"]            # element in the third row and column "code"
df.loc[3:6, "code":"chem1"]  # elements in rows four to seven
                             # and columns "code" to "chem1"

Warning: In contrast to all other slicing operations we have encountered, slicing within .loc also includes the end element.

- position-based indexing via the .iloc attribute

df.iloc[4]          # row with index 4
df.iloc[:, 4]       # column with index 4
df.iloc[10:13, :4]  # rows with index 10 to 12, columns with index 0 to 3
Whenever we select a single column of a DataFrame, Python returns a Series
object. This class also implements a range of methods, e.g.,
= df["abundance"] # a Series
abundance sum() # sum of the values in the Series
abundance.
= df["pi"] # another Series
pI # mean of the values in the latter Series pI.mean()
Modifying data
In the previous section, we learned how to select parts of a DataFrame via the index. Thus, it may be useful to convert a column of the DataFrame to a new index, which is done via the set_index()
method:
= df.set_index("abbreviation")
df
df
"Ala"] # select the column describing alanine df.loc[
By contrast, the row index may be converted back to a column:
df = df.reset_index()
df
A DataFrame can be sorted based on the values in a row index, column index, or column. Keyword arguments of the respective method allow you to configure the sort.
= df.set_index("abbreviation")
df # sort by row index
df.sort_index() =1) # sort by column index, i.e.,
df.sort_index(axis# order columns by their name
"abundance") # sort ascending by "abundance"
df.sort_values("pi", ascending=False) # sort descending by "pi" df.sort_values(
You may add and remove columns:
"new_col"] = 3 # add a new column "new_col"
df[# the scalar 3 is broadcast
df
= df.drop("new_col", axis=1) # remove the column; axis=1 specifies
df # that we want to remove a column df
Joins
Pandas allows combining several DataFrames via a so-called join. In order to explain how joins work, load the table with amino acid data:
= pd.read_csv("amino_acids.csv")
df_left df_left
Also load a second CSV file (available in /resources/python):
= pd.read_csv("amino_acids_chemdata.csv")
df_right df_right
This second table lacks information on several amino acids (such as glycine) and describes two additional compounds “X” and “Z”.
A join is a binary operation and thus requires two operands, which are usually called “left table” and “right table”. In each table, one or several columns are denoted as keys that are compared for joining. In the examples below, column code
represents the key.
The merge() method allows performing the following join types:
The left join uses all observations (rows) of the left table and adds information from the right table if the latter contains an observation with matching values in the key columns.
="left", on="code") df_left.merge(df_right, how
The resulting DataFrame contains 20 rows (i.e., the same number as the left table) and 17 columns (i.e., seven more than the left table – those were added from the right table). The right table contained data, e.g., for key "A"; therefore, the values from its columns flexibility, i_vdw, etc. were added to the corresponding columns of the resulting DataFrame. By contrast, the right table did not contain data for key "F"; therefore, the columns added from the right table were filled with NaN for this key. (NaN means "not a number" and represents missing values in Pandas.)

The right join works similarly to the left join, but the roles of the left and right table are swapped:
="right", on="code") df_left.merge(df_right, how
The inner join only retains observations whose key exists in both the left and right table:
="inner", on="code") df_left.merge(df_right, how
The outer join retains observations whose key exists in at least one table:
= df_left.merge(df_right, how="outer", on="code") df_outer df_outer
Saving tables
The method to_csv()
saves the contents of a DataFrame into a CSV file:
"joined.csv") df_outer.to_csv(
Similar to the function for reading CSV files, to_csv() can be configured via many parameters.
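For instance, the row index can be omitted from the written file via the index parameter (the file name below is just an illustration):

df_outer.to_csv("joined_no_index.csv", index=False)  # omit the row index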
Exercises
Store all files that you generate for unit 4 in the folder python4
in your home directory.
For the exercises in Unit 4, you require files available under /resources/python. Copy those files into the folder ~/python4/ and write your scripts (and file imports) such that they can be run within the folder ~/python4/ (thus accessing and writing files without a preceding path).
Exercise 4.1 (3 P)
You have successfully solved this exercise as soon as you have worked through this unit. In particular, the folder python4
must contain the following files, which you have created in the course of this unit:
my_array.npy
my_array.txt
amino_acids.csv
amino_acids_chemdata.csv
joined.csv
Exercise 4.2 (3 P)
Implement a function is_normal
that checks whether two columns of a matrix are orthogonal to each other.
- The function should be called with three parameters:
  - the name of a file that contains a numeric matrix and can be read via np.loadtxt()
  - index of one column that should be checked
  - index of another column
- The function returns a Boolean value.
- Two vectors are orthogonal to each other if their scalar product is zero.
- Be careful when you check whether the calculated scalar product is zero. Generally, one should never compare two floats via the equality operator (==). Rather, use the function isclose() provided by the math module.
Store the function in exercise_4_2.py.
You may check that your program works correctly by using the following exemplary calls (vectors.txt is available in /resources/python):
# compares the vectors [1 0 0 0 0] and [0 0 1 0 0]
print(is_normal("vectors.txt", 0, 2))
#> True
# compares the vectors [3 -2 7 0 1] and [0 0 1 0 0]
print(is_normal("vectors.txt", 1, 2))
#> False
# compares the vectors [5 0 -2 8 -1] and [3 -2 7 0 1]
print(is_normal("vectors.txt", 4, 1))
#> True
- See above how to read an array from a file and slice it. Calculation of the dot product is also explained above.
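One possible sketch of such a function (the absolute tolerance passed via abs_tol is our own choice; it is needed because the default relative tolerance of math.isclose() can never match zero):

import math

import numpy as np

def is_normal(filename, col1, col2):
    """Check whether two columns of the matrix stored in filename are orthogonal."""
    matrix = np.loadtxt(filename)                       # read the numeric matrix
    product = np.dot(matrix[:, col1], matrix[:, col2])  # scalar product of the two columns
    return math.isclose(product, 0.0, abs_tol=1e-9)     # safe float comparison with zero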
Exercise 4.3 (4 P)
Implement a function translate()
that translates a DNA sequence into a protein sequence and calculates the isoelectric point of the latter.
While there are many ways to implement such a function, please adhere to the following specifications:
- The function has a single parameter, which receives the DNA sequence as a string. You may assume that the length of the string is always divisible by three.
- Within the function, first create a DataFrame with one column containing the codons of the DNA sequence. There are several ways to split the sequence into codons, e.g., via a for loop or by using the wrap() function from the textwrap module.
- Load the CSV files amino_acids.csv (in your python4 folder) and codon_table_small.csv (in /resources/python) as DataFrames.
- Join the three DataFrames. Think about the type of joins, their order, and which columns should be used as keys.
- The function must return two values:
  - a string containing the protein sequence (you will have to apply the string method join() to one of the columns of the join result)
  - a float giving the mean isoelectric point of the protein sequence (see above)
Store the function in exercise_4_3.py.
You may check that your program works correctly by using the following exemplary calls:
print(translate("AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCA"))
#> ('SPPGQAASEEAIKQITVLLP', 5.869999999999999)
print(translate("TGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACC"))
#> ('CASCPCWRCWPSGDLT', 5.824999999999999)
- See above how to create a DataFrame from a dict or from a CSV file, and how joins work. Also remember how to write a function that returns two values.
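A possible sketch, assuming that codon_table_small.csv contains the columns codon and code (verify the actual column names in the file):

import pandas as pd
from textwrap import wrap

def translate(dna):
    """Translate a DNA sequence and calculate the mean isoelectric point."""
    codons = pd.DataFrame(dict(codon=wrap(dna, 3)))     # one codon per row
    codon_table = pd.read_csv("codon_table_small.csv")  # assumed columns: codon, code
    amino_acids = pd.read_csv("amino_acids.csv")
    # left joins preserve the number and order of the codons
    merged = codons.merge(codon_table, how="left", on="codon")
    merged = merged.merge(amino_acids, how="left", on="code")
    protein = "".join(merged["code"])                   # concatenate the single-letter codes
    return protein, merged["pi"].mean()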
Exercise 4.4 (3 P)
You have used an online database to collect several properties of amino acids and downloaded the results as CSV file (/resources/python/data_ugly.csv
). Alas, when you have a closer look at the CSV file, you note that it looks somewhat strange (only the first ten rows are shown below):
# column 1: single-letter code
# column 2: name
# column 3: is the amino acid charged?
# column 4: isoelectric point
# column 5: for debugging
A;alanine;nope;6_0;-1
C;cysteine;nope;5_0;-1
D;aspartic acid;sure;3_0;-1
E;glutamic acid;sure;3_2;-1
F;phenylalanine;nope;5_5;-1
You identify several problems in this file:
- Columns are separated by semicolons (;).
- There is no row with column names. Instead, the first five rows apparently contain comments that start with a hash sign (#) and explain the contents of each column.
- The third column describes whether an amino acid is charged or not. Although you expected this column to contain Boolean values, the database labeled charged amino acids with sure and uncharged amino acids with nope.
- The fourth column contains the isoelectric point. For unknown reasons, however, the underscore (_) is used as decimal separator!
- The fifth column contains useless information and thus should be omitted.
Fortunately, you remember that Pandas’ read_csv()
function may be configured via optional parameters. Thus, you assume that this function may even be able to load this terrible CSV file. The loaded file should look as follows (only the top five rows are shown):
code name charged pi
0 A alanine False 6.0
1 C cysteine False 5.0
2 D aspartic acid True 3.0
3 E glutamic acid True 3.2
4 F phenylalanine False 5.5
Implement a function load_data() that has a single parameter (the name of the input file) and returns the correctly formatted DataFrame. In addition, your code should save this DataFrame to data_tidy.csv.
Store your code in exercise_4_4.py.
- Consult the Pandas documentation to learn which arguments you will need to specify.
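A possible sketch; all parameters below exist in read_csv, although you should verify in the documentation that decimal accepts the underscore:

import pandas as pd

def load_data(filename):
    """Read the messy CSV file and return a tidy DataFrame."""
    df = pd.read_csv(
        filename,
        sep=";",                # columns are separated by semicolons
        comment="#",            # skip the comment lines at the top
        header=None,            # the file contains no column names ...
        names=["code", "name", "charged", "pi", "debug"],  # ... so we supply them
        true_values=["sure"],   # interpret "sure" as True
        false_values=["nope"],  # interpret "nope" as False
        decimal="_",            # the underscore serves as decimal separator
    )
    return df.drop("debug", axis=1)  # the fifth column is useless

df = load_data("data_ugly.csv")
df.to_csv("data_tidy.csv", index=False)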