Some useful pandas functions | Oxford Protein Informatics Group

Pandas is one of the most widely used packages for data analysis in Python. The library provides functionality that lets you perform complex data manipulation in a few lines of code. However, because the number of functions it provides is huge, it is impossible to keep track of all of them. More often than we’d like to admit, we end up writing lines and lines of code only to discover later that the same operation can be performed with a single pandas function.

To help avoid this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information on crystal structures in the PDB.

The data I will use is stored in a pandas dataframe. The frame contains the ID of each structure deposited in the PDB, the method the structure was solved with and its resolution. The first few rows of the dataframe are printed below.

df.head(5)

pdb	resolution	structure_method
102l	1.74	x-ray diffraction
103l	1.9	x-ray diffraction
104l	2.8	x-ray diffraction
107l	1.8	x-ray diffraction
108l	1.8	x-ray diffraction

Query()

Using the query function, you can filter a dataframe for rows that satisfy a boolean expression. Here, I am selecting the rows of the dataframe that correspond to PDB structures solved by X-ray crystallography.

X_ray = df.query('structure_method == "x-ray diffraction"')
X_ray.head(2)

pdb	resolution	structure_method
102l	1.74	x-ray diffraction
103l	1.9	x-ray diffraction

This performs the same operation as the code below, but is usually quicker to write and easier to read when the boolean expression is complex.

X_ray = df[df.structure_method == 'x-ray diffraction']
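
The difference becomes more visible once a filter combines several conditions. The sketch below is a minimal illustration on a made-up toy frame (the IDs, values and the max_resolution cutoff are assumptions, not taken from the PDB data); it writes the same compound filter once with query, using @ to reference a local variable, and once with boolean indexing.

```python
import pandas as pd

# Toy stand-in for the PDB dataframe; IDs and values are made up for illustration.
toy = pd.DataFrame(
    {
        "resolution": [1.74, 1.9, 2.8, 3.2, 4.1],
        "structure_method": [
            "x-ray diffraction",
            "x-ray diffraction",
            "x-ray diffraction",
            "electron microscopy",
            "electron microscopy",
        ],
    },
    index=pd.Index(["aaa1", "aaa2", "aaa3", "aaa4", "aaa5"], name="pdb"),
)

max_resolution = 2.0  # hypothetical cutoff, referenced inside query() with @

# Compound condition written as a single query string.
with_query = toy.query(
    'structure_method == "x-ray diffraction" and resolution < @max_resolution'
)

# The same filter with boolean indexing: each condition needs its own
# parentheses and the frame name is repeated for every column.
with_mask = toy[
    (toy.structure_method == "x-ray diffraction")
    & (toy.resolution < max_resolution)
]

print(with_query)
```

Both versions select the same rows; the query string simply avoids the repeated frame name and the extra parentheses.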

Describe()

The pandas describe function quickly calculates summary statistics of your data. The code below returns a summary of the resolution of PDB structures solved by X-ray crystallography.

X_ray.describe(percentiles=[.9])

By default, the function computes the count, mean, standard deviation, minimum and maximum as well as the 25th, 50th and 75th percentiles of all numeric columns of a dataframe. The percentiles reported can be customised with the percentiles argument, as shown above.

	resolution
count	142143
mean	2.1301
std	0.577651
min	0.48
50%	2.03
90%	2.89
max	10
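
To see how the percentiles argument changes the rows that are reported, here is a small sketch on a made-up five-value column (the values are an assumption chosen so the numbers are easy to check by hand). It compares the default summary with one that requests only the 90th percentile.

```python
import pandas as pd

# Made-up values, chosen so that the percentiles are easy to verify by hand.
toy = pd.DataFrame({"resolution": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Default summary: includes the 25%, 50% and 75% rows.
default_summary = toy.describe()

# Custom summary: the quartile rows are replaced by the requested percentile.
custom_summary = toy.describe(percentiles=[0.9])

print(list(default_summary.index))
print(list(custom_summary.index))
print(custom_summary.loc["90%", "resolution"])
```

The percentile is linearly interpolated between neighbouring values, which is why the 90% row falls between 4.0 and 5.0 here.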

Apply()

This function allows you to apply a specified function to each row or column of a dataframe. Let’s say, for example, we want to have a look at the amino acid sequence of each PDB structure. First, we need to define a function that takes a row of the dataframe, reads the PDB code from it and returns the sequence of that structure. We can then easily apply this function to add a sequence column to the dataframe.

from Bio.PDB import PDBParser

def sequence_from_pdb(row):
    # row.name is the row's index label, assumed here to be the PDB ID
    pdb = row.name
    file_path = '/path/to/file/{}.pdb'
    parser = PDBParser()
    structure = parser.get_structure('struc', file_path.format(pdb))

    # Collect the three-letter residue names in the order they appear in the file
    seq = [residue.get_resname() for residue in structure.get_residues()]

    return seq

df['sequence'] = df.apply(sequence_from_pdb, axis=1)
df.head(2)

pdb	resolution	structure_method	sequence
102l	1.74	x-ray diffraction	['MET', 'ASN', 'ILE', 'PHE', 'GLU', 'MET', 'LEU', 'ARG', 'ILE', 'ASP', 'GLU', 'GLY', 'LEU', 'ARG', 'LEU', 'LYS', 'ILE', 'TYR', 'LYS', 'ASP', 'THR', 'GLU', 'GLY', 'TYR', 'TYR', 'THR', 'ILE', 'GLY', 'ILE', 'GLY', 'HIS', 'LEU', 'LEU', 'THR', 'LYS', 'SER', 'PRO', 'SER', …]
103l	1.9	x-ray diffraction	['MET', 'ASN', 'ILE', 'PHE', 'GLU', 'MET', 'LEU', 'ARG', 'ILE', 'ASP', 'GLU', 'GLY', 'LEU', 'ARG', 'LEU', 'LYS', 'ILE', 'TYR', 'LYS', 'ASP', 'THR', 'GLU', 'GLY', 'TYR', 'TYR', 'THR', 'ILE', 'GLY', 'ILE', 'GLY', 'HIS', 'LEU', 'LEU', 'THR', 'SER', 'LEU', 'ASP', 'ALA', …]

When applying functions to a large dataframe, we might want to keep track of the progress. The tqdm package offers a version of the apply function that prints a progress bar.

from tqdm import tqdm
tqdm.pandas()
df['sequence'] = df.progress_apply(sequence_from_pdb, axis=1)
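
The example above needs local PDB files to run. As a file-free sketch of the same mechanics, the snippet below applies a small hypothetical function (describe_row, not part of the original analysis) to each row of a three-row frame, assuming as before that the PDB ID is the index. It also shows the column-wise direction, which is what apply does when axis is left at its default of 0.

```python
import pandas as pd

# The first three structures from the example data, with the PDB ID as index.
toy = pd.DataFrame(
    {"resolution": [1.74, 1.9, 2.8]},
    index=pd.Index(["102l", "103l", "104l"], name="pdb"),
)


def describe_row(row):
    # With axis=1 each call receives one row as a Series;
    # row.name is that row's index label (here the PDB ID).
    return f"{row.name}: {row['resolution']:.2f}"


# Row-wise: one function call per row, one result per row.
toy["label"] = toy.apply(describe_row, axis=1)

# Column-wise (default axis=0): one function call per column.
spread = toy[["resolution"]].apply(lambda col: col.max() - col.min())

print(toy)
print(spread)
```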

Cut()/qcut()

The cut function bins an array of values into discrete intervals. When analysing the PDB data we might want to bin the structures into groups depending on their resolution. We define the edges of high, medium and low resolution bins and provide their labels. pd.cut() then returns the bin label for each structure.

bin_edges = [0, 2, 3.5, 10]
bin_labels = ['high', 'medium', 'low']
X_ray['resolution_group'] = pd.cut(X_ray.resolution, bins=bin_edges, labels=bin_labels)

X_ray.head(3)

pdb	resolution	structure_method	resolution_group
102l	1.74	x-ray diffraction	high
103l	1.9	x-ray diffraction	high
104l	2.8	x-ray diffraction	medium
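
One detail worth checking is what happens to values that sit exactly on a bin edge. By default pd.cut treats each interval as closed on the right, so with the edges above a resolution of exactly 2.0 lands in the high bin. The sketch below uses a few made-up resolution values (an assumption for illustration) to compare the default with right=False, which closes the intervals on the left instead.

```python
import pandas as pd

# Made-up resolutions, including values that sit exactly on a bin edge.
resolution = pd.Series([1.74, 2.0, 2.8, 3.5, 7.0])

bin_edges = [0, 2, 3.5, 10]
bin_labels = ["high", "medium", "low"]

# Default: intervals are (0, 2], (2, 3.5], (3.5, 10].
right_closed = pd.cut(resolution, bins=bin_edges, labels=bin_labels)

# right=False: intervals are [0, 2), [2, 3.5), [3.5, 10).
left_closed = pd.cut(resolution, bins=bin_edges, labels=bin_labels, right=False)

print(pd.DataFrame({
    "resolution": resolution,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Values outside all intervals are returned as NaN, so it is worth making sure the outer edges cover the full range of the data.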

The qcut function is very similar and, like cut, divides an array into discrete bins. However, instead of letting the user specify the boundaries of each interval, qcut chooses them from the quantiles of the data, so that the q bins each contain roughly the same number of values.

pd.qcut(X_ray.resolution, q=3, labels=['high', 'medium', 'low']).value_counts()

	resolution
medium	48563
high	48178
low	45402
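
To inspect which boundaries qcut picked, the retbins=True option returns the bin edges alongside the labels. The following is a minimal sketch on nine made-up resolution values (an assumption for illustration), chosen so that three quantile bins hold exactly three values each.

```python
import pandas as pd

# Nine made-up resolutions, so that q=3 gives three bins of three values.
resolution = pd.Series([1.0, 1.5, 1.8, 2.0, 2.2, 2.5, 2.9, 3.4, 4.0])

# retbins=True additionally returns the edges that qcut computed.
groups, edges = pd.qcut(
    resolution,
    q=3,
    labels=["high", "medium", "low"],
    retbins=True,
)

print(groups.value_counts())
print(edges)  # four edges: the minimum, the two tertile cut points, the maximum
```

On real data the bins are only approximately equal in size, as in the counts above, because tied values cannot be split across a boundary.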

Author

Fabian Spoendlin