Python’s Data Classes

When writing code, you inevitably need to store data as it moves through your pipeline. Usually you assign a value, list or data frame to a variable so you can use it elsewhere in your code. Sometimes, however, your data has an awkward shape: several lists of different lengths, or values of different types and sizes. Tuples or dictionaries can help, but accessing individual elements quickly becomes messy, and it gets harder to see what your code is actually doing.

To solve this problem, data classes were introduced in Python 3.7. A data class is a regular Python class with certain methods, such as __init__, __repr__ and __eq__, already implemented for you. This makes data classes easy to create and removes a lot of boilerplate (repeated code), leaving them simpler, more intuitive and prettier. Further, as the dataclasses module is part of the standard library, you can import it directly without installing any external dependencies (noice).
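To make the boilerplate claim concrete, here is a minimal sketch comparing a hand-written class with a data class holding the same data. The class names PlainAntibody and DataAntibody are made up for this illustration.

```python
from dataclasses import dataclass


class PlainAntibody:
    """Regular class: __init__, __repr__ and __eq__ written by hand."""

    def __init__(self, vgene: str, jgene: str):
        self.vgene = vgene
        self.jgene = jgene

    def __repr__(self):
        return f"PlainAntibody(vgene={self.vgene!r}, jgene={self.jgene!r})"

    def __eq__(self, other):
        if not isinstance(other, PlainAntibody):
            return NotImplemented
        return (self.vgene, self.jgene) == (other.vgene, other.jgene)


@dataclass
class DataAntibody:
    """Data class: the same three methods are generated from the fields."""

    vgene: str
    jgene: str


a = DataAntibody("V3", "J2")
b = DataAntibody("V3", "J2")
print(a)       # DataAntibody(vgene='V3', jgene='J2')
print(a == b)  # True
```

Both classes behave the same way when created, printed and compared, but the data class needs only the two field declarations.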

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: str
    sequence: Any = 'EVQ'

In the above example, a data class for antibodies was created using the @dataclass decorator. The inside of a data class looks a little different from a normal class but, as mentioned, works essentially the same way. Instead of writing an __init__ method, you list each field, followed by a colon and the type of the value. Every field needs a type annotation, but Python does not enforce it at runtime; if a field should accept any type, annotate it with Any from the typing module. To give a field a default value, add it after an equals sign, as seen with the ‘sequence’ field. Fields with defaults must come after fields without them.

antibody = Antibody(vgene='V3', jgene='J2', sequence='EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMGRVRRAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKSSYYDILTGEFDYWGQGTLVTVSS') 

print(antibody.sequence)

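You can now easily access the fields of your antibody object, as seen with ‘antibody.sequence’. As with any regular class, you can also attach new attributes of any type after creation, as seen below. Note that such attributes are not data class fields, so they do not show up when you print the object or compare it with another.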

antibody.species = 'human' 
antibody.source = ['spleen', 'subject_1']
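One caveat is worth seeing in code: attributes added after creation are ordinary instance attributes, not data class fields, so the generated methods ignore them. A small sketch, using a trimmed-down Antibody with a shortened, illustrative sequence:

```python
from dataclasses import dataclass, fields


@dataclass
class Antibody:
    vgene: str
    jgene: str
    sequence: str = "EVQ"


antibody = Antibody(vgene="V3", jgene="J2")
antibody.species = "human"  # plain attribute, added after creation

# Only the declared fields are known to the data class machinery.
print([f.name for f in fields(antibody)])  # ['vgene', 'jgene', 'sequence']
print(antibody)  # Antibody(vgene='V3', jgene='J2', sequence='EVQ')
print(antibody == Antibody("V3", "J2"))  # True, species is not compared
print(antibody.species)  # human
```

If an attribute should be printed and compared along with the rest, declare it as a field in the class body instead.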

As mentioned, data classes are basically normal classes, so you can extend your data class with useful methods or properties, such as the numbering of your antibody sequence.

from dataclasses import dataclass
import anarci  # third-party ANARCI package, not part of the standard library

@dataclass
class Antibody:
    vgene: str
    jgene: str
    sequence: str

    @property
    def get_numbering(self):
        return anarci.run_anarci(self.sequence)[1][0][0][0]

Once the data class above is defined, you can get the numbered sequence in the following elegant way. Because get_numbering is a property, it is accessed without parentheses.

antibody = Antibody('V3', 'J2', 'EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMGRVRRAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKSSYYDILTGEFDYWGQGTLVTVSS')

print(antibody.get_numbering)

Data classes include additional functionality, such as choosing whether instances are mutable, inheriting from other data classes, automatically deriving fields from other fields and storing metadata. However, the really big benefit of data classes is the cleaner and more elegant code you can write.
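The extra features listed above can be sketched in a few lines. This is an illustrative example only: the LabelledAntibody class, the length field and the metadata key "alphabet" are assumptions made up for the sketch, and the sequence is shortened.

```python
from dataclasses import FrozenInstanceError, dataclass, field, fields


@dataclass(frozen=True)  # frozen=True makes instances immutable
class Antibody:
    vgene: str
    jgene: str
    # metadata is a free-form mapping stored on the field, unused by Python
    sequence: str = field(default="EVQ", metadata={"alphabet": "amino acids"})
    # init=False: not a constructor argument, derived in __post_init__
    length: int = field(init=False)

    def __post_init__(self):
        # Frozen instances block normal assignment, so use object.__setattr__
        object.__setattr__(self, "length", len(self.sequence))


@dataclass(frozen=True)
class LabelledAntibody(Antibody):
    """Inherits all fields of Antibody and adds one more."""

    # Needs a default because the parent already has a field with a default
    species: str = "human"


ab = LabelledAntibody("V3", "J2", "EVQLLESGGG")
print(ab.length)  # 10
print(fields(ab)[2].metadata["alphabet"])  # amino acids

try:
    ab.vgene = "V1"
except FrozenInstanceError:
    print("frozen: cannot reassign vgene")
```

A frozen data class is a reasonable choice when the object represents a fixed record, since it guards against accidental changes later in the pipeline.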
