Retrieving AlphaFold models from AlphaFoldDB

There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. Structures can be downloaded either in bulk (sorted by genome) or individually from the webpage for a prediction.

If you want just a few hundred or a few thousand specific structures, across different genomes, neither of these options is particularly practical. For example, if you have several thousand experimental structures for which you know the PDB [3] codes, and you want to obtain the equivalent AlphaFold predictions, there is another way!

If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.

1. Query UniProt [4] for the corresponding accession number – an example Python script is shown below:
import urllib
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

def get_uniprot(query='', query_type='PDB_ID'):
    # Code found at https://chem-workflows.com/articles/2019/10/29/retrieve-uniprot-data-using-python/
    # query_type must be "PDB_ID" or "ACC"
    url = 'https://www.uniprot.org/uploadlists/'  # web service used to retrieve the UniProt data
    params = {
    'from':query_type,
    'to':'ACC',
    'format':'txt',
    'query':query
    }

    data = urllib.parse.urlencode(params)
    data = data.encode('ascii')
    request = urllib.request.Request(url, data)
    with urllib.request.urlopen(request) as response:
        res = response.read()
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        page = BeautifulSoup(res, 'html.parser').get_text()
        page = page.splitlines()
    return page

pdb_code = '4FXF'
query_output = get_uniprot(query=pdb_code, query_type='PDB_ID')
# The second line is the accession (AC) line; the primary accession is listed first
accession_number = query_output[1].split()[1].strip(';')
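
The indexing on the accession line is easy to get wrong, so it can help to isolate the parsing step. The sketch below assumes the response is UniProt flat-file text, in which the line starting with AC lists the primary accession first, followed by any secondary accessions. The helper name primary_accession and the sample lines are illustrative and not part of the original script.

```python
def primary_accession(lines):
    """Return the first accession on the first 'AC' line, or None if absent."""
    for line in lines:
        if line.startswith('AC '):
            fields = line.split()
            # fields[0] is the 'AC' tag; fields[1] is the primary accession
            if len(fields) > 1:
                return fields[1].rstrip(';')
    return None

# Made-up sample lines in the flat-file layout (not real UniProt output)
sample = ['ID   EXAMPLE_HUMAN   Reviewed;   100 AA.',
          'AC   P12345; Q00000;']
print(primary_accession(sample))
```

Searching for the AC tag rather than relying on a fixed line index means the helper returns None instead of raising an error when the query comes back empty.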

2. Download the file containing the UniProt accession IDs that have an AlphaFold prediction in AlphaFoldDB, using the FTP page (http://ftp.ebi.ac.uk/pub/databases/alphafold/) or from the terminal:

curl http://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.txt -o accession_ids.txt

3. Check whether there is an AlphaFold prediction for the UniProt accession ID that you retrieved and, if there is, retrieve the AlphaFold ID for the prediction – an example Python script is shown below:

# Map each UniProt accession (first column) to its AlphaFold ID (fourth column)
accession_dict = {}
with open('accession_ids.txt', 'r') as accession_file:
    for line in accession_file:
        fields = line.strip().split(',')
        accession_dict[fields[0]] = fields[3]

accession_id = 'P14618'
alphafold_id = None

if accession_id in accession_dict:
    alphafold_id = accession_dict[accession_id]
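
The accession file can be large, so when only a handful of accessions are of interest it may be preferable to stream it rather than hold every row in a dictionary. This is a minimal sketch, assuming the same comma-separated layout as the script above (accession in the first column, AlphaFold ID in the fourth). The function name find_alphafold_ids is hypothetical.

```python
import csv

def find_alphafold_ids(path, wanted):
    """Scan the accession file once and return {accession: AlphaFold ID}
    for the accessions in `wanted` that are present in the file."""
    wanted = set(wanted)
    found = {}
    with open(path, newline='') as handle:
        for row in csv.reader(handle):
            # Assumed layout: row[0] = UniProt accession, row[3] = AlphaFold ID
            if len(row) > 3 and row[0] in wanted:
                found[row[0]] = row[3]
                if len(found) == len(wanted):
                    break  # every requested accession has been located
    return found
```

A call such as find_alphafold_ids('accession_ids.txt', ['P14618']) would then return only the entries that were asked for, and accessions without a prediction are simply missing from the result.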

4. Finally, download the file by constructing the URL for the predicted structure. The current version of AlphaFoldDB is 2, although this is likely to change in the future. The predicted accuracy of the structure is also available for download. An example Python script is shown below:

import os

alphafold_id = 'AF-P14618-F1'
database_version = 'v2'  # update when a newer AlphaFoldDB version is released
model_url = f'https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{database_version}.pdb'
error_url = f'https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_{database_version}.json'

os.system(f'curl {model_url} -o {alphafold_id}.pdb')
os.system(f'curl {error_url} -o {alphafold_id}.json')

This can easily be turned into a loop that iterates over a list of PDB codes, and downloads those that have a corresponding AlphaFold prediction within AlphaFoldDB.
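
One way such a loop might look is sketched below. It assumes the PDB-to-UniProt lookup is passed in as a function (for instance, a wrapper around the query in step 1) and that the accession dictionary from step 3 has already been built. The names plan_downloads and download_all are hypothetical, and the URL pattern is the one used in step 4.

```python
import urllib.request

def plan_downloads(pdb_codes, lookup_accession, accession_dict, database_version='v2'):
    """Return (url, filename) pairs for every PDB code with an AlphaFold prediction."""
    base = 'https://alphafold.ebi.ac.uk/files'
    plan = []
    for pdb_code in pdb_codes:
        accession = lookup_accession(pdb_code)
        alphafold_id = accession_dict.get(accession)
        if alphafold_id is None:
            continue  # no mapped accession, or no prediction in AlphaFoldDB
        plan.append((f'{base}/{alphafold_id}-model_{database_version}.pdb',
                     f'{alphafold_id}.pdb'))
        plan.append((f'{base}/{alphafold_id}-predicted_aligned_error_{database_version}.json',
                     f'{alphafold_id}.json'))
    return plan

def download_all(plan):
    """Fetch each planned URL into its target file in the working directory."""
    for url, filename in plan:
        urllib.request.urlretrieve(url, filename)
```

Keeping the planning step separate from the download step makes it possible to inspect the list of URLs before any network requests are made; download_all(plan_downloads(...)) then performs the actual transfer.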

REFERENCES

[1] Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021)
[2] Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021)
[3] Berman, H.M., Henrick, K., Nakamura, H. Announcing the worldwide Protein Data Bank. Nature Structural Biology (2003)
[4] The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research (2021)
