Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB  crystal structures in complex with that target.

Referencing this manually is very easy for cases where we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.

If we have the ChEMBL ID of a compound, the UniChem database (https://www.ebi.ac.uk/unichem/) can be used to cross-reference it with the PDB. A guide to the UniChem web services can be found here.

Let’s say we are interested in finding structures for Staurosporine, a notoriously promiscuous kinase inhibitor. Staurosporine’s ChEMBL ID is CHEMBL388978.

The UniChem API query to go from ChEMBL ID to PDB ligand ID then looks like:

https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL388978/1/3

where the “1” after the ChEMBL ID indicates that this is a CheMBL ID, and the “3” after that – that a PDB ID output is requested. Using the Requests module in Python allows integrating such API calls with the rest of our analysis.

import requests
import json

def chembl_comp_to_pdb(chembl_compound_id):
    chembl_q = f"https://www.ebi.ac.uk/unichem/rest/src_compound_id/{chembl_compound_id}/1/3"
    chembl_res = requests.get(chembl_q)
    n = json.loads(chembl_res.text)
    print(n)
    if (type(n) is list) and (len(n)==0):
        return None
    else:
        return n[0]["src_compound_id"]

Supplying Staurosporine’s chembl ID to the above function reveals that its PDB ID is ‘STU’.

We can then use the PDB’s Web Services to retrieve structures where this ligand is present. Specifically, I used the PDB’s Search API, documentation and examples for which can be found here.

def get_pdb_entries_with_pdb_comp(pdb_comp_id):
    pdb_q = {"query": {
                "type": "terminal",
                "service": "text",
                "parameters": {
                  "attribute": "rcsb_nonpolymer_instance_feature_summary.comp_id",
                  "operator": "exact_match",
                  "value": ""
                }
              },
                "request_options": {"return_all_hits": True},
            
              "return_type": "entry"
            }
    pdb_q["query"]["parameters"]["value"] = pdb_comp_id
    pdb_res = requests.get(" https://search.rcsb.org/rcsbsearch/v1/query?json=" + json.dumps(pdb_q))
    
    if len(pdb_res.text) > 0:
        resp = json.loads(pdb_res.text)
        return [resp["result_set"][i]['identifier'] for i in range(len(resp['result_set'])) if resp["result_set"][i]['score'] == 1]

The service type is “text”, as we are making a text query using the compound’s ID. The attribute name is somewhat trickier to get, but all attributes are listed here (for structure-related queries), and here  ( for chemical queries). On the structural side, ligands are classified as ‘nonpolymer instance features’, and the comp_id is the attribute we want to query by (‘STU’ for Staurosporine).

The request options part of the query deals with pagination of the results. In the default case, only the first 10 results are returned, but all results can be returned by setting the ‘return_all_hits’ flag to True, as shown here: https://search.rcsb.org/#pagination.

The ‘score’ part of the return statement references the search API’s relevance score, which goes up to 1.0 (most relevant).

The above function returns a whopping 86 PDB structures for Staurosporine, which is not surprising, given its promiscuity within the kinase family of proteins. If we are interested only in complexes with a particular target protein, let’s say the kinase CDK2, the target’s UniProt ID can be used to filter the results. For CDK2, the UniProt identifier is “P24941”.

To query the PDB based on UniProt identifier,  I used the PDB’s GraphQL-based API, the documentation for which can be found here: https://data.rcsb.org/#gql-api

def match_uniprot_from_pdbids(pdb_ids, uniprot_id):
    pdb_str = "["
    for pdbid in pdb_ids[:-1]:
        pdb_str += f""" "{pdbid}","""
    pdb_str += f""" "{pdb_ids[-1]}"]"""

    pdb_q1 = """{
                  entries(entry_ids:""" + pdb_str + """){
                    polymer_entities {
                      rcsb_id
                      rcsb_polymer_entity_container_identifiers {
                        reference_sequence_identifiers {
                          database_accession
                          database_name
                        }
                      }
                    }
                  }
                }"""
    pdb_res = requests.get("https://data.rcsb.org/graphql?query=" + pdb_q1)
    m = json.loads(pdb_res.text)

    struct_list = []

    for pdbid in m['data']['entries']:
        for entity in pdbid['polymer_entities']: 
            pid = entity['rcsb_id']
            try:
                for db in entity['rcsb_polymer_entity_container_identifiers']['reference_sequence_identifiers']:
                    if db['database_name'] == 'UniProt':
                        uni_id = db['database_accession']
                        if uni_id == uniprot_id:
                                    struct_list.append(pid)

            except TypeError as e:
                print(e)
                
    return struct_list

Running the above shows that there are 4 PDB IDs featuring both human CDK2 and Staurosporine:

1AQ1, 4ERW, 4EZ7, 7NVQ

Note – what I have shown above are the solutions I arrived at after some Googling – do let me know if there are better ways to do this!

Author