Exploring the Protein Data Bank programmatically | Oxford Protein Informatics Group

The Worldwide Protein Data Bank (wwPDB or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data. Most researchers interact with the PDB either by downloading and parsing individual entries as mmCIF files (or as legacy PDB files), or by downloading aggregated data, such as the RCSB’s collection in a single FASTA file of all polymer entity sequences. All too often, researchers end up laboriously writing their own file parsers to digest these files. In recent years though, more sophisticated tools have been made available that make it much easier to access only the data that you need.

Each of the wwPDB’s member organisations has some form of API to access the data they serve, each with different capabilities. The most mature and full-featured offering is that of the RCSB, which has multiple separate interfaces for accessing a wealth of different data and tools. It can seem quite complicated to the uninitiated but, for most cases where one would otherwise resort to reading mmCIF files, the search and data APIs are all you need.

The RCSB has put a lot of effort into making its APIs user-friendly, notwithstanding the labyrinthine data structures that they allow you to explore. You can, of course, interact with the APIs with a command-line HTTP tool like curl but the kind folks at the RCSB have also put together interactive browser-based query editors for the search and data APIs, giving you the advantages of auto-completion and syntax highlighting, to help you build the queries you need without necessarily being an expert in the mmCIF data schema. They even have a Python package, rcsb-api, making it really easy to read the PDB from within your own Python application. In the sections below, I will give a simple walk-through of some of the features of rcsb-api and how to get the most out of it.

Searching for structures

The search API is a REST interface that replicates the functionality of the search tools on the RCSB PDB website. In fact, when you build an advanced search on the website, you are actually constructing a query for the search API. After you submit a search, there is even a link to view, edit and rerun your query in the search API’s interactive query editor.

The RCSB’s advanced search query builder tool, after submitting a search, with a link (green arrow) to the search API query editor.
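To make that a little more concrete, here is a minimal sketch of the kind of JSON document that the search API accepts, built by hand from plain dictionaries. The layout (a group node combining terminal nodes, plus a return_type) and the operator names exact_match and greater_or_equal reflect my reading of the search API documentation. The helper names terminal and build_search_request are hypothetical, invented purely for illustration, so do check the result against the interactive query editor before relying on it.

```python
import json
from typing import Any


def terminal(attribute: str, operator: str, value: Any) -> dict[str, Any]:
    """One attribute comparison, expressed as a 'terminal' query node."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_search_request(
    nodes: list[dict[str, Any]], return_type: str = "entry"
) -> dict[str, Any]:
    """Combine terminal nodes with a logical AND and set the return type."""
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": return_type,
    }


# Hypothetical example: protein entries with a revision date in 2025 or later.
request = build_search_request(
    [
        terminal("entity_poly.rcsb_entity_polymer_type", "exact_match", "Protein"),
        terminal("rcsb_accession_info.revision_date", "greater_or_equal", "2025-01-01"),
    ]
)
print(json.dumps(request, indent=2))
```

Passing return_type="polymer_entity" instead would ask for polymer entity IDs rather than entry IDs. Nothing here touches the network; it only shows the shape of the request body.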

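Accessing the search API via the submodule rcsbapi.search from rcsb-api is quite intuitive and the execution is pretty performant. Different query types are available, including full text search, sequence similarity search, etc. By default, the results of such a query are PDB entry IDs, though one can specify other PDB objects as the return type, such as polymer entities. Here is an attribute search for all PDB entries labelled as protein structures and revised since the beginning of 2025. Note that the query filters on revision_date, so it matches newly released entries as well as older entries that were revised during that period; rcsb_accession_info.deposit_date or rcsb_accession_info.initial_release_date would be the attributes to use to filter strictly on deposition or release date.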

import datetime

from rcsbapi.search import search_attributes
from tqdm import tqdm

# Queries can be constructed from search attributes using comparison operators.
proteins = search_attributes.entity_poly.rcsb_entity_polymer_type == "Protein"
since_last_year = (
    search_attributes.rcsb_accession_info.revision_date
    >= datetime.date(2025, 1, 1).isoformat()
)

# Sub-queries can be combined using logical operators.
search_query = proteins & since_last_year

# Executing a query creates a `Session` instance that contains metadata about the
# response, like the number of results.
session = search_query()
# `Session` instances are also iterable, handling pagination of responses.  By wrapping
# `session` with `tqdm`, we can view a progress bar as the pages of results are
# collected.
session_with_progress = tqdm(
    session, total=session.count, unit="entry", desc="Fetching PDB entries"
)

proteins_this_year = set(session_with_progress)

Tested today, that query returned 24,087 PDB entries and took less than two seconds to execute.

Fetching PDB entries: 100%|██████████| 24087/24087 [00:01<00:00, 15696.14entry/s]

Perhaps I need not have bothered with the progress bar! Now proteins_this_year is a set of PDB entry ID strings:

{'9VAK',
 '9D5Z',
 '9H4Z',
 '9N5X',
 '9BSS',
 ...}
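For a feel of what handling pagination involves, here is a rough sketch of how a large result set could be split into page requests by hand. It assumes that the search API takes a request_options.paginate object with start and rows fields, which is my understanding of the request format. The page size of 1000 is an arbitrary choice for illustration, not the value that rcsb-api uses, and page_windows and with_page are hypothetical helpers.

```python
from collections.abc import Iterator
from typing import Any


def page_windows(total: int, rows: int) -> Iterator[dict[str, int]]:
    """Yield one {"start", "rows"} window per page needed to cover `total` hits."""
    for start in range(0, total, rows):
        yield {"start": start, "rows": rows}


def with_page(request: dict[str, Any], window: dict[str, int]) -> dict[str, Any]:
    """Return a copy of `request` that asks for a single page of results."""
    return {**request, "request_options": {"paginate": window}}


# A placeholder request: the "query" part is deliberately left empty here.
base_request: dict[str, Any] = {"query": {}, "return_type": "entry"}
pages = [with_page(base_request, w) for w in page_windows(total=24087, rows=1000)]
print(len(pages), pages[-1]["request_options"])
```

With 24,087 hits and 1000 rows per page, that comes to 25 requests, the last one starting at offset 24,000. Iterating over the `Session` object spares you all of this bookkeeping.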

Now that we have the IDs of the PDB entries we are interested in, what do we do with them? To extract the data from those entries, we need the data API.

Extracting only the data we need

Just to add to the muddle of different APIs and different ways of querying them, the RCSB actually provides two data APIs, not one. One is a REST interface, much like the search API, and the other is a more sophisticated GraphQL interface.

The REST data API

The REST API has access to some data that are not available via the GraphQL interface. For most applications, these extra endpoints are not of much value but if, perhaps, you maintain a fairly long-established database of antibody structures, it may be necessary periodically to check things like which PDB entries have been withdrawn. This information, and other ‘holdings’ data, is not available via the GraphQL API but it is available via the REST API.

import httpx

# The main REST data API path.
data_rest_endpoint = "https://data.rcsb.org/rest/v1/"

# Find the IDs of all entries ever to have been removed from the PDB.
query = httpx.get(data_rest_endpoint + "holdings/removed/entry_ids", timeout=3.05)
removed_entries = set(query.json())

As with the results of our search query, that gives us a set of PDB entry ID strings.

{'4D8Q',
 '5ZOM',
 '5E6L',
 '5BRC',
 '2IJS',
 ...}

Reassuringly, none of the protein entries returned by our search has had to be withdrawn yet!

proteins_this_year & removed_entries

gives us

set()
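For the database-maintenance use case mentioned above, the same check might be wrapped up as a small helper. This is only a sketch: find_withdrawn is a hypothetical name, and the upper-casing assumes that a local database may store IDs in a different case from the upper-case IDs returned by the holdings endpoint.

```python
from collections.abc import Iterable


def find_withdrawn(local_ids: Iterable[str], removed_ids: Iterable[str]) -> list[str]:
    """Return, sorted, those local entry IDs that appear among the removed IDs."""
    removed = {entry_id.upper() for entry_id in removed_ids}
    return sorted({entry_id.upper() for entry_id in local_ids} & removed)


# Example IDs borrowed from the outputs shown earlier in this post.
local_database = ["9vak", "4d8q", "9H4Z"]
removed = {"4D8Q", "5ZOM", "5E6L"}
print(find_withdrawn(local_database, removed))
```

In real use, removed would be the set fetched from holdings/removed/entry_ids rather than a hard-coded sample.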

The GraphQL data API

Far more powerful, for targeted trawling of the PDB, is the GraphQL data API. GraphQL permits us to traverse the full hierarchy of PDB objects, extracting from each only the data that we specify, all in a single query. To get to grips with the GraphQL interface, without necessarily needing a strong grasp of GraphQL syntax or the mmCIF schema, the RCSB provides an interactive query editor, much like the one for the search API. In fact, when you look at the RCSB web page for a single PDB entry, several of the tabs contain links to the query editor, already populated with the query that was used to populate the data fields on that tab. This way, if you know that you want a subset of the data that is already displayed on the PDB entry page, you can whittle down the existing query to the data that you need.
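To show what traversing the hierarchy looks like in GraphQL terms, here is a toy sketch that turns dotted field paths into a nested GraphQL selection under an entries(entry_ids: ...) root, which is how I understand the RCSB schema to address entries. The functions selection_tree, render and build_query are hypothetical helpers written for this illustration and perform none of the schema validation that a real client would.

```python
import json
from typing import Any


def selection_tree(paths: list[str]) -> dict[str, Any]:
    """Merge dotted field paths into a nested dict, one level per path component."""
    tree: dict[str, Any] = {}
    for path in paths:
        node = tree
        for field in path.split("."):
            node = node.setdefault(field, {})
    return tree


def render(tree: dict[str, Any], indent: int = 2) -> str:
    """Render the nested dict as an indented GraphQL selection set."""
    pad = "  " * indent
    lines: list[str] = []
    for field, children in tree.items():
        if children:
            lines.append(f"{pad}{field} {{")
            lines.append(render(children, indent + 1))
            lines.append(f"{pad}}}")
        else:
            lines.append(f"{pad}{field}")
    return "\n".join(lines)


def build_query(entry_ids: list[str], paths: list[str]) -> str:
    """Wrap the selection set in an `entries` query for the given entry IDs."""
    # json.dumps gives a double-quoted list, which is also valid GraphQL list syntax.
    ids = json.dumps(entry_ids)
    body = render(selection_tree(paths))
    return f"{{\n  entries(entry_ids: {ids}) {{\n{body}\n  }}\n}}"


query = build_query(
    ["8TQO"],
    [
        "polymer_entities.rcsb_id",
        "polymer_entities.entity_poly.pdbx_seq_one_letter_code_can",
    ],
)
print(query)
```

The printed query nests rcsb_id and entity_poly under polymer_entities, so a single request descends from the entry to each of its polymer entities and returns only the named fields.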

For a practical example of using the GraphQL data API via rcsb-api, if we wanted to take the set of protein entries returned by our search and extract from each of the constituent polymer entities the entity ID and the canonical one-letter amino acid sequence, we could do so like this.

from rcsbapi.data import DataQuery

entries_query = DataQuery(
    input_type="entries",
    input_ids=list(proteins_this_year),
    return_data_list=[
        "polymer_entities.rcsb_id",
        "polymer_entities.entity_poly.pdbx_seq_one_letter_code_can",
    ],
)

seqs_by_entry = entries_query.exec(progress_bar=True)

Behind the scenes, this request is broken down into separate requests, processing the input entry IDs in batches, and rate limited to prevent abuse of the API servers. It takes on the order of a minute to complete this simple query.

100%|██████████| 81/81 [00:46<00:00,  1.74it/s]
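The batching step is easy to picture with a few lines of standard-library Python. This sketch assumes a batch size of 300 IDs per request. That number is my inference rather than a documented fact: it is consistent with the 81 steps in the progress bar above, since 24,087 IDs in batches of 300 need 81 requests. The batches helper is hypothetical and the IDs are dummies.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batches(ids: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` IDs until the input is exhausted."""
    iterator = iter(ids)
    while batch := list(islice(iterator, size)):
        yield batch


BATCH_SIZE = 300  # Assumed value, inferred from the progress bar; not authoritative.

dummy_ids = [f"ID{i}" for i in range(24087)]
all_batches = list(batches(dummy_ids, BATCH_SIZE))
print(len(all_batches), len(all_batches[-1]))
```

That gives 81 batches, the last of which holds the remaining 87 IDs. On Python 3.12 or later, itertools.batched does the same job without the helper.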

The output from this is a many-layered dictionary of dictionaries and lists of dictionaries, describing the various data that we’ve requested and their position in the graph of the PDB object hierarchy. It is not necessarily intuitive to digest, but it is much less complicated than parsing a full mmCIF file. We can reduce it to a simple dictionary of polymer entity IDs and one-letter sequences like so. And, while we’re at it, let’s remove any polymer entities comprising fewer than 70 residues.

seqs: dict[str, str] = {
    entity["rcsb_id"]: seq
    for entry in seqs_by_entry["data"]["entries"]
    for entity in entry["polymer_entities"]
    if len(seq := entity["entity_poly"]["pdbx_seq_one_letter_code_can"]) >= 70
}

Now seqs has a value something like this.

{'8TQO_1': 'MAAPVVAPPGVVVSRANKRSGAGPGGSGGGGARGAEEEPPPPLQAVLVADSFDRRFFPISKDQPRVLLPLANVALIDYTLEFLTATGVQETFVFCCWKAAQIKEHLLKSKWCRPTSLNVVRIITSELYRSLGDVLRDVDAKALVRSDFLLVYGDVISNINITRALEEHRLRRKLEKNVSVMTMIFKESSPSHPTRCHEDNVVVAVDSTTNRVLHFQKTQGLRRFAFPLSLFQGSSDGVEVRYDLLDCHISICSPQVAQLFTDNFDYQTRDDFVRGLLVNEEILGNQIHMHVTAKEYGARVSNLHMYSAVCADVIRRWVYPLTPEANFTDSTTQSCTHSRHNIYRGPEVSLGHGSILEENVLLGSGTVIGSNCFITNSVIGPGCHIGDNVVLDQTYLWQGVRVAAGAQIHQSLLCDNAEVKERVTLKPRSVLTSQVVVGPNITLPEGSVISLHPPDAEEDEDDGEFSDDSGADQEKDKVKMKGYNPAEVGAAGKGYLWKAAGMNMEEEEELQQNLWGLKINMEEESESESEQSMDSEEPDSRGGSPQMDDIKVFQNEVLGTLQRGKEENISCDNLVLEINSLKYAYNVSLKEVMQVLSHVVLEFPLQQMDSPLDSSRYCALLLPLLKAWSPVFRNYIKRAADHLEALAAIEDFFLEHEALGISMAKVLMAFYQLEILAEETILSWFSQRDTTDKGQQLRKNQQLQRFIQWLKEAEEESSEDD',
 '8TQO_2': 'MEFQAVVMAVGGGSRMTDLTSSIPKPLLPVGNKPLIWYPLNLLERVGFEEVIVVTTRDVQKALCAEFKMKMKPDIVCIPDDADMGTADSLRYIYPKLKTDVLVLSCDLITDVALHEVVDLFRAYDASLAMLMRKGQDSIEPVPGQKGKKKAVEQRDFIGVDSTGKRLLFMANEADLDEELVIKGSILQKHPRIRFHTGLVDAHLYCLKKYIVDFLMENGSITSIRSELIPYLVRKQFSSASSQQGQEEKEEDLKKKELKSLDIYSFIKEANTLNLAPYDACWNACRGDRWEDLSRSQVRCYVHIMKEGLCSRVSTLGLYMEANRQVPKLLSALCPEEPPVHSSAQIVSKHLVGVDSLIGPETQIGEKSSIKRSVIGSSCLIKDRVTITNCLLMNSVTVEEGSNIQGSVICNNAVIEKGADIKDCLIGSGQRIEAKAKRVNEVIVGNDQLMEI',
 '8TQO_3': 'MHHHHHHGGGSENLYFQSPGSAAKGSELSERIESFVETLKRGGGPRSSEEMARETLGLLRQIITDHRWSNAGELMELIRREGRRMTAAQPSETTVGNMVRRVLKIIREEYGRLHGRSDESDQQESLHKLLTSGGLNEDFSFHYAQLQSNIIEAINELLVELEGTMENIAAQALEHIHSNEVIMTIGFSRTVEAFLKEAARKRKFHVIVAECAPFCQGHEMAVNLSKAGIETTVMTDAAIFAVMSRVNKVIIGTKTILANGALRAVTGTHTLALAAKHHSTPLIVCAPMFKLSPQFPNEEDSFHKFVAPEEVLPFTEGDILEKVSVHCPVFDYVPPELITLFISNIGGNAPSYIYRLMSELYHPDDHVL',
 '8TQO_4': 'MAAVAVAVREDSGSGMKAELPPGPGAVGREMTKEEKLQLRKEKKQQKKKRKEEKGAEPETGSAVSAAQCQVGPTRELPESGIQLGTPREKVPAGRSKAELRAERRAKQEAERALKQARKGEQGGPPPKASPSTAGETPSGVKRLPEYPQVDDLLLRRLVKKPERQQVPTRKDYGSKVSLFSHLPQYSRQNSLTQFMSIPSSVIHPAMVRLGLQYSQGLVSGSNARCIALLRALQQVIQDYTTPPNEELSRDLVNKLKPYMSFLTQCRPLSASMHNAIKFLNKEITSVGSSKREEEAKSELRAAIDRYVQEKIVLAAQAISRFAYQKISNGDVILVYGCSSLVSRILQEAWTEGRRFRVVVVDSRPWLEGRHTLRSLVHAGVPASYLLIPAASYVLPEVSKVLLGAHALLANGSVMSRVGTAQLALVARAHNVPVLVCCETYKFCERVQTDAFVSNELDDPDDLQCKRGEHVALANWQNHASLRLLNLVYDVTPPELVDLVITELGMIPCSSVPVVLRVKSSDQ',
 '8PX3_1': 'MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAILCWGELMTLATWVGVNLEDPASRDLVVSYVNTNMGLKFRQLLWFHISCLTFGRETVIEYLVSFGVWIRTPPAYRPPNAPILSTLPETTVVRRRGRSPRRRTPSPRRRRSQSPRRRRSQSRESQC',
 ...}
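To make the shape of that nested response easier to see, here is a hand-written mock with the same layering, together with a slightly more defensive version of the flattening step. The IDs and sequences in the mock are invented placeholders, not real PDB data, and flatten_sequences is a hypothetical helper. The guards against null values are a precaution, since GraphQL permits null for missing fields; I have not checked whether nulls actually occur for these particular fields.

```python
from typing import Any

# A mock response with made-up IDs and sequences, mimicking the real layering:
# data -> entries (list) -> polymer_entities (list) -> entity_poly -> sequence.
mock_response: dict[str, Any] = {
    "data": {
        "entries": [
            {
                "polymer_entities": [
                    {
                        "rcsb_id": "MOCK_1",
                        "entity_poly": {"pdbx_seq_one_letter_code_can": "A" * 80},
                    },
                    {
                        "rcsb_id": "MOCK_2",
                        "entity_poly": {"pdbx_seq_one_letter_code_can": "G" * 12},
                    },
                ]
            },
            {"polymer_entities": None},  # An entry with nothing to report.
        ]
    }
}


def flatten_sequences(response: dict[str, Any], min_length: int = 70) -> dict[str, str]:
    """Map polymer entity ID to sequence, skipping short or missing sequences."""
    seqs: dict[str, str] = {}
    for entry in response["data"]["entries"]:
        for entity in entry.get("polymer_entities") or []:
            poly = entity.get("entity_poly") or {}
            seq = poly.get("pdbx_seq_one_letter_code_can")
            if seq and len(seq) >= min_length:
                seqs[entity["rcsb_id"]] = seq
    return seqs


print(flatten_sequences(mock_response))
```

Only MOCK_1 survives the filter, because the MOCK_2 placeholder is shorter than 70 residues and the second entry has no polymer entities at all.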

Getting better performance from the data API

Even accounting for the necessary rate limit in submitting requests to the data API, I was puzzled as to why a fairly simple query should be taking as long as 45 seconds. We can delve a little into what is happening inside rcsbapi.data.DataQuery.exec. It is using asyncio, so I would expect it to be fairly performant, but it feels more sluggish than it should. Without digging into the details or profiling it properly, I have a suspicion that the way it implements the rate limiter using an asyncio.Semaphore may not be very efficient. To see if we can do better, let’s cannibalise DataQuery.exec and implement our own stripped-down version, taking advantage of the httpx-limiter package to do the rate limiting instead.
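It may help to separate two ideas that are easy to conflate: a semaphore caps how many requests are in flight at once, whereas a rate limiter caps how often new requests may start. The toy sketch below shows both, using only the standard library. It is not how rcsb-api or httpx-limiter implements either; IntervalRateLimiter and fake_request are hypothetical, the request is simulated with a short sleep, and the numbers are chosen only to keep the demonstration quick.

```python
import asyncio
import time


class IntervalRateLimiter:
    """Space out acquisitions so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0  # Monotonic time at which the next request may start.

    async def acquire(self) -> None:
        now = time.monotonic()
        # Reserve the next free slot before sleeping, so concurrent callers queue up.
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def fake_request(
    i: int,
    limiter: IntervalRateLimiter,
    semaphore: asyncio.Semaphore,
    started: list[float],
) -> int:
    await limiter.acquire()  # Rate limit: when a request is allowed to start.
    async with semaphore:  # Concurrency limit: how many may be in flight at once.
        started.append(time.monotonic())
        await asyncio.sleep(0.01)  # Stand-in for network latency.
    return i


async def main(
    n: int = 5, rate: float = 100.0, max_concurrent: int = 2
) -> tuple[list[int], list[float]]:
    limiter = IntervalRateLimiter(rate)
    semaphore = asyncio.Semaphore(max_concurrent)
    started: list[float] = []
    tasks = [fake_request(i, limiter, semaphore, started) for i in range(n)]
    results = await asyncio.gather(*tasks)
    return list(results), started


results, started = asyncio.run(main())
print(results, f"{started[-1] - started[0]:.3f}s between first and last start")
```

With five requests at 100 per second, the starts are spread over roughly 0.04 seconds regardless of how quickly each simulated request completes, while the semaphore independently ensures that no more than two are ever running together.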

For completeness, and in order that you can take this snippet of code and run it yourself as a stand-alone script, I’ve included the search API call and the holdings data query to the REST data API that we made earlier. If you have uv installed, you can just make this script executable and call it. The script is also available as a GitHub gist here.

#!/usr/bin/env -S uv run --script --no-project
#
# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "httpx",
#   "httpx-limiter",
#   "rcsb-api",
#   "tqdm",
# ]
# ///

import asyncio
import datetime
from collections.abc import Iterable, Sized
from functools import partial
from itertools import batched
from typing import Any, Protocol, TypeVar

import httpx
from httpx_limiter.async_rate_limited_transport import AsyncRateLimitedTransport
from httpx_limiter.rate import Rate
from rcsbapi.config import config
from rcsbapi.const import const
from rcsbapi.data import DATA_SCHEMA
from rcsbapi.search import search_attributes
from tqdm import tqdm

T = TypeVar("T", covariant=True)


class SizedIterable(Iterable[T], Sized, Protocol):
    """Type spec for an iterable that also has length."""


# RCSB data REST API endpoint.
DATA_REST_ENDPOINT = "https://data.rcsb.org/rest/v1/"

# RCSB data GraphQL API rate limit.
rate = Rate.create(config.DATA_API_REQUESTS_PER_SECOND)
# RCSB data GraphQL API concurrent connection limits.
limits = httpx.Limits(
    max_connections=config.DATA_API_MAX_CONCURRENT_REQUESTS,
    max_keepalive_connections=config.DATA_API_MAX_CONCURRENT_REQUESTS,
)


def search_proteins_this_year() -> set[str]:
    """
    Get all protein entries in the PDB with a revision date in the current year.
    """
    # Query for protein entries.
    proteins = search_attributes.entity_poly.rcsb_entity_polymer_type == "Protein"
    # Query for this year's entries.
    this_year = datetime.date(datetime.date.today().year, 1, 1).isoformat()
    entries_this_year = search_attributes.rcsb_accession_info.revision_date >= this_year

    # Query for this year's protein entries.
    search_query = proteins & entries_this_year

    # Submit the query.
    session = search_query()
    # Collect the paginated response.
    session_with_progress = tqdm(
        session, total=session.count, unit="entries", desc="Fetching PDB entries"
    )

    return set(session_with_progress)


def removed_entries() -> set[str]:
    """
    Find the IDs of all entries ever to have been removed from the PDB.
    """
    query = httpx.get(DATA_REST_ENDPOINT + "holdings/removed/entry_ids", timeout=3.05)
    return set(query.json())


async def data_query(
    input_ids: list[str],
    input_type: str,
    return_data_list: list[str],
    client: httpx.AsyncClient,
) -> list[dict[str, Any]]:
    """
    A coroutine that constructs a GraphQL query and posts it to the data API server.

    args:
        input_ids: A list of ID strings of objects to request.
        input_type: The PDB object type defining the entry level of the schema graph.
        return_data_list: The requested data nodes.
        client: A `httpx` client context.
    """
    # We need not write our own GraphQL query — who has time to learn the syntax?
    # We can just use the method `DATA_SCHEMA.construct_query` instead.
    query = DATA_SCHEMA.construct_query(
        input_type=input_type,
        input_ids=input_ids,
        return_data_list=return_data_list,
    )
    # POST the query to the data API endpoint, using the RCSB default API timeout.
    response: httpx.Response = await client.post(
        const.DATA_API_ENDPOINT, json=query, timeout=config.API_TIMEOUT
    )
    # Check we have not received an HTTP error.
    response.raise_for_status()
    # Get the result of the query.
    result: dict[str, dict[str, list[dict[str, Any]]]] = response.json()
    # Raise any and all GraphQL errors as a single `ValueError`.
    if errors := result.get("errors"):
        raise ValueError(
            "\n".join(f"{i}: {e['message']}" for i, e in enumerate(errors, 1))
        )

    # Extract and return the requested result data.
    return result["data"][input_type]


async def data_query_with_limits(
    input_type: str,
    input_ids: SizedIterable[str],
    return_data_list: list[str],
) -> list[dict[str, Any]]:
    """
    A coroutine that posts a GraphQL query while respecting rate and size limits.

    Queries are split into subqueries, ensuring that each subquery requests no more
    input IDs than the API server's batch size limit.  Subqueries are staggered, to
    abide by the server's rate limit.

    args:
        input_type: The PDB object type defining the entry level of the schema graph.
        input_ids: A sized iterable of ID strings of objects to request.
        return_data_list: The requested data nodes.
    """
    # Submit the `input_ids` in batches so as not to overwhelm the RCSB data API server.
    batches = map(list, batched(input_ids, config.DATA_API_BATCH_ID_SIZE))

    # Apply connection rate and concurrency limits.
    transport = AsyncRateLimitedTransport.create(rate)
    async with httpx.AsyncClient(transport=transport, limits=limits) as c:
        c.headers.update({"Accept": "application/json"})

        query = partial(
            data_query,
            input_type=input_type,
            return_data_list=return_data_list,
            client=c,
        )

        # A sort of asynchronous `map(query, batches)`.
        queries = (asyncio.create_task(query(batch)) for batch in batches)

        # Display a progress bar.
        with tqdm(
            total=len(input_ids),
            unit=input_type,
            desc=f"Fetching requested data from PDB {input_type}",
        ) as progress:
            results: list[dict[str, Any]] = []
            # Aggregate the results of all the queries, run concurrently within limits.
            for task in asyncio.as_completed(queries):
                result = await task
                results.extend(result)
                progress.update(len(result))

    return results


if __name__ == "__main__":
    # Get all PDB protein entries released this year and not yet removed.
    print("Finding this year's protein entries in the PDB.")
    proteins_this_year = search_proteins_this_year()
    proteins_this_year -= removed_entries()

    # Get entity ID and sequence for each polymer entity in this year's protein entries.
    return_data_list = [
        "polymer_entities.rcsb_id",
        "polymer_entities.entity_poly.pdbx_seq_one_letter_code_can",
    ]
    print("Getting the sequences of each polymer entity in each PDB entry.")
    main = data_query_with_limits("entries", proteins_this_year, return_data_list)
    seqs_by_entry = asyncio.run(main)

    seqs: dict[str, str] = {
        entity["rcsb_id"]: seq
        for entry in seqs_by_entry
        for entity in entry["polymer_entities"]
        if len(seq := entity["entity_poly"]["pdbx_seq_one_letter_code_can"]) >= 70
    }

That yields the same output as the rcsbapi.data.DataQuery approach, with roughly a 5× speed-up.

Finding this year's protein entries in the PDB.
Fetching PDB entries: 100%|██████████| 24087/24087 [00:01<00:00, 15481.79entries/s]
Getting the sequences of each polymer entity in each PDB entry.
Fetching requested data from PDB entries: 100%|██████████| 24087/24087 [00:08<00:00, 2725.26entries/s]

Author

Ben Williams
