{"id":13020,"date":"2025-09-11T09:28:03","date_gmt":"2025-09-11T08:28:03","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13020"},"modified":"2025-09-11T14:50:39","modified_gmt":"2025-09-11T13:50:39","slug":"exploring-the-protein-data-bank-programmatically","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/09\/exploring-the-protein-data-bank-programmatically\/","title":{"rendered":"Exploring the Protein Data Bank programmatically"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/www.wwpdb.org\/\">Worldwide Protein Data Bank<\/a> (wwPDB or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data.  Most researchers interact with the PDB either by downloading and parsing individual entries as <a href=\"https:\/\/pdb101.rcsb.org\/learn\/guide-to-understanding-pdb-data\/beginner%E2%80%99s-guide-to-pdbx-mmcif\">mmCIF files<\/a> (or as <a href=\"https:\/\/www.wwpdb.org\/documentation\/file-formats-and-the-pdb\">legacy PDB files<\/a>), or by downloading aggregated data, such as the <a href=\"https:\/\/www.rcsb.org\/\">RCSB<\/a>&#8216;s collection in a single FASTA file of <a href=\"https:\/\/www.rcsb.org\/docs\/programmatic-access\/file-download-services#sequence-data\">all polymer entity sequences<\/a>.  All too often, researchers end up laboriously writing their own file parsers to digest these files.  In recent years though, more sophisticated tools have been made available that make it much easier to access only the data that you need.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\">Each of the wwPDB&#8217;s member organisations has some form of API to access the data they serve, each with different capabilities.  
The most mature and full-featured offering is that of the RCSB, which has <a href=\"https:\/\/www.rcsb.org\/docs\/programmatic-access\/web-apis-overview\">multiple separate interfaces<\/a> for accessing a wealth of different data and tools.  It can seem quite complicated to the uninitiated but, for most cases where one would otherwise resort to reading mmCIF files, the <em><a href=\"https:\/\/search.rcsb.org\/index.html\">search<\/a><\/em> and <em><a href=\"https:\/\/data.rcsb.org\/index.html\">data<\/a><\/em> APIs are all you need.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The RCSB has put a lot of effort into making its APIs user-friendly, notwithstanding the labyrinthine data structures that they allow you to explore.  You can, of course, interact with the APIs with a command-line HTTP tool like <code>curl<\/code> but the kind folks at the RCSB have also put together interactive browser-based query editors for the search and data APIs, giving you the advantages of auto-completion and syntax highlighting, to help you build the queries you need without necessarily being an expert in the mmCIF data schema.  They even have a Python package, <code><a href=\"https:\/\/rcsbapi.readthedocs.io\/en\/latest\/index.html\">rcsb-api<\/a><\/code>, making it really easy to read the PDB from within your own Python application.  In the sections below, I will give a simple walk-through of some of the features of <code>rcsb-api<\/code> and how to get the most out of it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Searching for structures<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The search API is a REST interface that replicates the functionality of the search tools on the RCSB PDB website.  In fact, when you build an advanced search on the website, you are actually constructing a query for the search API.  
After you submit a search, there is even a link to view, edit and rerun your query in the search API&#8217;s <a href=\"https:\/\/search.rcsb.org\/query-editor.html\">interactive query editor<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-1.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"308\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-1.png?resize=625%2C308&#038;ssl=1\" alt=\"The RCSB's advanced search query builder tool, after submitting a search, with a link to the search API query editor.\" class=\"wp-image-13034\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-1.png?w=762&amp;ssl=1 762w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-1.png?resize=300%2C148&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-1.png?resize=624%2C308&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption class=\"wp-element-caption\">The RCSB&#8217;s advanced search query builder tool, after submitting a search, with a link (green arrow) to the search API query editor.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Accessing the search API via the submodule <code>rcsbapi.search<\/code> from <code>rcsb-api<\/code> is quite intuitive and the execution is pretty performant.  Several query types are available, including full-text search and sequence similarity search.  By default, the results of such a query are PDB entry IDs, though one can specify <a href=\"https:\/\/rcsbapi.readthedocs.io\/en\/latest\/search_api\/query_construction.html#return-types\">other PDB objects as the return type<\/a>, such as polymer entities.  
Here is an attribute search for all PDB entries labelled as protein structures and revised since the beginning of 2025.  Because the query filters on <code>revision_date<\/code>, it matches older entries that have been revised this year as well as new ones.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import datetime\n\nfrom rcsbapi.search import search_attributes\nfrom tqdm import tqdm\n\n# Queries can be constructed from search attributes using comparison operators.\nproteins = search_attributes.entity_poly.rcsb_entity_polymer_type == \"Protein\"\nsince_last_year = (\n    search_attributes.rcsb_accession_info.revision_date\n    >= datetime.date(2025, 1, 1).isoformat()\n)\n\n# Sub-queries can be combined using logical operators.\nsearch_query = proteins &amp; since_last_year\n\n# Executing a query creates a `Session` instance that contains metadata about the\n# response, like the number of results.\nsession = search_query()\n# `Session` instances are also iterable, handling pagination of responses.  By wrapping\n# `session` with `tqdm`, we can view a progress bar as the pages of results are\n# collected.\nsession_with_progress = tqdm(\n    session, total=session.count, unit=\"entry\", desc=\"Fetching PDB entries\"\n)\n\nproteins_this_year = set(session_with_progress)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Tested today, that query returned 24,087 PDB entries and took less than two seconds to execute.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Fetching PDB entries: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 24087\/24087 [00:01&lt;00:00, 15696.14entry\/s]<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Perhaps I need not have bothered with the progress bar! 
Now <code>proteins_this_year<\/code> is a set of PDB entry ID strings:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{'9VAK',\n '9D5Z',\n '9H4Z',\n '9N5X',\n '9BSS',\n ...}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we have the IDs of the PDB entries we are interested in, what do we do with them?  To extract the data from those entries, we need the data API.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Extracting only the data we need<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Just to add to the muddle of different APIs and different ways of querying them, the RCSB actually provides <em>two<\/em> data APIs, not one.  One is a <a href=\"https:\/\/data.rcsb.org\/index.html#rest-api\">REST interface<\/a>, much like the search API, and the other is a more sophisticated <a href=\"https:\/\/data.rcsb.org\/index.html#gql-api\">GraphQL interface<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The REST data API<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The REST API has access to some data that are not available via the GraphQL interface.  For most applications, these extra endpoints are not of much value but if, perhaps, you maintain a fairly long-established <a href=\"https:\/\/opig.stats.ox.ac.uk\/webapps\/sabdab-sabpred\/sabdab\">database of antibody structures<\/a>, it may be necessary periodically to check things like which PDB entries have been withdrawn.  
This information, and other &#8216;holdings&#8217; data, is not available via the GraphQL API but it is available via the REST API.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import httpx\n\n# The main REST data API path.\ndata_rest_endpoint = \"https:\/\/data.rcsb.org\/rest\/v1\/\"\n\n# Find the IDs of all entries ever to have been removed from the PDB.\nquery = httpx.get(data_rest_endpoint + \"holdings\/removed\/entry_ids\", timeout=3.05)\nremoved_entries = set(query.json())<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As with the results of our search query, that gives us a set of PDB entry ID strings.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{'4D8Q',\n '5ZOM',\n '5E6L',\n '5BRC',\n '2IJS',\n ...}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Reassuringly, none of the protein entries deposited since the beginning of the year have had to be withdrawn yet!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">proteins_this_year &amp; removed_entries<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">gives us<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">set()<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">The GraphQL data API<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Far 
more powerful, for targeted trawling of the PDB, is the GraphQL data API. GraphQL permits us to traverse the full hierarchy of PDB objects, extracting from each only the data that we specify, all in a single query. To get to grips with the GraphQL interface, without necessarily needing a strong grasp of GraphQL syntax or the mmCIF schema, the RCSB provides an <a href=\"https:\/\/data.rcsb.org\/graphql\/index.html\">interactive query editor<\/a>, much like the one for the search API. In fact, when you look at the RCSB web page for a single PDB entry, several of the tabs contain links to the query editor, already populated with the query that was used to populate the data fields on that tab. This way, if you know that you want a subset of the data that is already displayed on the PDB entry page, you can whittle down the existing query to the data that you need.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"285\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?resize=625%2C285&#038;ssl=1\" alt=\"The RCSB structure summary page for entry 12E8, with a link to the GraphQL data API query editor.\" class=\"wp-image-13057\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?w=988&amp;ssl=1 988w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?resize=300%2C137&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?resize=768%2C350&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/09\/Pasted-image-2.png?resize=624%2C284&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption class=\"wp-element-caption\">The RCSB 
structure summary page for entry 12E8, with a link (green arrow) to the GraphQL data API query editor.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For a practical example of using the GraphQL data API via <code>rcsb-api<\/code>, if we wanted to take our set of protein entries deposited since the start of 2025 and extract from each of the constituent polymer entities the entity ID and the canonical one-letter amino acid sequence, we could do so like this.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from rcsbapi.data import DataQuery\n\nentries_query = DataQuery(\n    input_type=\"entries\",\n    input_ids=list(proteins_this_year),\n    return_data_list=[\n        \"polymer_entities.rcsb_id\",\n        \"polymer_entities.entity_poly.pdbx_seq_one_letter_code_can\",\n    ],\n)\n\nseqs_by_entry = entries_query.exec(progress_bar=True)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Behind the scenes, this request is broken down into separate requests, processing the input entry IDs in batches, and rate limited to prevent abuse of the API servers.  It takes on the order of a minute to complete this simple query.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 81\/81 [00:46&lt;00:00,  1.74it\/s]<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The output from this is a many-layered dictionary of dictionaries and lists of dictionaries, describing the various data that we&#8217;ve requested and their position in the graph of the PDB object hierarchy. 
It is not necessarily intuitive to digest, but it is much less complicated than parsing a full mmCIF file. We can reduce it to a simple dictionary of polymer entity IDs and one-letter sequences like so. And, while we&#8217;re at it, let&#8217;s remove any polymer entities comprising fewer than 70 residues.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">seqs: dict[str, str] = {\n    entity[\"rcsb_id\"]: seq\n    for entry in seqs_by_entry[\"data\"][\"entries\"]\n    for entity in entry[\"polymer_entities\"]\n    if len(seq := entity[\"entity_poly\"][\"pdbx_seq_one_letter_code_can\"]) >= 70\n}\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now <code>seqs<\/code> has a value something like this.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{'8TQO_1': 'MAAPVVAPPGVVVSRANKRSGAGPGGSGGGGARGAEEEPPPPLQAVLVADSFDRRFFPISKDQPRVLLPLANVALIDYTLEFLTATGVQETFVFCCWKAAQIKEHLLKSKWCRPTSLNVVRIITSELYRSLGDVLRDVDAKALVRSDFLLVYGDVISNINITRALEEHRLRRKLEKNVSVMTMIFKESSPSHPTRCHEDNVVVAVDSTTNRVLHFQKTQGLRRFAFPLSLFQGSSDGVEVRYDLLDCHISICSPQVAQLFTDNFDYQTRDDFVRGLLVNEEILGNQIHMHVTAKEYGARVSNLHMYSAVCADVIRRWVYPLTPEANFTDSTTQSCTHSRHNIYRGPEVSLGHGSILEENVLLGSGTVIGSNCFITNSVIGPGCHIGDNVVLDQTYLWQGVRVAAGAQIHQSLLCDNAEVKERVTLKPRSVLTSQVVVGPNITLPEGSVISLHPPDAEEDEDDGEFSDDSGADQEKDKVKMKGYNPAEVGAAGKGYLWKAAGMNMEEEEELQQNLWGLKINMEEESESESEQSMDSEEPDSRGGSPQMDDIKVFQNEVLGTLQRGKEENISCDNLVLEINSLKYAYNVSLKEVMQVLSHVVLEFPLQQMDSPLDSSRYCALLLPLLKAWSPVFRNYIKRAADHLEALAAIEDFFLEHEALGISMAKVLMAFYQLEILAEETILSWFSQRDTTDKGQQLRKNQQLQRFIQWLKEAEEESSEDD',\n '8TQO_2': 
'MEFQAVVMAVGGGSRMTDLTSSIPKPLLPVGNKPLIWYPLNLLERVGFEEVIVVTTRDVQKALCAEFKMKMKPDIVCIPDDADMGTADSLRYIYPKLKTDVLVLSCDLITDVALHEVVDLFRAYDASLAMLMRKGQDSIEPVPGQKGKKKAVEQRDFIGVDSTGKRLLFMANEADLDEELVIKGSILQKHPRIRFHTGLVDAHLYCLKKYIVDFLMENGSITSIRSELIPYLVRKQFSSASSQQGQEEKEEDLKKKELKSLDIYSFIKEANTLNLAPYDACWNACRGDRWEDLSRSQVRCYVHIMKEGLCSRVSTLGLYMEANRQVPKLLSALCPEEPPVHSSAQIVSKHLVGVDSLIGPETQIGEKSSIKRSVIGSSCLIKDRVTITNCLLMNSVTVEEGSNIQGSVICNNAVIEKGADIKDCLIGSGQRIEAKAKRVNEVIVGNDQLMEI',\n '8TQO_3': 'MHHHHHHGGGSENLYFQSPGSAAKGSELSERIESFVETLKRGGGPRSSEEMARETLGLLRQIITDHRWSNAGELMELIRREGRRMTAAQPSETTVGNMVRRVLKIIREEYGRLHGRSDESDQQESLHKLLTSGGLNEDFSFHYAQLQSNIIEAINELLVELEGTMENIAAQALEHIHSNEVIMTIGFSRTVEAFLKEAARKRKFHVIVAECAPFCQGHEMAVNLSKAGIETTVMTDAAIFAVMSRVNKVIIGTKTILANGALRAVTGTHTLALAAKHHSTPLIVCAPMFKLSPQFPNEEDSFHKFVAPEEVLPFTEGDILEKVSVHCPVFDYVPPELITLFISNIGGNAPSYIYRLMSELYHPDDHVL',\n '8TQO_4': 'MAAVAVAVREDSGSGMKAELPPGPGAVGREMTKEEKLQLRKEKKQQKKKRKEEKGAEPETGSAVSAAQCQVGPTRELPESGIQLGTPREKVPAGRSKAELRAERRAKQEAERALKQARKGEQGGPPPKASPSTAGETPSGVKRLPEYPQVDDLLLRRLVKKPERQQVPTRKDYGSKVSLFSHLPQYSRQNSLTQFMSIPSSVIHPAMVRLGLQYSQGLVSGSNARCIALLRALQQVIQDYTTPPNEELSRDLVNKLKPYMSFLTQCRPLSASMHNAIKFLNKEITSVGSSKREEEAKSELRAAIDRYVQEKIVLAAQAISRFAYQKISNGDVILVYGCSSLVSRILQEAWTEGRRFRVVVVDSRPWLEGRHTLRSLVHAGVPASYLLIPAASYVLPEVSKVLLGAHALLANGSVMSRVGTAQLALVARAHNVPVLVCCETYKFCERVQTDAFVSNELDDPDDLQCKRGEHVALANWQNHASLRLLNLVYDVTPPELVDLVITELGMIPCSSVPVVLRVKSSDQ',\n '8PX3_1': 'MDIDPYKEFGATVELLSFLPSDFFPSVRDLLDTASALYREALESPEHCSPHHTALRQAILCWGELMTLATWVGVNLEDPASRDLVVSYVNTNMGLKFRQLLWFHISCLTFGRETVIEYLVSFGVWIRTPPAYRPPNAPILSTLPETTVVRRRGRSPRRRTPSPRRRRSQSPRRRRSQSRESQC',\n ...}<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Getting better performance from the data API<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Even accounting for the necessary rate limit in submitting requests to the data API, I was puzzled as to why a fairly simple query should be taking as long as 45 seconds. 
We can delve a little into what is happening inside <code><a href=\"https:\/\/github.com\/rcsb\/py-rcsb-api\/blob\/7e0fd8ac9ddd1e6e5db6addc82f39227cfdeb25c\/rcsbapi\/data\/data_query.py#L187-L202\">rcsbapi.data.DataQuery.exec<\/a><\/code>. It uses <code>asyncio<\/code>, so I would expect it to be fairly performant, but it feels more sluggish than it should. Without digging into the details or profiling it properly, I have a suspicion that the way it implements the rate limiter using an <code>asyncio.Semaphore<\/code> may not be very efficient. To see if we can do better, let&#8217;s cannibalise <code>DataQuery.exec<\/code> and implement our own stripped-down version, taking advantage of the <a href=\"https:\/\/midnighter.github.io\/httpx-limiter\/stable\/\"><code>httpx-limiter<\/code> package<\/a> to do the rate limiting instead.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For completeness, and so that you can take this snippet of code and run it yourself as a stand-alone script, I&#8217;ve included the search API call and the holdings data query to the REST data API that we made earlier.  If you have <code><a href=\"https:\/\/docs.astral.sh\/uv\/\">uv<\/a><\/code> installed, you can just make this script executable and call it.  
The script is also available as a GitHub gist <a href=\"https:\/\/gist.github.com\/benjaminhwilliams\/14177da7024e8518af10cbebb40b987b\">here<\/a>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">#!\/usr\/bin\/env -S uv run --script --no-project\n#\n# \/\/\/ script\n# requires-python = \">=3.12\"\n# dependencies = [\n#   \"httpx\",\n#   \"httpx-limiter\",\n#   \"rcsb-api\",\n#   \"tqdm\",\n# ]\n# \/\/\/\n\nimport asyncio\nimport datetime\nfrom collections.abc import Iterable, Sized\nfrom functools import partial\nfrom itertools import batched\nfrom typing import Any, Protocol, TypeVar\n\nimport httpx\nfrom httpx_limiter.async_rate_limited_transport import AsyncRateLimitedTransport\nfrom httpx_limiter.rate import Rate\nfrom rcsbapi.config import config\nfrom rcsbapi.const import const\nfrom rcsbapi.data import DATA_SCHEMA\nfrom rcsbapi.search import search_attributes\nfrom tqdm import tqdm\n\nT = TypeVar(\"T\", covariant=True)\n\n\nclass SizedIterable(Iterable[T], Sized, Protocol):\n    \"\"\"Type spec for an iterable that also has length.\"\"\"\n\n\n# RCSB data REST API endpoint.\nDATA_REST_ENDPOINT = \"https:\/\/data.rcsb.org\/rest\/v1\/\"\n\n# RCSB data GraphQL API rate limit.\nrate = Rate.create(config.DATA_API_REQUESTS_PER_SECOND)\n# RCSB data GraphQL API concurrent connection limits.\nlimits = httpx.Limits(\n    max_connections=config.DATA_API_MAX_CONCURRENT_REQUESTS,\n    max_keepalive_connections=config.DATA_API_MAX_CONCURRENT_REQUESTS,\n)\n\n\ndef search_proteins_this_year() -> set[str]:\n    \"\"\"\n    Get all protein entries published in the PDB this year.\n    \"\"\"\n    # Query for protein entries.\n    proteins = search_attributes.entity_poly.rcsb_entity_polymer_type == \"Protein\"\n    # Query for this year's entries.\n    this_year = 
datetime.date(datetime.date.today().year, 1, 1).isoformat()\n    entries_this_year = search_attributes.rcsb_accession_info.revision_date >= this_year\n\n    # Query for this year's protein entries.\n    search_query = proteins &amp; entries_this_year\n\n    # Submit the query.\n    session = search_query()\n    # Collect the paginated response.\n    session_with_progress = tqdm(\n        session, total=session.count, unit=\"entries\", desc=\"Fetching PDB entries\"\n    )\n\n    return set(session_with_progress)\n\n\ndef removed_entries() -> set[str]:\n    \"\"\"\n    Find the IDs of all entries ever to have been removed from the PDB.\n    \"\"\"\n    query = httpx.get(DATA_REST_ENDPOINT + \"holdings\/removed\/entry_ids\", timeout=3.05)\n    return set(query.json())\n\n\nasync def data_query(\n    input_ids: list[str],\n    input_type: str,\n    return_data_list: list[str],\n    client: httpx.AsyncClient,\n) -> list[dict[str, Any]]:\n    \"\"\"\n    A coroutine that constructs a GraphQL query and posts it to the data API server.\n\n    args:\n        input_ids:    A list of ID strings of objects to request.\n        input_type:   The PDB object type defining the entry level of the schema graph.\n        return_data_list:  The requested data nodes.\n        client:       A `httpx` client context.\n    \"\"\"\n    # We need not write our own GraphQL query \u2014 who has time to learn the syntax?\n    # We can just use the method `DATA_SCHEMA.construct_query` instead.\n    query = DATA_SCHEMA.construct_query(\n        input_type=input_type,\n        input_ids=input_ids,\n        return_data_list=return_data_list,\n    )\n    # POST the query to the data API endpoint, using the RCSB default API timeout.\n    response: httpx.Response = await client.post(\n        const.DATA_API_ENDPOINT, json=query, timeout=config.API_TIMEOUT\n    )\n    # Check we have not received an HTTP error.\n    response.raise_for_status()\n    # Get the result of the query.\n    result: dict[str, 
dict[str, list[dict[str, Any]]]] = response.json()\n    # Raise any and all GraphQL errors as a single `ValueError`.\n    if errors := result.get(\"errors\"):\n        raise ValueError(\n            \"\\n\".join(f\"{i}: {e['message']}\" for i, e in enumerate(errors, 1))\n        )\n\n    # Extract and return the requested result data.\n    return result[\"data\"][input_type]\n\n\nasync def data_query_with_limits(\n    input_type: str,\n    input_ids: SizedIterable[str],\n    return_data_list: list[str],\n) -> list[dict[str, Any]]:\n    \"\"\"\n    A coroutine that posts a GraphQL query while respecting rate and size limits.\n\n    Queries are split into subqueries, ensuring that each subquery requests no more\n    input IDs than the API server's batch size limit.  Subqueries are staggered, to\n    abide by the server's rate limit.\n\n    args:\n        input_type:   The PDB object type defining the entry level of the schema graph.\n        input_ids:    A sized iterable of ID strings of objects to request.\n        return_data_list:  The requested data nodes.\n    \"\"\"\n    # Submit the `input_ids` in batches so as not to overwhelm the RCSB data API server.\n    batches = map(list, batched(input_ids, config.DATA_API_BATCH_ID_SIZE))\n\n    # Apply connection rate and concurrency limits.\n    transport = AsyncRateLimitedTransport.create(rate)\n    async with httpx.AsyncClient(transport=transport, limits=limits) as c:\n        c.headers.update({\"Accept\": \"application\/json\"})\n\n        query = partial(\n            data_query,\n            input_type=input_type,\n            return_data_list=return_data_list,\n            client=c,\n        )\n\n        # A sort of asynchronous `map(query, batches)`.\n        queries = (asyncio.create_task(query(batch)) for batch in batches)\n\n        # Display a progress bar.\n        with tqdm(\n            total=len(input_ids),\n            unit=input_type,\n            desc=f\"Fetching requested data from PDB {input_type}\",\n        ) 
as progress:\n            results: list[dict[str, Any]] = []\n            # Aggregate the results of all the queries, run concurrently within limits.\n            for task in asyncio.as_completed(queries):\n                result = await task\n                results.extend(result)\n                progress.update(len(result))\n\n    return results\n\n\nif __name__ == \"__main__\":\n    # Get all PDB protein entries released this year and not yet removed.\n    print(\"Finding this year's protein entries in the PDB.\")\n    proteins_this_year = search_proteins_this_year()\n    proteins_this_year -= removed_entries()\n\n    # Get entity ID and sequence for each polymer entity in this year's protein entries.\n    return_data_list = [\n        \"polymer_entities.rcsb_id\",\n        \"polymer_entities.entity_poly.pdbx_seq_one_letter_code_can\",\n    ]\n    print(\"Getting the sequences of each polymer entity in each PDB entry.\")\n    main = data_query_with_limits(\"entries\", proteins_this_year, return_data_list)\n    seqs_by_entry = asyncio.run(main)\n\n    seqs: dict[str, str] = {\n        entity[\"rcsb_id\"]: seq\n        for entry in seqs_by_entry\n        for entity in entry[\"polymer_entities\"]\n        if len(seq := entity[\"entity_poly\"][\"pdbx_seq_one_letter_code_can\"]) >= 70\n    }<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That yields the same output as the <code>rcsbapi.data.DataQuery<\/code> approach, with roughly a 5\u00d7 speed-up.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Finding this year's protein entries in the PDB.\nFetching PDB entries: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 24087\/24087 [00:01&lt;00:00, 15481.79entries\/s]\nGetting the sequences of each polymer entity in each PDB entry.\nFetching requested data 
from PDB entries: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 24087\/24087 [00:08&lt;00:00, 2725.26entries\/s]<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Worldwide Protein Data Bank (wwPDB or just the PDB to its friends) is a key resource for structural biology, providing a single central repository of protein and nucleic acid structure data. Most researchers interact with the PDB either by downloading and parsing individual entries as mmCIF files (or as legacy PDB files), or by [&hellip;]<\/p>\n","protected":false},"author":137,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[29,361,341,296,14,228,221,227,15],"tags":[596,648],"ppma_author":[874],"class_list":["post-13020","post","type-post","status-publish","format-standard","hentry","category-code","category-data-science","category-databases","category-hints-and-tips","category-howto","category-protein-structure","category-python","category-python-code","category-technical","tag-pdb","tag-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":874,"user_id":137,"is_guest":0,"slug":"ben","display_name":"Ben 
Williams","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/020ea273be8638c64bf77c36493144bb0116ead71fae7fa3c4f95093a9d81da9?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Williams","first_name":"Ben","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13020","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/137"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13020"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13020\/revisions"}],"predecessor-version":[{"id":13086,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13020\/revisions\/13086"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=13020"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13020"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13020"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13020"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}