I will not lie: I often struggle to find a snippet of code that did something in PyRosetta or I spend hours facing a problem caused by something not working as I expect it to. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGTP5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.
Import
For easy copypasting, let’s start PyRosetta and import things. As said in previous post I am not keen on star imports, because they are not meant to be used indiscriminately and can cause issues —a common pythonic example is when someone mixes collections.Counter and typing.Counter.
import pyrosetta
import pyrosetta_help as ph
from types import ModuleType
# Better than star imports:
prc: ModuleType = pyrosetta.rosetta.core
prp: ModuleType = pyrosetta.rosetta.protocols
pru: ModuleType = pyrosetta.rosetta.utility
prn: ModuleType = pyrosetta.rosetta.numeric
pr_conf: ModuleType = pyrosetta.rosetta.core.conformation
pr_scoring: ModuleType = pyrosetta.rosetta.core.scoring
pr_res: ModuleType = pyrosetta.rosetta.core.select.residue_selector
pr_options: ModuleType = pyrosetta.rosetta.basic.options
logger = ph.configure_logger()
pyrosetta.distributed.maybe_init(extra_options=ph.make_option_string(no_optH=False,
ex1=None,
ex2=None,
ignore_unrecognized_res=True,
load_PDB_components=False,
ignore_waters=True,
)
)
Load from a PDB file:
pyrosetta.pose_from_file('👾👾👾.pdb')
or from string
pose = pyrosetta.Pose() prc.import_pose.pose_from_pdbstring(pose, 'ATOM 👾👾👾...')
or from a mmCIF file. Note this will not work straight from PyMol because an extra entry at end is needed as discussed previously. I should say working with mmCIF is uncommon, but does have some nice metadata handling advantages.
pose: pyrosetta.Pose = prc.import_pose.pose_from_file('👾👾👾.cif', read_fold_tree=True, type=prc.import_pose.FileType.CIF_file)
There are several other pose importers. Another useful one, for threading especially, is pyrosetta.pose_from_sequence is rather helpful —just watch out for the eye of Sauron when threading.
Extra complication happens when dealing with residue types that are not in the database. These need to be added to the pose before loading the file.
As always with PyRosetta, if you pass an illegal value, say prc.import_pose.pose_from_file(Exception) you’ll get a TypeError that will tell you the accepted arguments of the overloaded function, which sometimes differs from the help(fun) info.
A new residue type can be added from a params file or from memory. The official way to create a params file is from the Python 2.7 script from the Rosetta download, but that was not good for my purposes so I wrote a new converter, the rdkit_to_params module, which I also made into a web app due to popular demand.
A params file can be passed via the initialisation options (-extra_res_fa) or via the following:
pru: ModuleType = pyrosetta.rosetta.utility params_filenames: List[str] = ... pose: pyrosetta.Pose = ... params_vector = pru.vector1_string() params_vector.extend([f for f in params_filenames if f]) pyrosetta.generate_nonstandard_residue_set(pose, params_vector)
Having added a residue type does not mean a residue was added to the pose. For that you’d need to do:
prc: ModuleType = pyrosetta.rosetta.core resiset: prc.chemical.ResidueTypeSet = pyrosetta.generate_nonstandard_residue_set(pose, params_vector) new_res = prc.conformation.ResidueFactory.create_residue( resiset.name_map( name ) ) pose.append_residue_by_jump(new_res, pose.num_jump() + 1)
In the rdkit-to-params module, Params.add_residuetype adds a params but from a string basically.
def add_residuetype(self, pose: pyrosetta.Pose, params_block: str, name3: str, reset:bool=True) \
-> pyrosetta.rosetta.core.chemical.ResidueTypeSet:
"""
Adds the params to a copy of a residue type set of a pose.
If reset is True it will also save it as the default RTS —it keeps other custom residue types.
"""
rts = pose.conformation().modifiable_residue_type_set_for_conf(prc.chemical.FULL_ATOM_t)
buffer = pyrosetta.rosetta.std.stringbuf(params_block)
stream = pyrosetta.rosetta.std.istream(buffer)
new = prc.chemical.read_topology_file(stream,
name3,
rts)
rts.add_base_residue_type(new)
if reset:
pose.conformation().reset_residue_type_set_for_conf(rts)
return rts
Connections, either polymeric (LOWER and UPPER) or otherwise (CONN1 etc.) are a topic that would require its own blog post, so I won’t cover it here.
PDBInfo
PDB information (residue numbering, chain, occupancy and more) is stored in a pyrosetta.rosetta.core.pose.PDBInfo instance attached to the pose (accessible via .pdb_info() method, which returns the instance attached and not a copy).
This very handy. Two of its methods are pose2pdb(res=👾) and pdb2pose(chain=👾, res=👾) which allows one to jump between the two standards. A nice thing is that it does not segfault when a pose residue index that does not exist is called (RuntimeError). If the a PDB residue is request that does not exist it will return 0.
A parenthetical warning — one of its methods is .remarks(), which is a no go. In the PDB format, REMARK is basically a comment that has really strict tools-specific grammar and is a total mess. The .remarks() does not work in PyRosetta and segfaults even when the file loaded did indeed have REMARK lines. And passing a pyrosetta.rosetta.core.io.Remarks instance will not work nor will appending to a Remarks or even making one:
remarks = pyrosetta.rosetta.core.io.Remarks() remark = pyrosetta.rosetta.core.io.RemarkInfo() remark.value = 'Hello world' remarks.append(remark) # ValueError: vector
So were metadata needed to be added to a file, editing the PDB block might be a better bet —editing the mmCIF dictionary data is even betterer. In a pinch appending comment lines to a PDB helps too (but don’t tell anyone).
Certain operations (RemodelMover, grafting etc.) will make the PDB information obsolete or make it get lost.
The Pose class has a method split_by_chain which returns the chains (as defined in the fold tree) preserving the PDBInfo, but the function (external to pyrosetta.Pose) pyrosetta.rosetta.core.pose.append_pose_to_pose does not, so it needs correcting:
def add_chain(built: pyrosetta.Pose, new: pyrosetta.Pose, reset: bool = False) -> None:
"""
Add a chain ``new`` to a pose ``built`` preserving the residue numbering.
:param built: this is the pyrosetta.Pose that will be built into...
:param new: the addendum
:param reset: resets the PDBInfo for the chain present to A
:return:
"""
built_pi = built.pdb_info()
if built_pi is None or reset:
built_pi = prc.pose.PDBInfo(built)
built.pdb_info(built_pi)
for r in range(1, built.total_residue() + 1):
built_pi.set_resinfo(res=r, chain_id='A', pdb_res=r)
for chain in new.split_by_chain():
offset: int = built.total_residue()
pyrosetta.rosetta.core.pose.append_pose_to_pose(built, chain, new_chain=True)
# the new `built` residues will not have PDBinfo
chain_pi = chain.pdb_info()
for r in range(1, chain.total_residue() + 1):
built_pi.set_resinfo(res=r + offset, chain_id=chain_pi.chain(r), pdb_res=chain_pi.number(r))
built_pi.obsolete(False)
The functions pyrosetta.rosetta.protocols.grafting.delete_region or pyrosetta.rosetta.protocols.grafting.return_region do not impact the PDBinfo.
Given immutation start sequence one can fix the chain infomation thusly:
def fix_starts(pose: pyrosetta.Pose, chain_letters: str, start_seqs: List[str]):
"""
Fix the chains in place
.. code-block:: python
strep_seq = 'MEAGIT'
start_seqs = ['MKIYY', strep_seq, strep_seq, 'GEFAR', strep_seq, strep_seq, 'FKDET']
fix_starts(pose, chain_letters='ACDEFGB', start_seq=start_seq)
:param pose:
:param chain_letters:
:param start_seq: Confusingly, the first is ignored: the start of the pose is the start of the first chain.
:return:
"""
pi = pose.pdb_info()
seq = pose.sequence()
seq_iter = iter(start_seqs[1:]+[None])
chain_iter = iter(chain_letters)
start_idx = 1
while True:
this_chain = next(chain_iter)
next_seq = next(seq_iter)
if next_seq is None:
for i in range(start_idx, len(seq)+1):
pi.chain(i, this_chain)
break
else:
next_start = seq.find(next_seq, start_idx) + 1
for i in range(start_idx, next_start):
pi.chain(i, this_chain)
start_idx = next_start
pose.update_pose_chains_from_pdb_chains()
assert pose.num_chains() == len(chain_letters), f'{pose.num_chains()} != {len(chain_letters)}'
Threading set-up
The next steps address the fact, that the backbone skeleton lacks sidechains and needs to be threaded with a sequence generated with proteinMPNN. When one imports a pose with missing sidechains into PyRosetta, these get added. They are good, but not perfect, so copying the coordinates of the conserved sidechains is a must. A key move is superposition.
Superposition
First, a silly pedantic lexical side note. When one has a sequence of elements and one spaces out the elements so they match, that is called to align. When one places one object over another but letting one still be visible, say with a stagger like the traces in a joy plot or translating without rotating (e.g. ▽+△=✡︎) that is called to superimpose, when one rototranslates fully (e.g. ▽+△↺=▽), that is called to superpose. Just to go through both words, below I will the align sequences, to superposed structures. Although, in this specific case, the metadata has the information so best use that.
There are actually several functions that allow superpositions, including presets for CA atoms or all atoms and so forth. The above is customisable as you have to tell it what atom goes to which.
In PyRosetta one can superpose two poses with the following:
mobile: pyrosetta.Pose = ... ref: pyrosetta.Pose = ... atom_map: prs.map_core_id_AtomID_core_id_AtomID = ... # this is a hash-mapping of mobile `pyrosetta.AtomID` to reference `pyrosetta.AtomID` rmsd: float = pr_scoring.superimpose_pose(mod_pose=mobile, ref_pose=ref, atom_map=atom_map)
If pyrosetta.AtomID seems familiar, that is because for constraints one has to play around with them a lot. So where we to align two poses by the CA atoms of a specified PDB chain that ought to be common, we can do:
prc: ModuleType = pyrosetta.rosetta.core
prp: ModuleType = pyrosetta.rosetta.protocols
pru: ModuleType = pyrosetta.rosetta.utility
prn: ModuleType = pyrosetta.rosetta.numeric
prs: ModuleType = pyrosetta.rosetta.std
pr_conf: ModuleType = pyrosetta.rosetta.core.conformation
pr_scoring: ModuleType = pyrosetta.rosetta.core.scoring
pr_res: ModuleType = pyrosetta.rosetta.core.select.residue_selector
pr_options: ModuleType = pyrosetta.rosetta.basic.options
def superpose_pose_by_chain(pose, ref, chain: str, strict: bool=True) -> float:
"""
superpose by PDB chain letter
:param pose:
:param ref:
:param chain:
:return:
"""
atom_map = prs.map_core_id_AtomID_core_id_AtomID()
chain_sele: pr_res.ResidueSelector = pr_res.ChainSelector(chain)
r: int # reference pose residue number (Fortran)
m: int # mobile pose residue number (Fortran)
for r, m in zip(pr_res.selection_positions(chain_sele.apply(ref)),
pr_res.selection_positions(chain_sele.apply(pose))
):
if strict:
assert pose.residue(m).name3() == ref.residue(r).name3(), 'Mismatching residue positions!'
ref_atom = pyrosetta.AtomID(ref.residue(r).atom_index("CA"), r)
mobile_atom = pyrosetta.AtomID(pose.residue(m).atom_index("CA"), m)
atom_map[mobile_atom] = ref_atom
return pr_scoring.superimpose_pose(mod_pose=pose, ref_pose=ref, atom_map=atom_map)
So what operations move the pose? A lot. The main two to keep an eye out for are RFdiffusion itself and Relax and its pesky drift.
TRB file
Now that we know how to superpose two poses, we need to figure out which are the conserved ones. To determine what is conserved in a RFdiffusion run, the pickle .trb file is a good option, as it will have some good information. If this file was deleted or similar, then see alignment below. The designed protein, is termed the hallucination (technically the field is moving away from that term, but it is a nice one), while the parent template, the reference. The dictionary in the .trb file will have several keys some in the pattern [complex_/]con_[ref/hal]_[pdb_idx/idx0]. In RFdiffusion, one designs only chain A and all other chains become chain B: complex_ includes chain B. So to get the mapping of the conserved residues is as easy as dict(zip(trb['complex_con_ref_idx0'], trb['complex_con_hal_idx0'])). inpaint_seq is a sequence of booleans spanning the length of the complex.
Digression: Alignment
Sequence alignment comes into play when superposing protein by the conserved regions —outside of RFdiffusion with the trb file. For that Bio.Align.PairwiseAligner can be used (since pairwise2 was deprecated).
In the default settings, the character ‘-‘ in the input sequence is treated like a regular character and not a free gap. Were one hellbent to do so a custom substitution matrix would be needed.
However, if the poses to superpose differed by a span that any identity is coincidental, such as a redesigned part of a domain, then we need to strongly penalise mismatches and not penalise gap extensions.
from Bio.Align import PairwiseAligner, Alignment, substitution_matrices
def superpose_by_common_alignment(mobile: pyrosetta.Pose, ref: pyrosetta.Pose) -> float:
"""
Pairwise alignment of the sequences of the poses.
This time only spans that are common.
return (ref_index, mobile_index)
:param mobile:
:param ref:
:return:
"""
# ## align
aligner = PairwiseAligner()
# We don't want to get mismatches aligned, so no to:
# aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.internal_gap_score = -10
aligner.extend_gap_score = -0.01
aligner.end_gap_score = -0.01
ref_seq: str = ref.sequence()
pose_seq: str = mobile.sequence()
aln: Alignment = aligner.align(ref_seq, pose_seq)[0]
# like before but with an extra condition `ref_seq[t] == pose_seq[q]`
aln_map: Dict[int, int] = {t: q for t, q in zip(aln.indices[0], aln.indices[1]) if
q != -1 and t != -1 and ref_seq[t] == pose_seq[q]}
# ## make pyrosetta atom map
...
# return RMSD
return pr_scoring.superimpose_pose(mod_pose=mobile, ref_pose=ref, atom_map=atom_map)
In the case that one end does not matter in the sequence, there are attributes that control the left (N-terminal) an right (C-terminal) side gaps. For example, the attributes target_right_gap_score and target_right_extend_gap_score could be set to zero, which make the C-terminal difference an inkshed of gaps.
Were one to want align to homologues, this would be fine:
from Bio.Align import PairwiseAligner, Alignment, substitution_matrices
def superpose_by_alignment(mobile: pyrosetta.Pose, ref: pyrosetta.Pose) -> float:
"""
Pairwise alignment of the sequences of the poses.
return (ref_index, mobile_index)
:param mobile:
:param ref:
:return:
"""
# ## align
aligner = PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
ref_seq: str = ref.sequence()
pose_seq: str = mobile.sequence()
aln: Alignment = aligner.align(ref_seq, pose_seq)[0]
# `aln.indices` has the mapping
# an index of -1 is a map to a gap
aln_map: Dict[int, int] = {t: q for t, q in zip(aln.indices[0], aln.indices[1]) if
q != -1 and t != -1}
# ## make pyrosetta atom map
# for the purpose of explanation the following block is not in its own function, but will be repeated in a minute.
# where these all part of the same code, the repeated part would need to be its own function as it is very modular anyway!
atom_map = prs.map_core_id_AtomID_core_id_AtomID()
for r, m in aln_map.items():
ref_atom = pyrosetta.AtomID(ref.residue(r + offset).atom_index("CA"), r + offset)
mobile_atom = pyrosetta.AtomID(mobile.residue(m + offset).atom_index("CA"), m + offset)
atom_map[mobile_atom] = ref_atom
# ## superpose and return RMSD
return pr_scoring.superimpose_pose(mod_pose=mobile, ref_pose=ref, atom_map=atom_map)
Stealing sidechains
One critical detail to keep in mind is that the RFdiffusion/hallucination is a backbone skeleton without sidechains. When this is imported, PyRosetta will add the missing atoms, sequentially by residue. Intriguingly, the sidechains match somewhat where they should be, but the match is not perfect and the potential energy is worse (higher). As a result we need to steal the sidechains from the template. To do this we need to be careful with variable names as the superposition runs on the words ‘ref’ and ‘mobile’, while the trb metadata runs off ‘hal’ and ‘ref’.
def steal_frozen(acceptor: pyrosetta.Pose,
donor: pyrosetta.Pose, trb: Dict[str, Any],
move_acceptor: bool = False
):
"""
Copy all the conserved coordinates from the donor to the acceptor.
These are determined by ``trb`` dict from RFdiffusion.
This is similar to ``prp.comparative_modeling.StealSideChainsMover``,
but does not require the threading infrastructure.
The hallucination is the acceptor, the parent is the donor.
The RFdiffused pose is skeleton, but when imported the sidechains are added.
The theft is done in 3 steps.
1. A mapping of residue idx, atom idx to residue idx, atom idx is made.
2. The hallucination is superimposed on the parent if move_acceptor is True, else vice versa.
3. The coordinates are copied from the parent to the hallucination.
The term 'ref' gets confusing.
hallucination is fixed / mutanda, parent is mobile
fixed is called ref, but ref means parent for RFdiffusion so it is flipped)
:param acceptor:
:param donor:
:param trb:
:return:
"""
# ## Make mapping of all atoms of conserved residues
donor2acceptor_idx1s: Dict[Tuple[FTypeIdx, FTypeIdx], Tuple[FTypeIdx, FTypeIdx, str]] = {}
# these run off 0-based indices
for donor_res_idx0, acceptor_res_idx0 in zip(trb['complex_con_ref_idx0'], trb['complex_con_hal_idx0']):
donor_res_idx1 = donor_res_idx0 + 1
acceptor_res_idx1 = acceptor_res_idx0 + 1
acceptor_res = acceptor.residue(acceptor_res_idx1)
donor_res = donor.residue(donor_res_idx1)
assert donor_res.name3() == acceptor_res.name3(), f'donor {donor_res.name3()} != acceptor {acceptor_res.name3()}'
mob_atomnames = [donor_res.atom_name(ai1) for ai1 in range(1, donor_res.natoms() + 1)]
for fixed_atm_idx1 in range(1, acceptor_res.natoms() + 1):
aname = acceptor_res.atom_name(fixed_atm_idx1) # key to map one to other: overkill bar for HIE/HID
if aname not in mob_atomnames:
print(f'Template residue {donor_res.annotated_name()}{donor_res_idx1} lacks atom {aname}')
continue
donor_atm_idx1 = donor_res.atom_index(aname)
donor2acceptor_idx1s[(donor_res_idx1, donor_atm_idx1)] = (acceptor_res_idx1, fixed_atm_idx1, aname)
# ## Superpose
atom_map = prs.map_core_id_AtomID_core_id_AtomID()
if move_acceptor:
mobile: pyrosetta.Pose = acceptor
fixed: pyrosetta.Pose = donor
else:
mobile: pyrosetta.Pose = donor
fixed: pyrosetta.Pose = acceptor
for (donor_res_idx1, donor_atm_idx1), (acceptor_res_idx1, acceptor_atm_idx1, aname) in donor2acceptor_idx1s.items():
if move_acceptor:
mob_res_idx1, mob_atm_idx1 = acceptor_res_idx1, acceptor_atm_idx1
fixed_res_idx1, fixed_atm_idx1 = donor_res_idx1, donor_atm_idx1
else:
mob_res_idx1, mob_atm_idx1 = donor_res_idx1, donor_atm_idx1
fixed_res_idx1, fixed_atm_idx1 = acceptor_res_idx1, acceptor_atm_idx1
if aname.strip() not in ('N', 'CA', 'C', 'O'): # BB superpositon
continue
fixed_atom = pyrosetta.AtomID(fixed_atm_idx1, fixed_res_idx1)
mobile_atom = pyrosetta.AtomID(mob_atm_idx1, mob_res_idx1)
atom_map[mobile_atom] = fixed_atom
rmsd = prc.scoring.superimpose_pose(mod_pose=mobile, ref_pose=fixed, atom_map=atom_map)
# I am unsure why this is not near zero but around 0.1–0.3
assert rmsd < 1, f'RMSD {rmsd} is too high'
# ## Copy coordinates
to_move_atomIDs = pru.vector1_core_id_AtomID()
to_move_to_xyz = pru.vector1_numeric_xyzVector_double_t()
for (donor_res_idx1, donor_atm_idx1), (acceptor_res_idx1, acceptor_atm_idx1, aname) in donor2acceptor_idx1s.items():
# if aname in ('N', 'CA', 'C', 'O'): # BB if common
# continue
# this does not stick: fixed_res.set_xyz( fixed_ai1, mob_res.xyz(mob_ai1) )
to_move_atomIDs.append(pyrosetta.AtomID(acceptor_atm_idx1, acceptor_res_idx1))
to_move_to_xyz.append(donor.residue(donor_res_idx1).xyz(donor_atm_idx1))
acceptor.batch_set_xyz(to_move_atomIDs, to_move_to_xyz)
# ## Fix HIE/HID the brutal way
v = prc.select.residue_selector.ResidueNameSelector('HIS').apply(acceptor)
relax = prp.relax.FastRelax(pyrosetta.get_score_function(), 1)
movemap = pyrosetta.MoveMap()
movemap.set_bb(False)
movemap.set_chi(v)
movemap.set_jump(False)
relax.apply(acceptor)
return rmsd, donor2acceptor_idx1s
Once in a while, I get an email asking me about HIE/HID, because I mention it somewhere, so I am reticent to address a wee bit of code there: the HIS tautomerisation. In Rosetta, the HID tautomer is a patch on HIS, (residue.annotated_name() == 'H[HIS_D]'. For a specific change one can mutate the residue to alanine, then the patched/unpatched residue (cf. §Thread). There are tasks that can added to packers to sample the his tautomers / HNQ flips, but the laziest way is to let Relax deal with it.
Thread
In PyRosetta one can mutate a residue via
prp: ModuleType = pyrosetta.rosetta.protocols prp.simple_moves.MutateResidue(1, 'ALA').apply(pose)
This allows mutations to not only base residues, but also patched residues, such as NtermProteinFull (with extra N-terminal proton) or acetylated etc. To see if in the database folder there is what you are after it is honestly easier simply looking in the folder:
import pyrosetta, itertools
from pathlib import Path
from itertools import chain
fa = 'database/chemical/residue_type_sets/fa_standard'
# residue types
print([p.name for p in (Path(pyrosetta.__file__).parent / fa / 'residue_types' ).glob('*/*.params')])
# params types
patches_folder = (Path(pyrosetta.__file__).parent / fa / 'patches' )
print([p.name for p in itertools.chain(patches_folder.glob('*.txt'), patches_folder.glob('*/*.txt'))])
Just be aware, that MutateResidue segfaults on error, so try the patch first and it is fiddly at first.
To remove the terminus patches one can do the following:
import pyrosetta
prc: ModuleType = pyrosetta.rosetta.core
pru: ModuleType = pyrosetta.rosetta.utility
def remove_terminus_patches(pose):
clean_template_pose = template_pose.clone()
prc.pose.remove_nonprotein_residues(pose)
### find
lowers = pru.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
uppers = pru.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
prc.sic_dock.get_termini_from_pose(pose, lowers, uppers)
### remove
for upper, _ in uppers:
prc.conformation.remove_upper_terminus_type_from_conformation_residue(clean_template_pose.conformation(), upper)
for lower, _ in lowers:
prc.conformation.remove_lower_terminus_type_from_conformation_residue(clean_template_pose.conformation(), lower)
However, mutating a whole pose based on a sequence is a pillar of homology modelling, namely threading. I have discussed in an other post threading and hybridisation in PyRosetta and most examples here are simply ported from a helper module of mine, so I’ll be brief. Threading in Rosetta requires a “Grishin file”, which is a specific form of pairwise sequence alignment file. The mover is ThreadingMover with the aid of StealSideChainsMover, namely:
prc: ModuleType = pyrosetta.rosetta.core prp: ModuleType = pyrosetta.rosetta.protocols clean_template_pose = remove_terminus_patches(pose) # no termini from earlier! target_pose = pyrosetta.Pose() prc.pose.make_pose_from_sequence(target_pose, target_sequence, 'fa_standard') ## Thread align = prc.sequence.read_aln(format='grishin', filename=aln_file)[1] threader = prp.comparative_modeling.ThreadingMover(align=align, template_pose=clean_template_pose) threader.apply(target_pose) ## Steal sidechains qt = threader.get_qt_mapping(target_pose) steal = prp.comparative_modeling.StealSideChainsMover(clean_template_pose, qt) steal.apply(target_pose)
The main point of the aforementioned post about threading was loop modelling by cannibalising AF2, which can be done with the following settings before using the ThreadingMover mover:
# optional set for loop modelling from reference poses
poses: List[pyrosetta.Pose] = ... # poses to cannibilise for loops
lengths: List[int] = [3,] # 3,6,9 are traditional choices
fragment_sets = pru.vector1_std_shared_ptr_core_fragment_FragSet_t(len(lengths))
for i, l in enumerate(lengths):
fragment_sets[i+1] = prc.fragment.ConstantLengthFragSet(l)
for pose in poses:
prc.fragment.steal_constant_length_frag_set_from_pose(pose, fragment_sets[i+1])
threader = ...
threader.build_loops(True)
threader.randomize_loop_coords(True) # default
threader.frag_libs(fragment_sets)
threader.apply(...)
As may be apparent, the ThreadingMover does not like non-base residues. So for ligand residues, one can just use append_subpose_to_pose function.
prc: ModuleType = pyrosetta.rosetta.core
pr_res: ModuleType = pyrosetta.rosetta.core.select.residue_selector
def steal_ligands(donor_pose, acceptor_pose) -> None:
"""
Steals non-Protein residues from donor_pose and adds them to acceptor_pose
Do not use with nucleic acid polymers.
:param donor_pose:
:param acceptor_pose:
:return:
"""
PROTEIN = prc.chemical.ResidueProperty.PROTEIN
prot_sele = pr_res.ResiduePropertySelector(PROTEIN)
not_sele = pr_res.NotResidueSelector(prot_sele)
rv = pr_res.ResidueVector(not_sele.apply(donor_pose))
# if it were DNA...
# for from_res, to_res in ph.rangify(rv):
# prc.pose.append_subpose_to_pose(acceptor_pose, donor_pose, from_res, to_res, True)
for res in rv:
prc.pose.append_subpose_to_pose(acceptor_pose, donor_pose, res, res, True)
Now back to threading. As mentioned, I just use my helper function to deal with the above:
import pyrosetta_help
def thread(template_block, target_seq, target_name, template_name,
temp_folder='/tmp'):
# load template
template = pyrosetta.Pose()
prc.import_pose.pose_from_pdbstring(template, template_block)
# thread
aln_filename = f'{temp_folder}/{template_name}-{target_name}.grishin'
ph.write_grishin(target_name=target_name,
target_sequence=target_seq,
template_name=template_name,
template_sequence=template.sequence(),
outfile=aln_filename
)
aln: prc.sequence.SequenceAlignment = prc.sequence.read_aln(format='grishin', filename=aln_filename)[1]
threaded: pyrosetta.Pose
threader: prp.comparative_modeling.ThreadingMover
threadites: pru.vector1_bool
threaded, threader, threadites = ph.thread(target_sequence=target_seq,
template_pose=template,
target_name=target_name,
template_name=template_name,
align=aln
)
# no need to superpose. It is already aligned
# superpose(template, threaded)
# fix pdb info
n = threaded.total_residue()
pi = prc.pose.PDBInfo(n)
for i in range(1, n + 1):
pi.number(i, i)
pi.chain(i, 'A')
threaded.pdb_info(pi)
return threaded
A pythonic parenthesis. Names of callable objects are, by good practice, verbs, however stuff gets confusing when a word can be both a noun and a verb: I have tripped up so often with thread and fragment. Hence the use of the past particle in the above. For fragment, which has initial-stress derivation I naughtily use fràgment and fragm. ènt
Relax
The workhorse of Rosetta is without question the FastRelax mover. It is very customisable, so most often there is no near to play with packers, Monte Carlo samplers etc. This is true for design mode. I won’t go into detail on the basics as this is covered in every tutorial.
scorefxn: pr_scoring.ScoreFunction = pyrosetta.get_fa_scorefxn()
# don't forget to set weights!
scorefxn.set_weight(pr_scoring.ScoreType.atom_pair_constraint, 3)
scorefxn.set_weight(pr_scoring.ScoreType.coordinate_constraint, 5)
cycles = 3 # more than 5 is probably not going to do much
relax = prp.relax.FastRelax(scorefxn, cycles)
# set movemap to freeze residues
# say repack the sidechains of chain A and its neighbours
# allow backbone movements for chain A only
movemap = pyrosetta.MoveMap()
rs: ModuleType = prc.select.residue_selector
chainA_sele: rs.ResidueSelector = rs.ChainSelector('A')
chainA: pru.vector1_bool = chainA_sele.apply(pose)
neigh_sele: rs.ResidueSelector = rs.NeighborhoodResidueSelector(chainA_sele, True, distance)
neighs: pru.vector1_bool = neigh_sele.apply(pose)
movemap.set_chi(neighs)
movemap.set_bb(chainA)
movemap.set_jump(False)
relax.set_movemap(movemap)
# go!
replax.apply(pose)
In the case of RFdiffusion the template ought to have been minimised otherwise either the scores will be for which case was minimised better if the other chains are also minimised (don’t) or the scores will be for cases where the designed binder interacts or not with a spuriously strained part of the binding partner.
In the above example the neighbourhood was moved as an example. If you do repack these sidechains, which is reasonable, do make sure to keep the complex and the original binding partner will no longer work.

https://michelanglo.sgc.ox.ac.uk/r/christmas
A cool thing once can do with PyRosetta is use electron design as a constraint (tutorial) —theoretically one could used RFdiffusion via blender API, gemmi and numpy manupulations to make custom shaped protein, but that would be for fun and not actually relevant here.
If there are covalent bond lengths that need fixing, the we have do it cartesian:
constraint_weight = 5
scorefxn: pr_scoring.ScoreFunction = pyrosetta.create_score_function('ref2015_cart')
scorefxn.set_weight(pr_scoring.ScoreType.coordinate_constraint, constraint_weight)
scorefxn.set_weight(pr_scoring.ScoreType.angle_constraint, constraint_weight)
scorefxn.set_weight(pr_scoring.ScoreType.atom_pair_constraint, constraint_weight)
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
relax.cartesian(True)
relax.minimize_bond_angles(True)
relax.minimize_bond_lengths(True)
relax.apply(pose)
In both example I allude to constraints. These are well covered and are set up in various way for example, given the pose index and atom name of the atoms we can do:
def constrain_distance(pose, fore_idx, fore_name, aft_idx, aft_name, x0_in=1.334, sd_in=0.2, tol_in=0.02, weight=1):
AtomPairConstraint = pr_scoring.constraints.AtomPairConstraint # noqa
fore = pyrosetta.AtomID(atomno_in=pose.residue(fore_idx).atom_index(fore_name),
rsd_in=fore_idx)
aft = pyrosetta.AtomID(atomno_in=pose.residue(aft_idx).atom_index(aft_name),
rsd_in=aft_idx)
fun = pr_scoring.func.FlatHarmonicFunc(x0_in=x0_in, sd_in=sd_in, tol_in=tol_in)
if weight != 1:
fun = pr_scoring.func.ScalarWeightedFunc(weight, fun)
con = AtomPairConstraint(fore, aft, fun)
pose.add_constraint(con)
return con
def constrain_angle(pose, fore_idx, fore_name, mid_name, mid_idx, aft_idx, aft_name, x0_in=109/180*3.14, sd_in=10/180*3.14, weight=1):
AngleConstraint = pr_scoring.constraints.AngleConstraint
fore = pyrosetta.AtomID(atomno_in=pose.residue(fore_idx).atom_index(fore_name),
rsd_in=fore_idx)
mid = pyrosetta.AtomID(atomno_in=pose.residue(mid_idx).atom_index(mid_name),
rsd_in=mid_idx)
aft = pyrosetta.AtomID(atomno_in=pose.residue(aft_idx).atom_index(aft_name),
rsd_in=aft_idx)
fun = pr_scoring.func.CircularHarmonicFunc(x0_radians=x0_in, sd_radians=sd_in)
if weight != 1:
fun = pr_scoring.func.ScalarWeightedFunc(weight, fun)
con = AngleConstraint(fore, mid, aft, fun)
pose.add_constraint(con)
return con
def constrain_position(pose: pyrosetta.Pose, target_index: int, ref_index: int, x0_in=0., sd_in=0.01):
ref_ca = pyrosetta.AtomID(atomno_in=pose.residue(ref_index).atom_index('CA'), rsd_in=ref_index)
target_ca = pyrosetta.AtomID(atomno_in=pose.residue(target_index).atom_index('CA'), rsd_in=frozen_index)
target_xyz = pose.residue(frozen_index).xyz(target_ca.atomno())
fun = pr_scoring.func.HarmonicFunc(x0_in=x0_in, sd_in=sd_in)
if weight != 1:
fun = pr_scoring.func.ScalarWeightedFunc(weight, fun)
con = pr_scoring.constraints.CoordinateConstraint(a1=target_ca, fixed_atom_in=ref_ca, xyz_target_in=target_xyz, func=fun, scotype=pr_scoring.ScoreType.coordinate_constraint)
pose.add_constraint(con)
def constrain_peptide_gap(pose, chain_break, x0_in=1.334, sd_in=0.2, tol_in=0.02):
"""
Close a gap between
"""
AtomPairConstraint = pr_scoring.constraints.AtomPairConstraint # noqa
fore_c = pyrosetta.AtomID(atomno_in=pose.residue(chain_break).atom_index('C'),
rsd_in=chain_break)
aft_n = pyrosetta.AtomID(atomno_in=pose.residue(chain_break + 1).atom_index('N'),
rsd_in=chain_break + 1)
fun = pr_scoring.func.FlatHarmonicFunc(x0_in=x0_in, sd_in=sd_in, tol_in=tol_in)
con = AtomPairConstraint(fore_c, aft_n, fun)
pose.add_constraint(con)
One thing to keep an eye out for is how healthy are the constraints after minimisation:
def show_cons(pose):
"""
print the score for the various constraints
"""
get_atomname = lambda atomid: pose.residue(atomid.rsd()).atom_name(atomid.atomno()).strip()
get_description = lambda atomid: f'{pose.residue(atomid.rsd()).name3()}{pi.pose2pdb(atomid.rsd()).strip().replace(" ", ":")}.{get_atomname(atomid)}'
for con in pose.constraint_set().get_all_constraints():
if con.type() == 'AtomPair':
print(con.type(), get_description(con.atom1()), get_description(con.atom2()), con.score(pose))
elif con.type() == 'Angle':
print(con.type(), get_description(con.atom1()), get_description(con.atom2()), get_description(con.atom3()), con.score(pose))
else:
print(con.type(), con.score(pose))
Design
One can do design with FastRelax. Due to the way the mover samples, it fixes sub-optimal residues one by one as opposed to large scale remodelling —for that Remodel might be better suitable (tutorial). Design is really useful for RFdiffusion as the latter can in essence be thought as coarse-grain so there may be some residue combinations that are subpar. Obviously, when FastDesign choses a residue it is based on the scorefunction, which is imperfect: a quirky example of this is its proclivity to change certain residues in Streptag when bound to streptavidin even though the former is the product of rounds and rounds of phage display.
To run design one needs to create a task factory for it onto which restrictions (operations) are pushed. Confusingly, a default task factory designs everything, while the default task factory in relax does not. The ProhibitSpecifiedBaseResidueTypes is really handy for preventing cysteine from making life hard and also for correcting surface charges —protein are not very soluble if the isoelectric point is close to pH 7.4 or aggregate if there are exposed patches. On the latter note, proteinMPNN when using the vanilla weights potentially might make inadvertently a transmembrane helix, so that is something to keep an eye out for.
def create_design_tf(pose:pyrosetta.Pose, design_sele: pr_res.ResidueSelector, distance:int) -> prc.pack.task.TaskFactory:
"""
Create design task factory for relax.
Designs the ``design_sele`` and repacks around ``distance`` of it.
Remember to do
... code-block:: python
relax.set_enable_design(True)
relax.set_task_factory(task_factory)
"""
# design is default, so this is not done:
# residues_to_design = design_sele.apply(pose)
# design_ops = prc.pack.task.operation.OperateOnResidueSubset(????, residues_to_design)
no_cys = pru.vector1_std_string(1)
no_cys[1] = 'CYS'
no_cys_ops = prc.pack.task.operation.ProhibitSpecifiedBaseResidueTypes(no_cys)
# No design, but repack
repack_sele = pr_res.NeighborhoodResidueSelector(design_sele, distance, False)
residues_to_repack = repack_sele.apply(pose)
repack_rtl = prc.pack.task.operation.RestrictToRepackingRLT()
repack_ops = prc.pack.task.operation.OperateOnResidueSubset(repack_rtl, residues_to_repack)
# No repack, no design
frozen_sele = pr_res.NotResidueSelector(pr_res.OrResidueSelector(design_sele, repack_sele))
residues_to_freeze = frozen_sele.apply(pose)
prevent_rtl = prc.pack.task.operation.PreventRepackingRLT()
frozen_ops = prc.pack.task.operation.OperateOnResidueSubset(prevent_rtl, residues_to_freeze)
# pyrosetta.rosetta.core.pack.task.operation.RestrictAbsentCanonicalAASRLT
# pyrosetta.rosetta.core.pack.task.operation.PreserveCBetaRLT
task_factory = prc.pack.task.TaskFactory()
task_factory.push_back(no_cys_ops)
task_factory.push_back(repack_ops)
task_factory.push_back(frozen_ops)
return task_factory
Favour native residues
A residue is changed because the variant has a better score than the wild type —technically within the metropolis criterion of the Monte Carlo sampling. Therefore it is possible for the sequence to drift near neutrally, which is not what we want here. To this effect the constraint ResidueTypeConstraint addresses this by favouring the native residue. This is a constraint, not a packer operation, and is controlled by the weight pyrosetta.rosetta.core.scoring.ScoreType.res_type_constraint. There are a least three ways to declare the constraint:
# native residues relative to pose object — the pose is used to get name3 and is not kept (so can change) # very much like the 'constrain to starting positions' in relax which creates a myriad coordinate constraints prc.scoring.constraints.ResidueTypeConstraint(pose, seqpos=1, native_residue_bonus=1) # or relative to user specified residue prc.scoring.constraints.ResidueTypeConstraint(seqpos=1, aa_in='A', name3_in='ALA', native_residue_bonus=1) # or using a wrapper for every residue in the pose prp.protein_interface_design.FavorNativeResidue(pose, 1) # all options require scorefxn.set_weight(pyrosetta.rosetta.core.scoring.ScoreType.res_type_constraint, 1)
The bonus and weight to add is 1 and 1 in this example, but on preliminary tests, I feel like 1 and 2 is a better call, but that is a discussion for another time.
The ProhibitSpecifiedBaseResidueTypes operation restricts against given amino acids, while to allow non-canonical amino acids a packer palette needs to be added (task_factory.set_packer_palette), but neither gives weights to given residues. One simple option to do that would be to add multiple FavorNativeResidue constraints —they need not be wrapped in AmbiguousConstraint because the non-scoring ones get zero.
Variable length
When I used RFdiffusion I gave the algorithm a range in the contigmap, thus allowing the best length of an insertion to win. This does however make the pipeline a bit complicated. First I had to adapt the helper scripts of ProteinMPNN to be used as Python functions (see footnote). There are then two options for the design step: refer to the metadata or do a sequence alignment. The former requires being organised, while the latter could use the alignment code discussed above.
def design_different_by_trb(pose: pyrosetta.Pose, trb: Dict[str, Any], cycles = 5, scorefxn=None):
"""The RFdiffusion case"""
idx_sele = prc.select.residue_selector.ResidueIndexSelector()
for idx0, b in enumerate(trb['inpaint_seq']):
if b:
continue
idx_sele.append_index(idx0 + 1)
task_factory: prc.pack.task.TaskFactory = create_design_tf(pose, design_sele=idx_sele, distance=0)
if scorefxn is None:
scorefxn = pyrosetta.get_fa_scorefxn()
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, cycles)
relax.set_enable_design(True)
relax.set_task_factory(task_factory)
relax.apply(pose)
def design_different_by_alignment(pose: pyrosetta.Pose, ref: pyrosetta.Pose, cycles = 5, scorefxn=None):
"""Not the RFdiffusion case"""
ref = ref.split_by_chain(1)
ref2pose: dict = align_for_atom_map(pose.split_by_chain(1), ref)
conserved = list(ref2pose.values())
idx_sele = pr_res.ResidueIndexSelector()
for i in range(1, len(pose.chain_sequence(1))):
if i not in conserved:
idx_sele.append_index(i)
task_factory: prc.pack.task.TaskFactory = create_design_tf(pose, design_sele=idx_sele, distance=0)
if scorefxn is None:
scorefxn = pyrosetta.get_fa_scorefxn()
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, cycles)
relax.set_enable_design(True)
relax.set_task_factory(task_factory)
relax.apply(pose)
The degree of change during a FastDesign run depends on how strained the residues are. So it important that the design be FastRelax properly first and not all at once.
Actually I often use FastRelax in a first pass with the backbone fixed to prevent the model from blowing up.
This degree of change _I think_ makes a good metric for scoring. If the pose.sequence() changed by more than 20% (irrespective of native favouring), then that is hard no.
Scoring
RFdiffusion will be made to generate lots of designs. The reason is because they are not all amazing. So they need to be scored.
Coarse-grain clashes
First, we need to determine if the polyglycine skeleton will clash with the extended neighbourhood when the protein is part of a greater machinery that could not be feasibly fed into RFdiffusion. An example is a membrane, say one were designing a binder to a transmembrane protein, the design should not cross the membrane, so one could get the transmembrane protein from OPM database, superpose the skeleton and use the layers of O / N atoms. For a project of mine, I had a multidimensional polymeric lattice. Scoring via a scorefunction and using only the Lenard–Jones repulsion is a tad overkill and prone to errors. So simply doing distance based calculations works:
skeleton: pyrosetta.Pose = ... clasher: pyrosetta.Pose = ... # where the skeleton combined with this pose, it should not have clashes xyz_skeleton = extract_coords(skeleton) xyz_clasher = extract_coords(clasher) all_distances = np.sqrt(np.sum((xyz_clasher[:, np.newaxis, :] - xyz_skeleton[np.newaxis, :, :]) ** 2, axis=-1)) n_clashing = np.count_nonzero(all_distances < 1.)
If designing interactions, the above but with < 3. will given backbone hydrogen bonding to any atom of the reference.
Say the skeleton passed, we then generate the sequence with proteinMPNN, thread it, relax it and tune the sequence. Now we can do some proper scoring.
∆∆G
The first is the predicted Gibbs free energy from a scorefunction. The catch is that this absolute energy, which should not be taken too seriously. The reason being is that they are optimised for ∆∆G calculations, so have empirical terms (i.e. fudge factors) to make them work, including the REF value (a per residue type weight, ref2015 values shown below), which means that were one to make a linear string of peptide it will not have a predicted absolute energy of folding of zero. However, in terms of magnitude is there is a screaming high value then the model is for the bin.
{'ALA': 1.32468, 'ARG': -0.09474, 'ASN': -1.34026, 'ASP': -2.14574, 'CYS': 3.25479, 'GLN': -1.45095, 'GLU': -2.72453, 'GLY': 0.79816, 'HIS': -0.30065, 'ILE': 2.30374, 'LEU': 1.66147, 'LYS': -0.71458, 'MET': 1.65735, 'PHE': 1.21829, 'PRO': -1.64321, 'SER': -0.28969, 'THR': 1.15175, 'TRP': 2.26099, 'TYR': 0.58223, 'VAL': 2.64269}
Per residue score
The scorefunction is a sum of the scores of each particle/residue, and the per-residue scores are obtainable. These are a common way to determine how accurate the model likely is. The per-residue score should not >+10 kcal/mol, and most residues in minimised PDB protein are under 5 kcal/mol, with one or two at the +8 kcal/mol mark likely due to removed ligands and crystalline waters (implicit solvent is an approximation of bulk solvent, not frozen waters). Two things are worth remembering, the ScoreFunction does not store any data, but Pose.energies does. But both can access this data. Personally I prefer the latter, but the former has weights which is handy.
prc: ModuleType = pyrosetta.rosetta.core
pr_scoring: ModuleType = pyrosetta.rosetta.core.scoring
scorefxn = pyrosetta.ScoreFunction = pyrosetta.get_fa_scorefxn()
# showcase weights:
weights: Dict[str, float] = {w.name: scorefxn.get_weight(w) for w in scorefxn.get_nonzero_weighted_scoretypes()}
print(weights)
# get per residue scores:
res_scores = []
for i in range(1, monomer.total_residue() + 1):
v = pru.vector1_bool(pose.total_residue())
v[i] = 1
res_scores.append(scorefxn.get_sub_score(pose, v))
# some magic can be done with the annoying EMapVector
v: pr_scoring.EMapVector = scorefxn.weights()
print( v[pr_scoring.total_score], v[pr_scoring.fa_sol] )
Where using pose.energies is a lot easier
scorefxn(pose) # fills pose.energies energies: pr_scoring.Energies = pose.energies() typed_energies: Dict[str, float] = energies.active_total_energies() a: npt.NDArray = energies.residue_total_energies_array() energy = pd.DataFrame(a) # amazingly civilised # alternative a specific residue can be queried via EMapVector as above v: pr_scoring.EMapVector = energies.residue_total_energies(1)
Do note that due to the fact that it takes two particles for a two body score (e.g. hydrogen bond), there’s a function to toggle whether these are halved or not (cf. this post).
scorefxn.weights() # Create the EnergyMap emopts = pr_scoring.methods.EnergyMethodOptions(scorefxn.energy_method_options()) emopts.hbond_options().decompose_bb_hb_into_pair_energies(True) scorefxn.set_energy_method_options(emopts)
Parenthetically, if the pd.DataFrame(pose.energies().residue_total_energies_array()) was not saved, but is needed again, here’s a function to read the junk on a scored dumped PDB.
def rosetta_pdb_to_df(pdbblock: str) -> pd.DataFrame:
parsable = False
_data = []
for line in pdbblock.split('\n'):
if '#BEGIN_POSE_ENERGIES_TABLE' in line:
parsable = True
continue
elif '#END_POSE_ENERGIES_TABLE' in line:
break
elif not parsable:
continue
parts = line.strip().split()
if parts[0] == 'label':
_data.append(parts)
elif parts[0] == 'weights':
_data.append([parts[0]] + list(map(float, parts[1:-1])) + [float('nan')])
else:
_data.append([parts[0]] + list(map(float, parts[1:])))
data = pd.DataFrame(_data)
data.columns = data.iloc[0]
data = data.iloc[1:].copy()
return data
Interface energy
The thing we want to have is a nice interface. There are two ways to this. One is civilised, but prone to segfaults: the InterfaceAnalyzerMover mover.
def score_interface(complex: Union[pyrosetta.Pose, Sequence[pyrosetta.Pose]], interface: str):
if isinstance(complex, Sequence):
_complex = complex[0].clone()
for c in complex[1:]:
add_chain(_complex, c)
complex = _complex
ia = pyrosetta.rosetta.protocols.analysis.InterfaceAnalyzerMover(interface)
ia.apply(complex)
return {'complex_energy': ia.get_complex_energy(),
'separated_interface_energy': ia.get_separated_interface_energy(),
'complexed_sasa': ia.get_complexed_sasa(),
'crossterm_interface_energy': ia.get_crossterm_interface_energy(),
'interface_dG': ia.get_interface_dG(),
'interface_delta_sasa': ia.get_interface_delta_sasa()}
Like many other movers, no ligands are allowed.
Interface is a chain letter form, say A_BC. This crashes out above 3 chains I believe.
The other way is scoring the monomers in isolation (pose.split_chain) or translating a silly distance after having reset the pose.energies but preferably repacked the surface layer.
Changed sequence
As said above, how many residues changed when tweaked can be a proxy for something being not quite right.
Number of neighbours
pose2pdb = oligomer.pdb_info().pose2pdb
chainA_sele = pr_res.ChainSelector('A')
# boolean vector:
v = pr_res.NeighborhoodResidueSelector(chainA_sele, distance, False).apply(oligomer)
# vector of pose idxs:
close_residues = prc.select.get_residues_from_subset(v)
# array of PDB numbers
print( [pose2pdb(r) for r in close_residues] )
In another blog post I go through the two different residue neighbourhood selectors, but briefly, NeighborhoodResidueSelector is centroid (roughly the beta carbon), while CloseContactResidueSelector is closest atom.
cc_sele = pr_res.CloseContactResidueSelector() cc_sele.central_residue_group_selector(chainA_sele) cc_sele.threshold(3) # Å
Hydrophobicity & co.
In a tool developed in OPIG, Therapeutic Antibody Profiler, an set of metrics where characterised, such as patches of surface hydrophobicity etc. which most likely would be well worth using in protein binder design. (I’ve not used these yet so don’t have a snippet to share).
Movement
One may be after rigidity. An MD run is best for that, but pyrosetta can do (discussed in detail in another blog post) via the BackrubMover mover.
def movement(original: pyrosetta.Pose,
trials: int = 100, temperature: int = 1.0, replicate_number: int = 20) -> List[float]:
scorefxn = pyrosetta.get_fa_scorefxn()
backrub = pyrosetta.rosetta.protocols.backrub.BackrubMover()
monégasque = pyrosetta.rosetta.protocols.monte_carlo.GenericMonteCarloMover(maxtrials=trials,
max_accepted_trials=trials,
# gen.max_accepted_trials() = 0
task_scaling=5,
# gen.task_scaling()
mover=backrub,
temperature=temperature,
sample_type='low',
drift=True)
monégasque.set_scorefxn(scorefxn)
# find most deviant
rs = []
for i in range(replicate_number):
variant = original.clone()
monégasque.apply(variant)
if monégasque.accept_counter() > 0:
rs.append(pr_scoring.bb_rmsd(variant, original))
else:
rs.append(float('nan'))
return rs
Conclusion
Hopefully this exhaustive, and exhausting, overview of the movers and functions in PyRosetta, has adequately showcases what operations can be done in PyRosetta to better tune and analyse variants from RFdiffusion.
Footnote: ProteinMPNN helper as API functions
As mentioned, my designs had different lengths and I wanted to create the definitions for ProteinMPNN via my script as it was reading from tar.gz archives and other details. But briefly, I parsed the RFdiffusion skeletons thusly:
no_fix = []
definitions = []
global_fixed_chains: Dict[str, List[List[str]]] = {}
global_fixed_positions = {}
for path in paths:
# ## Load data
name = path.stem
pdbblock = get_ATOM_only(path.read_text())
# ## Parse definition (i.e. coordinates)
definition = parse_PDBblock(pdbblock, name)
definitions.append(definition)
# ## Fixed chain
fixed_chains = define_fixed_chains(definition, designed_chain_list='A') # chain A is designed
global_fixed_chains[name] = fixed_chains
# ## Fixed pos
sequence = get_chainA_sequence(pdbblock) # assuming the sequence to change is chain A
# The skeleton will have stretches of glycines where the design was made:
masked = re.sub(r'(G{3,})', lambda match: '_' * len(match.group(1)), sequence)
fixed_list = [[i for i, r in enumerate(masked) if r == '_']]
fixed_positions = define_unfixed_positions(definition, ['A'], fixed_list)
global_fixed_positions[name] = fixed_positions
# write out definitions, global_fixed_chains, global_fixed_positions
with open(chains_definitions_path, 'w') as fh: # only chains_definitions.jsonl is a JSONL
for definition in definitions:
fh.write(json.dumps(definition) + '\n')
with open(fixed_chains_path, 'w') as fh:
json.dump(global_fixed_chains, fh)
with open(fixed_positions_path, 'w') as fh:
json.dump(global_fixed_positions, fh)
Where the functions are:
"""
This is a minor refactoring of the original code, with the following changes:
`parse_PDBblock(pdbblock: str, name: str, chain_alphabet, ca_only=False)`
accepts a PDB block ``pdbblock`` and a name ``name`` and returns a dictionary that forms a line for the JSONL.
The code is mostly kept as was, with minor chances.
"""
import json
from typing import List, Sequence, Dict, Any
import numpy as np
alpha_1 = list("ARNDCQEGHILKMFPSTWYV-")
states = len(alpha_1)
alpha_3 = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL', 'GAP']
aa_1_N = {a: n for n, a in enumerate(alpha_1)}
aa_3_N = {a: n for n, a in enumerate(alpha_3)}
aa_N_1 = {n: a for n, a in enumerate(alpha_1)}
aa_1_3 = {a: b for a, b in zip(alpha_1, alpha_3)}
aa_3_1 = {b: a for a, b in zip(alpha_1, alpha_3)}
def get_ATOM_only(pdbblock: str) -> str:
"""
This gets all ATOM, regardless of name and chain
"""
return '\n'.join([line for line in pdbblock.splitlines() if line.startswith('ATOM')])
def get_chainA_sequence(pdbblock: str) -> str:
sequence = ''
residues_seen = set()
for line in pdbblock.splitlines():
if line.startswith("ATOM") and " CA " in line and " A " in line:
res_info = line[17:26] # Residue name and number for uniqueness
if res_info not in residues_seen:
residues_seen.add(res_info)
res_name = line[17:20].strip()
sequence += three_to_one.get(res_name, '?')
return sequence
three_to_one = {
'ALA': 'A', 'CYS': 'C', 'ASP': 'D', 'GLU': 'E',
'PHE': 'F', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
'LYS': 'K', 'LEU': 'L', 'MET': 'M', 'ASN': 'N',
'PRO': 'P', 'GLN': 'Q', 'ARG': 'R', 'SER': 'S',
'THR': 'T', 'VAL': 'V', 'TRP': 'W', 'TYR': 'Y'
}
def AA_to_N(x):
# ["ARND"] -> [[0,1,2,3]]
x = np.array(x);
if x.ndim == 0: x = x[None]
return [[aa_1_N.get(a, states - 1) for a in y] for y in x]
def N_to_AA(x):
# [[0,1,2,3]] -> ["ARND"]
x = np.array(x);
if x.ndim == 1: x = x[None]
return ["".join([aa_N_1.get(a, "-") for a in y]) for y in x]
def inner_parse_PDBblock(pdbblock, atoms=['N', 'CA', 'C'], chain=None) -> tuple:
'''
input: pdb_filename = PDB filename
atoms = atoms to extract (optional)
output: (length, atoms, coords=(x,y,z)), sequence
'''
xyz, seq, min_resn, max_resn = {}, {}, 1e6, -1e6
for line in pdbblock.split("\n"):
if line[:6] == "HETATM" and line[17:17 + 3] == "MSE":
line = line.replace("HETATM", "ATOM ")
line = line.replace("MSE", "MET")
if line[:4] == "ATOM":
ch = line[21:22]
if ch == chain or chain is None:
atom = line[12:12 + 4].strip()
resi = line[17:17 + 3]
resn = line[22:22 + 5].strip()
x, y, z = [float(line[i:(i + 8)]) for i in [30, 38, 46]]
if resn[-1].isalpha():
resa, resn = resn[-1], int(resn[:-1]) - 1
else:
resa, resn = "", int(resn) - 1
# resn = int(resn)
if resn < min_resn:
min_resn = resn
if resn > max_resn:
max_resn = resn
if resn not in xyz:
xyz[resn] = {}
if resa not in xyz[resn]:
xyz[resn][resa] = {}
if resn not in seq:
seq[resn] = {}
if resa not in seq[resn]:
seq[resn][resa] = resi
if atom not in xyz[resn][resa]:
xyz[resn][resa][atom] = np.array([x, y, z])
# ^^ end of xyz loop
# convert to numpy arrays, fill in missing values
seq_, xyz_ = [], []
try:
for resn in range(min_resn, max_resn + 1):
if resn in seq:
for k in sorted(seq[resn]): seq_.append(aa_3_N.get(seq[resn][k], 20))
else:
seq_.append(20)
if resn in xyz:
for k in sorted(xyz[resn]):
for atom in atoms:
if atom in xyz[resn][k]:
xyz_.append(xyz[resn][k][atom])
else:
xyz_.append(np.full(3, np.nan))
else:
for atom in atoms: xyz_.append(np.full(3, np.nan))
return np.array(xyz_).reshape(-1, len(atoms), 3), N_to_AA(np.array(seq_))
except TypeError:
return 'no_chain', 'no_chain'
def get_chain_alphabet():
init_alphabet = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
extra_alphabet = [str(item) for item in list(np.arange(300))]
return init_alphabet + extra_alphabet
def parse_PDBblock(pdbblock: str, name: str, ca_only=False):
chain_alphabet = get_chain_alphabet()
my_dict = {}
s = 0
concat_seq = ''
for letter in chain_alphabet:
if ca_only:
sidechain_atoms = ['CA']
else:
sidechain_atoms = ['N', 'CA', 'C', 'O']
xyz, seq = inner_parse_PDBblock(pdbblock=pdbblock, atoms=sidechain_atoms, chain=letter)
if type(xyz) != str:
concat_seq += seq[0]
my_dict['seq_chain_' + letter] = seq[0]
coords_dict_chain = {}
if ca_only:
coords_dict_chain['CA_chain_' + letter] = xyz.tolist()
else:
coords_dict_chain['N_chain_' + letter] = xyz[:, 0, :].tolist()
coords_dict_chain['CA_chain_' + letter] = xyz[:, 1, :].tolist()
coords_dict_chain['C_chain_' + letter] = xyz[:, 2, :].tolist()
coords_dict_chain['O_chain_' + letter] = xyz[:, 3, :].tolist()
my_dict['coords_chain_' + letter] = coords_dict_chain
s += 1
my_dict['name'] = name
my_dict['num_of_chains'] = s
my_dict['seq'] = concat_seq
if s < len(chain_alphabet):
return my_dict
else:
raise Exception('Too many chains')
def define_fixed_chains(chains_definition: Dict[str, Any], designed_chain_list=('A',)):
"""
The fixed chain definition file is a JSON dictionary of name to tuple/list of two: designed_chain_list and fixed_chain_list
"""
all_chain_list = [item[-1:] for item in list(chains_definition) if item[:9] == 'seq_chain'] # ['A','B', 'C',...]
fixed_chain_list = [letter for letter in all_chain_list if
letter not in designed_chain_list] # fix/do not redesign these chains
return list(designed_chain_list), fixed_chain_list
def define_global_fixed_chains(chains_definitions, global_designed_chain_list=('A',)):
return {chains_definition['name']: define_fixed_chains(chains_definition, global_designed_chain_list)
for chains_definition in chains_definitions}
def define_fixed_positions(chains_definition: Dict[str, Any],
designed_chain_list: Sequence[str],
fixed_list: Sequence[Sequence[int]]):
all_chain_list = [item[-1:] for item in list(chains_definition) if item[:9] == 'seq_chain']
fixed_position_dict = {}
for i, chain in enumerate(designed_chain_list):
fixed_position_dict[chain] = fixed_list[i]
for chain in all_chain_list:
if chain not in designed_chain_list:
fixed_position_dict[chain] = []
return fixed_position_dict
def define_unfixed_positions(chains_definition,
designed_chain_list: Sequence[str],
unfixed_list: Sequence[Sequence[int]]):
"""
This will be an entry in a dictionary passed to ``chain_id_jsonl``
(Misnomer: it's not a JSONL, but a JSON, only ``jsonl_path`` is)
:param chains_definition:
:param designed_chain_list:
:param unfixed_list:
:return:
"""
all_chain_list = [item[-1:] for item in list(chains_definition) if item[:9] == 'seq_chain']
fixed_position_dict = {}
for chain in all_chain_list:
seq_length = len(chains_definition[f'seq_chain_{chain}'])
all_residue_list = (np.arange(seq_length) + 1).tolist()
if chain not in designed_chain_list:
fixed_position_dict[chain] = all_residue_list
else:
idx = np.argwhere(np.array(designed_chain_list) == chain)[0][0]
fixed_position_dict[chain] = list(set(all_residue_list) - set(unfixed_list[idx]))
return fixed_position_dict


