Work with Proteins
This document describes how to work with proteins and use them in Deep Origin tools.
The Protein
class¶
The Protein
class is the primary way to work with proteins in Deep Origin.
Constructing a protein¶
From a file¶
A protein can be constructed from a file:
from deeporigin.drug_discovery import Protein, EXAMPLE_DATA_DIR
protein = Protein.from_file(EXAMPLE_DATA_DIR / "brd.pdb")
From a PDB ID¶
A protein can also be constructed from a PDB ID:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
Inspecting the Protein¶
PDB ID¶
To view the PDB ID of a Protein (if it exists, use):
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
protein.pdb_id
Expected output
1EBY
Getting the protein sequence¶
You can retrieve the amino acid sequences of all polypeptide chains in a protein structure using the sequence
property:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
sequences = protein.sequence
for seq in sequences:
print(seq)
This property returns a list of amino acid sequences (as Bio.Seq objects) for each polypeptide chain found in the structure. If the structure contains multiple chains, each chain's sequence is included as a separate entry in the list. This is useful for analyzing the primary structure of the protein or for downstream sequence-based analyses.
Expected output
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
Finding missing residues¶
You can identify missing residues (gaps) in the protein structure using the find_missing_residues
method:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("5QSP")
missing = protein.find_missing_residues()
print(missing)
This method scans each chain in the protein and returns a dictionary where the keys are chain IDs and the values are lists of tuples, each representing a gap. Each tuple is of the form (start_resseq, end_resseq)
, indicating that residues between start_resseq
and end_resseq
(exclusive) are missing from the structure.
Expected output
{'A': [(511, 514), (547, 550), (679, 682), (841, 855)],
'B': [(509, 516), (546, 551), (679, 684), (840, 854)]}
Visualizing a protein¶
Browser support
These visualizations work best on Google Chrome. We are aware of issues on other browsers, especially Safari on macOS.
A protein object can be visualized using show
:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
protein.show()
A visualization such as this will be shown:
Jupyter notebook required
Visualizations such as these require this code to be run in a jupyter notebook. We recommend using these instructions to install Jupyter.
Modifying and preparing a protein¶
Loop modelling¶
Missing information and gaps in the structure can be filled in using the Loop Modelling tool.
For example, this protein from the PDB has missing elements, as can be seen from the dashed lines below:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("5QSP")
protein.show()
We can verify that there are missing residues using the find_missing_residues
method:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("5QSP")
protein.find_missing_residues()
Expected output
{'A': [(511, 514), (547, 550), (679, 682), (841, 855)],
'B': [(509, 516), (546, 551), (679, 684), (840, 854)]}
We can use the loop modelling tool to fix this structure using:
protein.model_loops()
protein.show()
We can verify that there are no missing residues anymore:
protein.find_missing_residues()
Expected output
{}
How does loop modelling work?
The current implementation of LoopModeling tool can use known experimental or predicted structures to fill gaps in given protein structure.
The tool works by searching for potential templates for each chain with missing residues in Protein Data Bank (PDB) and specified directory of templates. If the PDB contains the Uniprot IDs, this can also be used to download the predicted AlphaFold2 structures from AF Structural Database.
First, for each chain the full sequence and the sequence of the resolved structure are extracted and aligned to identify gaps as continuous groups of missing residues. If gaps are found the PDB database is searched for templates using specified sequence identity threshold. Structures in the additional template directory and AF structure are added for consideration.
For each template, global 3D alignment is first performed and an attempt is made to transfer the motifs corresponding to missing residues in the target if corresponding residues are present in the given template. The success is evaluated based on CA-CA distances at the edges of the gap and sequence identity of the residues to be transferred.
If the global alignment fails for the given gap, the local alignment is attempted using the specified number of residues adjacent to the gap and the transfer of the structural motif is again attempted.
Based on each found template, a model is constructed with structural motifs that were successfully matched. If the b_mixed_models flag is on, the attempt will be made to fill the gaps where matching was not successful using models based on other templates, sorted by resolution.
Finally, the results for all chains are combined to obtain N possible structures using best models obtained for each chain.
Removing specific residues¶
You can remove specific residue names from a protein structure using the remove_resnames
method:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
# Remove water molecules (HOH) and ions (NA, CL)
protein.remove_resnames(exclude_resnames=["HOH", "NA", "CL"])
Residue name format
Residue names in PDB files are always uppercase (e.g., "HOH" for water, "NA" for sodium, "CL" for chloride).
This method modifies the protein structure in place by removing the specified residue names. If no residue names are provided, the protein structure remains unchanged.
Removing HETATM records¶
You can remove HETATM records from the protein structure using the remove_hetatm
method:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
# Remove all HETATM records except water and zinc
protein.remove_hetatm(keep_resnames=["HOH"], remove_metals=["ZN"])
This method modifies the protein structure in place by removing HETATM records (heteroatoms) from the structure. You can:
- Keep specific residues by providing their names in
keep_resnames
- Keep specific metals by providing their names in
remove_metals
(these metals will be excluded from removal) - If no arguments are provided, all HETATM records will be removed
Metal names
Metal names should be provided in uppercase (e.g., "ZN" for zinc, "FE" for iron).
Removing water molecules¶
You can remove all water molecules from a protein structure using the remove_water
method:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
protein.remove_water()
This method modifies the protein structure in place by removing all water molecules (residue name "HOH"). Unlike remove_resnames
, this method does not return a new protein structure but instead modifies the existing one.
Removing Specific Residues¶
You can remove specific residue types while keeping others:
from deeporigin.drug_discovery import Protein
protein = Protein.from_pdb_id("1EBY")
# Remove specific residue names
protein.remove_resnames(exclude_resnames=["HOH", "SO4"])
# Remove only water molecules
protein_no_water = protein.remove_water()
Chain Selection¶
You can select specific chains from the protein structure:
# Select a single chain
chain_a = protein.select_chain('A')
# Select multiple chains
chains_ab = protein.select_chains(['A', 'B'])
Best Practices¶
- Always visualize the structure before and after preparation to ensure the desired changes were made
- When working with metalloproteins, use
remove_hetatm()
with appropriatekeep_resnames
orremove_metals
parameters - For multi-chain proteins, consider selecting specific chains before preparation
- Save the prepared structure using
to_pdb()
if you need to use it later:
# Save the prepared structure
protein.to_pdb("prepared_protein.pdb")
Common Use Cases¶
Preparing a Protein for Docking¶
# Load and prepare protein
protein = Protein.from_pdb_id("1EBY")
protein.remove_water() # Remove water molecules
protein.remove_hetatm(keep_resnames=['ZN']) # Keep important cofactors
protein.to_pdb("docking_ready.pdb")
Working with Metalloproteins¶
# Load and prepare metalloprotein
protein = Protein.from_pdb_id("1XYZ")
protein.remove_hetatm(remove_metals=['ZN', 'MG']) # Keep metal ions
protein.show()
Multi-chain Protein Preparation¶
# Load and prepare multi-chain protein
protein = Protein.from_pdb_id("1ABC")
chains_ab = protein.select_chains(['A', 'B']) # Select chains A and B
chains_ab.remove_water() # Remove water from selected chains
chains_ab.show()