Work with Ligands

This document describes how to work with ligands (molecules) and use them in Deep Origin tools.

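There are two classes that help you work with ligands:

  • Ligand, which represents a single molecule
  • LigandSet, which represents a collection of Ligand objects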

Constructing a Ligand or LigandSet

From an SDF file

A single Ligand can be constructed from an SDF file:

from deeporigin.drug_discovery import Ligand, BRD_DATA_DIR

ligand = Ligand.from_sdf(BRD_DATA_DIR / "brd-2.sdf")

A LigandSet can be constructed from an SDF file:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_sdf(DATA_DIR / "ligands" / "ligands-brd-all.sdf")

A LigandSet can be constructed from multiple SDF files by concatenating them together:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

# List of SDF file paths
sdf_files = [
    DATA_DIR / "ligands" / "ligands-brd-all.sdf",
    DATA_DIR / "ligands" / "42-ligands.sdf"
]

# Create LigandSet from multiple files
ligands = LigandSet.from_sdf_files(sdf_files)

# The resulting LigandSet contains all ligands from both files
print(f"Total ligands: {len(ligands)}")  # Should be 8 + 42 = 50

This is particularly useful when you have:

  • Multiple SDF files from different experiments
  • Split datasets that you want to combine
  • Files from different sources that need to be merged
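The expected total can be checked independently of the library. In the SDF format, each record ends with a line containing `$$$$`, so counting those lines gives the number of molecules in a file. The sketch below is a plain-Python check, not part of the Deep Origin API: `count_sdf_records` is a hypothetical helper, and the in-memory "files" hold placeholder text rather than real molecules.

```python
import io


def count_sdf_records(handle) -> int:
    """Count molecule records in an SDF stream.

    Each record in an SDF file is terminated by a line containing `$$$$`.
    """
    return sum(1 for line in handle if line.strip() == "$$$$")


# Two tiny stand-in "files" (placeholder content, not real molecules)
file_a = io.StringIO("mol-1\n...\nM  END\n$$$$\nmol-2\n...\nM  END\n$$$$\n")
file_b = io.StringIO("mol-3\n...\nM  END\n$$$$\n")

total = count_sdf_records(file_a) + count_sdf_records(file_b)
print(f"Total records: {total}")
```

With real files, pass an open file handle instead, for example `with open(path) as fh: count_sdf_records(fh)`.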

From SMILES string(s)

A ligand can be constructed from a SMILES string, which is a compact way to represent molecular structures:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles(
    smiles="c1ccccc1",
    name="Benzene",  # Optional name for the ligand
)

SMILES Validation

The constructor will raise an exception if the provided SMILES string is invalid or cannot be parsed into a valid molecule.

A LigandSet can be constructed from a list or set of SMILES strings:

from deeporigin.drug_discovery import LigandSet

smiles = {
    "C/C=C/Cn1cc(-c2cccc(C(=O)N(C)C)c2)c2cc[nH]c2c1=O",
    "C=CCCn1cc(-c2cccc(C(=O)N(C)C)c2)c2cc[nH]c2c1=O",
}

ligands = LigandSet.from_smiles(smiles)

From a Chemical Identifier

You can create a ligand from common chemical identifiers (like PubChem names, common names, or drug names). This is particularly useful when working with well-known biochemical molecules:

from deeporigin.drug_discovery import Ligand

# Create ligands from common biochemical names
atp = Ligand.from_identifier(
    identifier="ATP",  
)

serotonin = Ligand.from_identifier(
    identifier="serotonin", 
)

The from_identifier constructor:

  • Accepts common chemical names and identifiers
  • Automatically resolves the identifier to a molecular structure
  • Creates a 3D conformation of the molecule
  • Particularly useful for well-known biochemical molecules like:
    • Nucleotides (ATP, ADP, GTP, etc.)
    • Neurotransmitters (serotonin, dopamine, etc.)
    • Drug molecules (by their generic names)
    • Common metabolites and cofactors

Identifier Resolution

The constructor will attempt to resolve the identifier using chemical databases. If the identifier cannot be resolved, it will raise an exception.

From an RDKit Mol object

If you're working with RDKit molecules directly, you can create a Ligand from an RDKit Mol object:

from deeporigin.drug_discovery import Ligand
from rdkit import Chem

# Create an RDKit molecule
mol = Chem.MolFromSmiles("CCO")  # Ethanol

# Convert to a Ligand
ligand = Ligand.from_rdkit_mol(
    mol=mol,
    name="Ethanol",  # Optional name for the ligand
)

This is particularly useful when you're working with RDKit's molecular manipulation functions and want to convert the results into a Ligand for further processing or visualization.

You can also create a LigandSet from a list of RDKit molecules:

from deeporigin.drug_discovery import LigandSet
from rdkit import Chem

mols = [Chem.MolFromSmiles("CCO"), Chem.MolFromSmiles("CCCO")]
ligands = LigandSet.from_rdkit_mols(mols)

From a CSV file

You can also create a LigandSet from a CSV file containing SMILES strings and optional properties:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_csv(
    file_path=DATA_DIR / "ligands" / "ligands.csv",
    smiles_column="SMILES"  # Optional, defaults to "smiles"
)

The method will:

  • Read the CSV file using pandas
  • Extract SMILES strings from the specified column
  • Create a Ligand instance for each valid SMILES
  • Store all other columns as properties in each Ligand instance
  • Skip any rows with empty or invalid SMILES strings

Error Handling

The method will raise:

  • FileNotFoundError if the CSV file does not exist
  • DeepOriginException if the specified SMILES column is not found in the CSV file
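To make the row handling concrete, here is a minimal sketch of the same steps using only the standard library. It is not the library's implementation: `read_smiles_rows` is a hypothetical helper, `ValueError` stands in for `DeepOriginException`, the CSV content is made up, and the sketch only skips empty SMILES (it does not check whether a SMILES string is chemically valid).

```python
import csv
import io


def read_smiles_rows(handle, smiles_column="smiles"):
    """Return (smiles, properties) pairs, skipping rows with an empty SMILES."""
    reader = csv.DictReader(handle)
    if smiles_column not in (reader.fieldnames or []):
        # Stand-in for the library's DeepOriginException
        raise ValueError(f"column {smiles_column!r} not found")
    rows = []
    for row in reader:
        smiles = (row.pop(smiles_column) or "").strip()
        if not smiles:
            continue  # skip rows with no SMILES
        rows.append((smiles, row))  # remaining columns become properties
    return rows


# Made-up CSV content; the second data row has no SMILES and is skipped
csv_text = (
    "SMILES,name,r_exp_dg\n"
    "CCO,ethanol,-1.2\n"
    ",missing,-0.5\n"
    "c1ccccc1,benzene,-2.0\n"
)
rows = read_smiles_rows(io.StringIO(csv_text), smiles_column="SMILES")
for smiles, props in rows:
    print(smiles, props)
```

Note that values read this way are strings; a pandas-based reader would typically infer numeric types for columns such as `r_exp_dg`.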

From a directory

You can create a LigandSet from a directory containing SDF and CSV files:

from deeporigin.drug_discovery import LigandSet, BRD_DATA_DIR

ligands = LigandSet.from_dir(BRD_DATA_DIR)

This will read all .sdf and .csv files in the directory and combine them into a single LigandSet.

Filtering Top Poses

When working with docking results, you often have multiple poses for the same molecule. The filter_top_poses() method helps you select only the best pose for each unique molecule:

# assuming poses comes from protein.dock() or Complex.docking.get_results()

# Filter to keep only the best pose per molecule (by binding energy)
best_poses = poses.filter_top_poses()

# Or filter by pose score instead
best_poses = poses.filter_top_poses(by_pose_score=True)

Creates New Object

The filter_top_poses() method creates a new LigandSet containing only the best pose for each unique molecule. The original LigandSet is not modified. By default, it selects poses by minimum binding energy, but you can use by_pose_score=True to select by maximum pose score instead.
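The selection rule can be sketched in plain Python. This is an illustration of the idea only, not the library's code: poses are represented as dictionaries, molecules are grouped by a `smiles` key, and the property names `Binding Energy` and `POSE SCORE` are borrowed from the plotting example in this document. All of these are assumptions, as are the numbers.

```python
def best_pose_per_molecule(poses, by_pose_score=False):
    """Keep one pose per SMILES: lowest binding energy, or highest pose score."""
    best = {}
    for pose in poses:
        key = pose["smiles"]
        current = best.get(key)
        if current is None:
            best[key] = pose
        elif by_pose_score:
            if pose["POSE SCORE"] > current["POSE SCORE"]:
                best[key] = pose
        elif pose["Binding Energy"] < current["Binding Energy"]:
            best[key] = pose
    return list(best.values())  # new list; the input is left untouched


# Made-up docking results: two poses of one molecule, one pose of another
poses = [
    {"smiles": "CCO", "Binding Energy": -5.1, "POSE SCORE": 0.70},
    {"smiles": "CCO", "Binding Energy": -6.3, "POSE SCORE": 0.55},
    {"smiles": "CCCO", "Binding Energy": -4.8, "POSE SCORE": 0.90},
]

print(best_pose_per_molecule(poses))
print(best_pose_per_molecule(poses, by_pose_score=True))
```

The two criteria can disagree, as they do for the first molecule here, so it is worth deciding up front which one matters for your analysis.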

Visualization

Jupyter notebook required

These visualizations only render when the code is run in a Jupyter notebook. We recommend using these instructions to install Jupyter.

Browser support

These visualizations work best on Google Chrome. We are aware of issues on other browsers, especially Safari on macOS.

Ligands

A ligand object can be visualized using show:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_identifier("serotonin")

ligand.show()

A visualization similar to the following will be shown:

LigandSets

A LigandSet can be visualized using several different methods.

Summary card

Simply inspecting the LigandSet object shows the following:

ligands

LigandSet with 8 ligands

8 unique SMILES

Properties: initial_smiles, r_exp_dg

Use .to_dataframe() to convert to a dataframe, .show_df() to view dataframe with structures, or .show() for 3D visualization

Table view (2D)

To view a dataframe containing rendered (2D) structures of ligands, use:

# for example
from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_sdf(DATA_DIR / "ligands" / "ligands-brd-all.sdf")
ligands.show_df()

Expected Output

Individual view (3D)

To view 3D structures of all ligands in a LigandSet, use:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_sdf(DATA_DIR / "ligands" / "ligands-brd-all.sdf")
ligands.show()

A visualization similar to this will be shown. Use the arrows to switch between Ligands in the LigandSet.

Grid view (2D)

To view a grid of all 2D structures of all ligands in the LigandSet, use:

ligands.show_grid()

Expected Output

Operations on Ligands

Preparing Ligands

You can prepare a ligand for downstream workflows using the prepare() method. This performs salt removal, kekulization, and validates atom types:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("c1ccccc1")
ligand.prepare(remove_hydrogens=False)  # Mutates the ligand in place, returns self for chaining

Mutation Behavior

The prepare() method mutates the ligand object in place and returns self for method chaining.

Generating 3D Coordinates

You can generate 3D coordinates for a single ligand or all ligands in a LigandSet using the embed() method. This is useful for preparing ligands for docking or other modeling tasks that require 3D structures.

from deeporigin.drug_discovery import Ligand, BRD_DATA_DIR

ligand = Ligand.from_sdf(BRD_DATA_DIR / "brd-2.sdf")
ligand.embed()  # Generates 3D coordinates in place

Mutation Behavior

The embed() method mutates the ligand object by adding 3D coordinates to the molecule.

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_sdf(DATA_DIR / "ligands" / "ligands-brd-all.sdf")
ligands.embed()  # Generates 3D coordinates for all ligands in place

This will call the embed() method on each ligand in the set, updating their 3D coordinates. The method returns the LigandSet itself for convenience, so you can chain further operations if desired.

Mutation Behavior

The embed() method mutates all ligands in the set and returns self for method chaining.

Constructing a network using Konnektor

To run relative binding free energy (RBFE) calculations, it helps to first map out a network within the ligand set. The network defines the pairs of ligands on which RBFE is run. To do so, use:

# assuming ligands is a LigandSet
ligands.map_network().show_network()

This maps the network and creates a visualization similar to:

Mutation Behavior

The map_network() method mutates the LigandSet by storing the network in self.network and returns self for method chaining.
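A network here is a set of ligand pairs (edges), and each edge corresponds to one RBFE calculation. The sketch below does not use Konnektor and makes no claim about which layout map_network() produces. It only contrasts two simple layouts, fully connected and star, to show how the choice of network changes the number of pairs. The ligand labels are hypothetical.

```python
from itertools import combinations

ligand_names = ["lig-A", "lig-B", "lig-C", "lig-D"]  # hypothetical labels

# Fully connected: every unordered pair is an edge -> n * (n - 1) / 2 edges
full_edges = list(combinations(ligand_names, 2))

# Star: one hub ligand is paired with each of the others -> n - 1 edges
hub = ligand_names[0]
star_edges = [(hub, other) for other in ligand_names[1:]]

print(len(full_edges), len(star_edges))
```

The gap widens quickly with set size: for 50 ligands, a fully connected network has 1225 edges while a star has 49.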

Predicting ADMET Properties

ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties can be predicted for Ligands or LigandSets.

You can predict ADMET properties for a ligand using the admet_properties method:

# Predict ADMET properties
properties = ligand.admet_properties()

Mutation Behavior

The admet_properties() method mutates the ligand object by storing all predicted properties in ligand.properties. The method also returns a dictionary of the properties.

The method returns a dictionary containing various ADMET-related predictions:

{
    'smiles': 'Cn1c(=O)n(Cc2ccccc2)c(=O)c2c1nc(SCCO)n2Cc1ccccc1',
    'properties': {
        'logS': -4.004,  # Aqueous solubility
        'logP': 3.686,   # Partition coefficient
        'logD': 2.528,   # Distribution coefficient
        'hERG': {'probability': 0.264},  # hERG inhibition risk
        'ames': {'probability': 0.213}, # Ames mutagenicity
        'cyp': {     # Cytochrome P450 inhibition
            'probabilities': {
                'cyp1a2': 0.134,
                'cyp2c9': 0.744,
                'cyp2c19': 0.853,
                'cyp2d6': 0.0252,
                'cyp3a4': 0.4718
            }
        },
        'pains': {    # PAINS (Pan Assay Interference Compounds)
            'has_pains': None,
            'pains_fragments': []
        }
    }
}

The predicted properties are automatically stored in the ligand's properties dictionary and can be accessed later using the get_property method:

# Access a specific property
logP = ligand.get_property('logP')

Property Storage

All predicted properties are automatically stored in the ligand's properties dictionary and can be accessed at any time using the get_property method.
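Because the returned value is an ordinary nested dictionary, it can be inspected with standard Python. The sketch below reuses an abridged copy of the example result shown earlier and lists the CYP isoforms whose predicted inhibition probability exceeds a cutoff. The 0.5 threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
# Abridged copy of the example result dictionary
result = {
    "smiles": "Cn1c(=O)n(Cc2ccccc2)c(=O)c2c1nc(SCCO)n2Cc1ccccc1",
    "properties": {
        "logP": 3.686,
        "hERG": {"probability": 0.264},
        "cyp": {
            "probabilities": {
                "cyp1a2": 0.134,
                "cyp2c9": 0.744,
                "cyp2c19": 0.853,
                "cyp2d6": 0.0252,
                "cyp3a4": 0.4718,
            }
        },
    },
}

THRESHOLD = 0.5  # arbitrary cutoff chosen for illustration

props = result["properties"]
flagged = sorted(
    name for name, p in props["cyp"]["probabilities"].items() if p > THRESHOLD
)
print(f"logP = {props['logP']}")
print(f"CYP isoforms above {THRESHOLD}: {flagged}")
```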

You can predict ADMET properties for all ligands in a LigandSet using the admet_properties method. This will call the prediction for each ligand and display a progress bar using tqdm:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_csv(
    file_path=DATA_DIR / "ligands" / "ligands.csv",
    smiles_column="SMILES"
)

ligands.admet_properties()  

Mutation Behavior

The admet_properties() method mutates all ligands in the set by storing predicted properties in each ligand's .properties attribute.

The properties are stored in each ligand's .properties attribute for later access.

To view ADMET properties of all ligands in the LigandSet, simply view the LigandSet as a dataframe using:

ligands

or, optionally, convert to a DataFrame for further analysis:

ligands.to_dataframe()

Random Sampling

You can randomly sample ligands from a LigandSet using the random_sample method:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_sdf(DATA_DIR / "ligands" / "ligands-brd-all.sdf")

# Sample 5 random ligands
sample = ligands.random_sample(5)

Creates New Object

The random_sample() method creates a new LigandSet containing copies of the sampled ligands. The original LigandSet is not modified.
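The copy semantics described above can be illustrated with the standard library. This is a sketch of the general pattern (sample, then deep-copy), not the library's implementation; dictionaries stand in for ligands.

```python
import copy
import random

originals = [{"name": f"lig-{i}"} for i in range(8)]  # stand-ins for ligands

rng = random.Random(0)  # seeded so the draw is reproducible
sample = [copy.deepcopy(item) for item in rng.sample(originals, 5)]

sample[0]["name"] = "edited"  # changing a copy...
print(len(sample), len(originals))
print(any(item["name"] == "edited" for item in originals))  # ...leaves originals alone
```

Because the sampled items are copies, later in-place operations on the sample do not affect the set it was drawn from.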

Maximum Common Substructure

The Maximum Common Substructure (MCS) for a LigandSet can be computed as follows:

from deeporigin.drug_discovery import BRD_DATA_DIR, LigandSet

ligands = LigandSet.from_dir(BRD_DATA_DIR)
mcs_smarts = ligands.mcs()  # Returns a SMARTS string

Returns New Data

The mcs() method returns a SMARTS string representing the maximum common substructure. It does not mutate the LigandSet or its ligands.

Expected Output

Computing RMSD

You can compute pairwise RMSD (Root Mean Square Deviation) between all ligands in a LigandSet:

from deeporigin.drug_discovery import LigandSet

ligands = LigandSet.from_sdf("docking_results.sdf")
rmsd_matrix = ligands.compute_rmsd()  # Returns a numpy array

Returns New Data

The compute_rmsd() method returns a numpy array containing pairwise RMSD values. It does not mutate the LigandSet or its ligands.
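For reference, the RMSD between two poses with N atoms is the square root of the mean squared distance between corresponding atoms. The sketch below computes a pairwise matrix under simplifying assumptions: every pose has the same atoms in the same order, and no alignment or symmetry correction is applied. It uses plain lists to stay dependency-free and is not the library's implementation; the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) atom positions."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both poses need the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def pairwise_rmsd(poses):
    """Symmetric matrix (list of lists) of RMSD values with a zero diagonal."""
    n = len(poses)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(poses[i], poses[j])
    return matrix


# Three made-up two-atom poses
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # pose 0 shifted by 1 along z
    [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)],  # pose 0 shifted by 2 along y
]
for row in pairwise_rmsd(poses):
    print(row)
```

A rigid shift of every atom by a distance d gives an RMSD of exactly d, which makes a handy sanity check when reading a matrix like this.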

Plotting Ligands

You can create scatter plots of ligands using their properties:

from deeporigin.drug_discovery import LigandSet

ligands = LigandSet.from_sdf("docking_results.sdf")
ligands.plot(
    x="POSE SCORE",
    y="Binding Energy",
    x_label="Pose Score",
    y_label="Binding Energy (kcal/mol)"
)

Visualization Only

The plot() method creates a visualization and optionally saves it to a file. It does not mutate the LigandSet or its ligands.

Constraints

Ligands in a LigandSet can be aligned to a reference ligand using:

from deeporigin.drug_discovery import BRD_DATA_DIR, LigandSet

ligands = LigandSet.from_dir(BRD_DATA_DIR)
constraints = ligands.compute_constraints(reference=ligands[1])

Returns New Data

The compute_constraints() method returns a list of constraint dictionaries. It does not mutate the LigandSet or its ligands.

Protonation

You can protonate ligands at a specific pH. This is useful for preparing ligands for molecular dynamics simulations or other pH-dependent calculations.

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("c1ccccc1")
ligand.protonate(ph=7.4)  # Mutates the ligand in place

Mutation Behavior

The protonate() method mutates the ligand object by updating self.mol with the protonated structure. Only the most abundant species at the specified pH is retained.

from deeporigin.drug_discovery import LigandSet

ligands = LigandSet.from_smiles(["c1ccccc1", "CCO"])
ligands.protonate(ph=7.4)  # Mutates all ligands in place

Mutation Behavior

The protonate() method mutates all ligands in the set and returns self for method chaining.
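To see why pH determines the most abundant species, consider a single titratable site. The Henderson-Hasselbalch equation gives the fraction of molecules in the protonated form as 1 / (1 + 10^(pH - pKa)). The sketch below evaluates that expression; it is a single-site simplification and does not reproduce how protonate() determines the species. The pKa values are illustrative, not measured data for any particular ligand.

```python
def fraction_protonated(ph: float, pka: float) -> float:
    """Fraction of a single titratable site in its protonated form.

    From the Henderson-Hasselbalch equation:
    [protonated] / [deprotonated] = 10 ** (pKa - pH)
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative pKa values: one acid-like site, one amine-like site
for pka in (4.5, 10.0):
    frac = fraction_protonated(ph=7.4, pka=pka)
    dominant = "protonated" if frac > 0.5 else "deprotonated"
    print(f"pKa {pka:>4}: {frac:.3f} protonated -> {dominant} form dominates")
```

At pH 7.4, a site with a pKa well below 7.4 is mostly deprotonated and a site with a pKa well above 7.4 is mostly protonated, which is the species a protonation step would be expected to keep.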

Adding Hydrogens

You can add hydrogens to ligands, which is often necessary before generating 3D coordinates or performing certain calculations.

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("c1ccccc1")
ligand.add_hydrogens()  # Mutates the ligand in place

Mutation Behavior

The add_hydrogens() method mutates the ligand object by adding hydrogens to self.mol.

from deeporigin.drug_discovery import LigandSet

ligands = LigandSet.from_smiles(["c1ccccc1", "CCO"])
ligands.add_hydrogens()  # Mutates all ligands in place

Mutation Behavior

The add_hydrogens() method mutates all ligands in the set by calling add_hydrogens() on each ligand.

Exporting ligands

To SDF files

To write a Ligand to an SDF file, use:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("NCCc1c[nH]c2ccc(O)cc12")
ligand.to_sdf()

To write a LigandSet to an SDF file, use:

from deeporigin.drug_discovery import LigandSet

smiles = {
    "C/C=C/Cn1cc(-c2cccc(C(=O)N(C)C)c2)c2cc[nH]c2c1=O",
    "C=CCCn1cc(-c2cccc(C(=O)N(C)C)c2)c2cc[nH]c2c1=O",
}

ligands = LigandSet.from_smiles(smiles)
ligands.to_sdf()

To mol files

To write a ligand to a mol file, use:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("NCCc1c[nH]c2ccc(O)cc12")
ligand.to_mol()

To PDB files

To write a ligand to a PDB file, use:

from deeporigin.drug_discovery import Ligand

ligand = Ligand.from_smiles("NCCc1c[nH]c2ccc(O)cc12")
ligand.to_pdb()

To Pandas DataFrames

To convert a LigandSet to a Pandas DataFrame, use:

from deeporigin.drug_discovery import LigandSet, DATA_DIR

ligands = LigandSet.from_csv(
    file_path=DATA_DIR / "ligands" / "ligands.csv",
    smiles_column="SMILES"  # Optional, defaults to "smiles"
)
df = ligands.to_dataframe()

To CSV files

To write a LigandSet to a CSV file, use method chaining:

# we're using pandas' native to_csv method here

ligands.to_dataframe().to_csv("temp.csv")