deeporigin.drug_discovery.chemistry

Contains classes and functions for working with molecules, proteins, and related files.

Defines the Ligand and Protein classes, as well as functions for reading and writing SDF files, converting between SMILES and SDF, validating data, integrating with DataFrames, and preparing visualizations. These can be used together with the drug_discovery module for tasks such as docking.

  • Ligand: Represents a small-molecule ligand, accepting either a file path (SDF) or a SMILES string. Provides a show() method to display it.
  • Protein: Represents a protein, accepting a local file path (PDB) or a PDB ID. Provides a show() method to display it.

Classes

Ligand dataclass

Class to represent a ligand (typically backed by an SDF file).

Attributes

file class-attribute instance-attribute
file: Optional[str | Path] = None
properties class-attribute instance-attribute
properties: Optional[dict] = None
smiles_string class-attribute instance-attribute
smiles_string: Optional[str] = None

Functions

from_csv classmethod
from_csv(
    *,
    file: str | Path,
    smiles_column: str,
    properties_columns: list[str] = None
) -> list[Ligand]

Create a list of ligands from a CSV file.

Parameters:

Name                Type        Description                                    Default
file                str | Path  Path to CSV file                               required
smiles_column       str         Column name containing SMILES strings          required
properties_columns  list[str]   List of column names to extract as properties  None

Returns:

Type          Description
list[Ligand]  List of Ligand objects
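
To make the row-to-ligand mapping concrete, here is a minimal standard-library sketch of what a loader with this signature might do. `LigandSketch` and `ligands_from_csv_sketch` are hypothetical names used only for illustration, and skipping rows without a SMILES string is an assumption; the library's own `Ligand.from_csv` may behave differently.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LigandSketch:
    """Hypothetical stand-in for Ligand; not the library class."""

    smiles_string: Optional[str] = None
    properties: Optional[dict] = None


def ligands_from_csv_sketch(*, file, smiles_column, properties_columns=None):
    """Build one LigandSketch per CSV row that has a non-empty SMILES cell."""
    properties_columns = properties_columns or []
    ligands = []
    with open(Path(file), newline="") as handle:
        for row in csv.DictReader(handle):
            smiles = (row.get(smiles_column) or "").strip()
            if not smiles:
                continue  # assumption: rows without a SMILES string are skipped
            # Keep only the requested columns; values stay as strings.
            properties = {name: row[name] for name in properties_columns if name in row}
            ligands.append(LigandSketch(smiles_string=smiles, properties=properties))
    return ligands
```

Keyword-only arguments mirror the documented signature, which makes it harder to confuse the SMILES column with a property column at the call site.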

from_smiles classmethod
from_smiles(smiles: str) -> Ligand

Create a ligand from a SMILES string.

show
show()

Show the ligand in a Jupyter notebook using molstar.

Protein dataclass

Class to represent a protein (typically backed by a PDB file).

Attributes

file class-attribute instance-attribute
file: Optional[str | Path] = None
name class-attribute instance-attribute
name: Optional[str] = None
pdb_id class-attribute instance-attribute
pdb_id: Optional[str] = None

Functions

show
show()

Visualize the protein in a Jupyter notebook using molstar.

Functions

canonicalize_smiles

canonicalize_smiles(smiles: str) -> str

Canonicalize a SMILES string.

Parameters:

Name    Type  Description     Default
smiles  str   SMILES string.  required

Returns:

Name  Type  Description
str   str   Canonicalized SMILES string.
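
Canonicalization matters because one molecule can be written as many different SMILES strings. The sketch below shows the idea with RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`. That `canonicalize_smiles` works this way internally is an assumption, and `canonical_smiles_sketch` is a hypothetical name.

```python
import importlib.util


def canonical_smiles_sketch(smiles: str) -> str:
    """Return RDKit's canonical SMILES for the input string."""
    from rdkit import Chem  # imported lazily so this module loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


if importlib.util.find_spec("rdkit") is not None:
    # Two spellings of ethanol should collapse to one canonical string.
    assert canonical_smiles_sketch("OCC") == canonical_smiles_sketch("CCO")
```

Comparing canonical forms rather than raw strings is what makes SMILES usable as a lookup key, for example when matching molecules against a list of SMILES strings to keep.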

count_molecules_in_sdf_file

count_molecules_in_sdf_file(sdf_file: str | Path) -> int

Count the number of valid (sanitizable) molecules in an SDF file using RDKit, while suppressing RDKit's error logging for sanitization issues.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

Returns:

Name  Type  Description
int   int   The number of molecules successfully read in the SDF file.
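
For a quick sanity check without RDKit, records in an SDF file can be counted by their `$$$$` terminator lines. This is a different, cruder technique than the one described above: it counts records, not sanitizable molecules, so the two numbers can differ when a file contains invalid structures. `count_sdf_records_sketch` is a hypothetical helper.

```python
from pathlib import Path


def count_sdf_records_sketch(sdf_file) -> int:
    """Count records in an SDF file by counting '$$$$' terminator lines."""
    count = 0
    with open(Path(sdf_file)) as handle:
        for line in handle:
            if line.strip() == "$$$$":
                count += 1
    return count
```

If this count is larger than the value returned by `count_molecules_in_sdf_file`, the difference suggests that some records failed to parse or sanitize.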

download_protein

download_protein(pdb_id: str, save_dir: str = '.') -> str

Downloads a PDB structure by its PDB ID from RCSB and saves it to the specified directory.

Parameters:

Name      Type  Description                                 Default
pdb_id    str   PDB ID of the protein.                      required
save_dir  str   Directory to save the downloaded PDB file.  '.'

Returns:

Name  Type  Description
str   str   Path to the downloaded PDB file.

Raises:

Type       Description
Exception  If the download fails.
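
As a rough illustration of this kind of download, the sketch below fetches a structure from the RCSB file server with the standard library. The helper names, the output file name, and the error message are assumptions for illustration; the library's own implementation may differ.

```python
import urllib.request
from pathlib import Path


def rcsb_pdb_url(pdb_id: str) -> str:
    """Build the RCSB download URL for a PDB-format entry."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def download_pdb_sketch(pdb_id: str, save_dir: str = ".") -> str:
    """Download a PDB file into save_dir and return the local path."""
    target = Path(save_dir) / f"{pdb_id}.pdb"  # assumption: file is named after the ID
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(rcsb_pdb_url(pdb_id), str(target))
    except Exception as error:
        # Re-raise with context, mirroring the documented "raises Exception" contract.
        raise Exception(f"Failed to download {pdb_id}: {error}") from error
    return str(target)
```

Calling `download_pdb_sketch("1abc", "structures")` would need network access; the URL builder can be checked offline.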

filter_sdf_by_smiles

filter_sdf_by_smiles(
    *,
    input_sdf_file: str | Path,
    output_sdf_file: str | Path,
    keep_only_smiles: list[str] | Series
)

Filters an SDF file by SMILES string: reads the molecules in input_sdf_file and writes to output_sdf_file only those whose SMILES strings appear in keep_only_smiles.

Parameters:

Name              Type                Description                                Default
input_sdf_file    str | Path          Path to the SDF file.                      required
output_sdf_file   str | Path          Path to the output SDF file.               required
keep_only_smiles  list[str] | Series  List or Series of SMILES strings to keep.  required

get_properties_in_sdf_file

get_properties_in_sdf_file(sdf_file: str | Path) -> list

Returns a list of all user-defined properties in an SDF file.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

Returns:

Name  Type  Description
list  list  A list of the names of all user-defined properties in the SDF file.
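
In the SDF format, user-defined properties are stored as data items whose header lines look like `> <NAME>`. The following standard-library sketch collects those names. The helper name is hypothetical, and returning the names de-duplicated in first-seen order is an assumption about what is useful, not a statement about the library's output.

```python
import re
from pathlib import Path

# A data-item header starts with '>' and carries the field name in angle brackets.
_DATA_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def sdf_property_names_sketch(sdf_file) -> list:
    """Return data-item names found in an SDF file, in first-seen order."""
    names = []
    with open(Path(sdf_file)) as handle:
        for line in handle:
            match = _DATA_HEADER.match(line)
            if match and match.group(1) not in names:
                names.append(match.group(1))
    return names
```

A line-based scan like this can misread a property value that itself begins with `>`, so a chemistry toolkit remains the safer choice for untrusted files.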

ligands_to_dataframe

ligands_to_dataframe(ligands: list[Ligand])

Convert a list of ligands to a pandas DataFrame.

merge_sdf_files

merge_sdf_files(
    sdf_file_list: list[str],
    output_path: Optional[str] = None,
) -> str

Merge a list of SDF files into a single SDF file.

Parameters:

Name           Type         Description                  Default
sdf_file_list  list of str  List of paths to SDF files.  required

Returns:

Name  Type  Description
str   str   Path to the merged SDF file.
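
Because every SDF record ends with a `$$$$` line, merging is essentially concatenation with care taken over record terminators. The sketch below illustrates this with the standard library. The function name is hypothetical, and writing to a temporary file when `output_path` is None is an assumption, since the default behavior is not documented here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def merge_sdf_files_sketch(sdf_file_list, output_path: Optional[str] = None) -> str:
    """Concatenate SDF files, making sure each one ends with a '$$$$' line."""
    if output_path is None:
        # Assumption: fall back to a fresh temporary file.
        handle = tempfile.NamedTemporaryFile(suffix=".sdf", delete=False)
        handle.close()
        output_path = handle.name
    with open(output_path, "w") as out:
        for sdf_file in sdf_file_list:
            text = Path(sdf_file).read_text()
            if not text.strip():
                continue  # skip empty inputs
            out.write(text if text.endswith("\n") else text + "\n")
            if text.rstrip().splitlines()[-1].strip() != "$$$$":
                out.write("$$$$\n")  # add the missing record terminator
    return str(output_path)
```

Without the terminator check, a file whose last record lacks `$$$$` would run into the first record of the next file.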

read_molecules_in_sdf_file

read_molecules_in_sdf_file(
    sdf_file: str | Path,
) -> list[dict]

Reads an SDF file containing one or more molecules and, for each molecule:

  • Extracts the SMILES string
  • Extracts all user-defined properties

Returns:

Type        Description
list[dict]  A list of dictionaries, where each dictionary has the keys "smiles_string" (str) and "properties" (dict).

read_property_values

read_property_values(sdf_file: str | Path, key: str)

Given an SDF file with more than one molecule, return the value of the property named by key for each molecule.

Parameters:

Name      Type        Description                       Default
sdf_file  str | Path  Path to the SDF file.             required
key       str         The key of the property to read.  required
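
To illustrate per-molecule property lookup, the sketch below walks an SDF file record by record and collects the value stored under one key. It is a standard-library approximation with a hypothetical name. Returning values as strings, and None for records that lack the key, are assumptions; the library may convert types or handle missing values differently.

```python
import re
from pathlib import Path


def sdf_property_values_sketch(sdf_file, key: str) -> list:
    """Return one value per record for the data item named key (None if absent)."""
    header = re.compile(r"^>.*?<" + re.escape(key) + r">")
    values = []
    current = None      # value found in the record being read
    collecting = False  # True while reading the lines of the value
    buffer = []
    with open(Path(sdf_file)) as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.strip() == "$$$$":  # end of record
                if collecting:
                    current = "\n".join(buffer)
                values.append(current)
                current, collecting, buffer = None, False, []
            elif collecting:
                if line.strip() == "":  # a blank line ends the value
                    current = "\n".join(buffer)
                    collecting = False
                else:
                    buffer.append(line)
            elif header.match(line):
                collecting = True
                buffer = []
    return values
```

Keeping one entry per record, including None, keeps the result aligned with the order of molecules in the file.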

read_sdf_properties

read_sdf_properties(sdf_file: str | Path) -> dict

Reads all user-defined properties from an SDF file (single molecule) and returns them as a dictionary.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

sdf_to_smiles

sdf_to_smiles(sdf_file: str | Path) -> list[str]

Extracts the SMILES strings of all valid molecules from an SDF file using RDKit.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

Returns:

Type       Description
list[str]  A list of SMILES strings for all valid molecules in the file.

show_ligands

show_ligands(ligands: list[Ligand])

Show a list of ligands in a dataframe. This function visualizes the ligands using core-aligned 2D visualizations.

Parameters:

Name     Type          Description      Default
ligands  list[Ligand]  list of ligands  required

show_molecules_in_sdf_file

show_molecules_in_sdf_file(sdf_file: str | Path)

Show the molecules in an SDF file in a Jupyter notebook using molstar.

show_molecules_in_sdf_files

show_molecules_in_sdf_files(sdf_files: list[str])

Show the molecules in multiple SDF files in a Jupyter notebook using molstar.

smiles_list_to_base64_png_list

smiles_list_to_base64_png_list(
    smiles_list: list[str],
    *,
    size: Tuple[int, int] = (300, 100),
    scale_factor: int = 2,
    reference_smiles: Optional[str] = None
) -> list[str]

Convert a list of SMILES strings to a list of base64-encoded PNG tags.

This aligns images so that they have consistent core orientation.

Parameters:

Name              Type             Description                                                                                     Default
smiles_list       list[str]        List of SMILES strings.                                                                         required
size              Tuple[int, int]  (width, height) of the final rendered image in pixels (CSS downscaled).                         (300, 100)
scale_factor      int              Factor to generate higher-resolution images internally.                                         2
reference_smiles  Optional[str]    If provided, all molecules will be oriented to match the 2D layout of this reference molecule.  None

smiles_to_base64_png

smiles_to_base64_png(
    smiles: str, *, size=(300, 100), scale_factor: int = 2
) -> str

Convert a SMILES string to an inline base64 tag. Use this if you want to convert a single molecule into an image. If you want to convert a set of SMILES strings (corresponding to a set of related molecules) to images, use smiles_list_to_base64_png_list.

Parameters:

Name          Type             Description                                                              Default
smiles        str              SMILES string.                                                           required
size          Tuple[int, int]  (width, height) of the final rendered image in pixels (CSS downscaled).  (300, 100)
scale_factor  int              Factor to generate higher-resolution images internally.                  2
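
The interplay between size and scale_factor can be sketched as follows: the image is rendered at a larger pixel size and then displayed at the requested size through CSS. The helpers below are hypothetical and take already-rendered PNG bytes as input. The exact tag produced by the library is not documented here, so the `<img>` markup is an assumption.

```python
import base64


def render_size_sketch(size=(300, 100), scale_factor=2):
    """Pixel size to render at so the image stays sharp after CSS downscaling."""
    width, height = size
    return (width * scale_factor, height * scale_factor)


def png_bytes_to_img_tag_sketch(png_bytes: bytes, *, size=(300, 100)) -> str:
    """Wrap PNG bytes in an inline <img> tag displayed at the requested size."""
    width, height = size
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return (
        f'<img src="data:image/png;base64,{encoded}" '
        f'style="width:{width}px;height:{height}px;">'
    )
```

With the defaults, an image would be rendered at 600 x 200 pixels and displayed at 300 x 100, which keeps the drawing crisp on high-density screens.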

smiles_to_sdf

smiles_to_sdf(smiles: str, sdf_path: str) -> None

Convert a SMILES string to an SDF file.

Parameters:

Name      Type  Description           Default
smiles    str   SMILES string         required
sdf_path  str   Path to the SDF file  required

split_sdf_file

split_sdf_file(
    *,
    input_sdf_path: str | Path,
    output_prefix: str = "ligand",
    output_dir: Optional[str | Path] = None,
    name_by_property: str = "_Name"
) -> list[Path]

Splits a multi-ligand SDF file into individual SDF files, optionally placing the output in a user-specified directory. Each output SDF is named using the molecule's name (if present) or a fallback prefix.

Parameters:

Name            Type                  Description                                                                                                             Default
input_sdf_path  str | Path            Path to the input SDF file containing multiple ligands.                                                                 required
output_prefix   str                   Prefix for unnamed ligands. Defaults to "ligand".                                                                       'ligand'
output_dir      Optional[str | Path]  Directory to write the output SDF files to. If None, output files are written to the same directory as input_sdf_path.  None

Returns:

Type        Description
list[Path]  A list of paths to the generated SDF files.
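
The splitting logic can be sketched with the standard library: cut the file at each `$$$$` line and name each piece after the record's title line, which is where the molecule name is stored. The function name and the `<prefix>_<index>` fallback pattern are assumptions for illustration. The sketch does not sanitize names for the file system or guard against duplicate names, and it ignores the name_by_property option.

```python
from pathlib import Path


def split_sdf_file_sketch(*, input_sdf_path, output_prefix="ligand", output_dir=None):
    """Write each record of a multi-record SDF file to its own file."""
    input_path = Path(input_sdf_path)
    out_dir = Path(output_dir) if output_dir is not None else input_path.parent
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    record = []
    for line in input_path.read_text().splitlines():
        record.append(line)
        if line.strip() == "$$$$":  # end of the current record
            title = record[0].strip()  # first line of a record holds the name
            # Assumption: unnamed records fall back to "<prefix>_<index>".
            name = title if title else f"{output_prefix}_{len(paths) + 1}"
            path = out_dir / f"{name}.sdf"
            path.write_text("\n".join(record) + "\n")
            paths.append(path)
            record = []
    return paths
```

Returning the list of written paths, as the documented function does, lets callers pass each file straight to a per-ligand step such as docking.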