deeporigin.drug_discovery.chemistry

Contains functions for working with SDF files.

Attributes

KeyType module-attribute

KeyType = Literal['smiles', 'inchi']

Functions

canonicalize_smiles

canonicalize_smiles(smiles: str) -> str

Canonicalize a SMILES string.

Parameters:

Name    Type  Description     Default
smiles  str   SMILES string.  required

Returns:

Name  Type  Description
str   str   Canonicalized SMILES string.

count_molecules_in_sdf_file

count_molecules_in_sdf_file(sdf_file: str | Path) -> int

Count the number of valid (sanitizable) molecules in an SDF file using RDKit, while suppressing RDKit's error logging for sanitization issues.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

Returns:

Name  Type  Description
int   int   The number of molecules successfully read from the SDF file.

full_graph_map

full_graph_map(
    mol_a: Mol, mol_b: Mol, ignore_hs: bool = True
) -> Optional[list[Tuple[int, int]]]

Return atom map for identical graphs (isomorphic).

group_by_prop_smiles_to_multiconf

group_by_prop_smiles_to_multiconf(
    sdf_path: str,
    *,
    smiles_prop_name: str = "SMILES",
    keep_hs: bool = False,
    align_conformers: bool = True,
    skip_no_coords: bool = True
) -> dict[str, Mol]

Read an SDF file that contains many poses (possibly for multiple ligands) and group them by an SDF property (default: "SMILES"). For each unique property value, return one RDKit Mol holding all of its poses as conformers.

Returns:

dict[str, Chem.Mol]: {prop_smiles_value -> Mol with N conformers}
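
The grouping step described above can be pictured with plain Python containers. This is a minimal sketch and not the library's implementation: `group_poses_by_key` is a hypothetical helper, and the `(key, coords)` tuples stand in for SDF records and their property value. The real function returns RDKit Mol objects with conformers, not lists of coordinates.

```python
from collections import defaultdict

# One pose = a list of (x, y, z) atom coordinates (hypothetical stand-in for a conformer).
Coords = list[tuple[float, float, float]]


def group_poses_by_key(poses: list[tuple[str, Coords]]) -> dict[str, list[Coords]]:
    """Group poses by their key, keeping first-seen key order and pose order."""
    grouped: dict[str, list[Coords]] = defaultdict(list)
    for key, coords in poses:
        grouped[key].append(coords)
    return dict(grouped)


poses = [
    ("CCO", [(0.0, 0.0, 0.0)]),
    ("CCN", [(1.0, 0.0, 0.0)]),
    ("CCO", [(0.0, 1.0, 0.0)]),
]
grouped = group_poses_by_key(poses)
# "CCO" now holds two poses and "CCN" holds one.
```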

mcs_map

mcs_map(
    mol_a: Mol,
    mol_b: Mol,
    ignore_hs: bool = True,
    ring_matches_ring_only: bool = True,
    complete_rings_only: bool = True,
    match_valences: bool = True,
    match_chiral_tag: bool = False,
    timeout: int = 10,
) -> Optional[list[Tuple[int, int]]]

Return an atom map for the maximum common substructure (subset comparison).

pairwise_pose_rmsd

pairwise_pose_rmsd(
    mols: Sequence[Mol],
    *,
    conf_id: int = 0,
    ignore_hs: bool = True,
    use_mcs_if_needed: bool = True,
    fill_value_for_unmapped: float = nan
)

Return an N×N matrix of pose-sensitive RMSDs (no alignment). If two molecules cannot be mapped onto each other, the corresponding entry is fill_value_for_unmapped (default: NaN).
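
How such a matrix can be assembled is sketched below in plain Python. It is an illustration only: `pairwise_matrix` and `toy_distance` are hypothetical names, the zero diagonal and the symmetry are assumptions of this sketch, and the library's actual return type is not stated in the signature.

```python
import math
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def pairwise_matrix(
    items: Sequence[T],
    distance: Callable[[T, T], Optional[float]],
    fill_value: float = math.nan,
) -> list[list[float]]:
    """Build a symmetric N x N matrix; pairs whose distance is None get fill_value."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # assumption: self-distance is 0
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            value = fill_value if d is None else d
            matrix[i][j] = matrix[j][i] = value
    return matrix


def toy_distance(a: float, b: float) -> Optional[float]:
    # Hypothetical stand-in for a pose RMSD: None means "no mapping found".
    if a < 0 or b < 0:
        return None
    return abs(a - b)


matrix = pairwise_matrix([1.0, 4.0, -1.0], toy_distance)
```

Because NaN never compares equal to itself, check unmapped entries with `math.isnan` rather than `==`.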

pose_rmsd

pose_rmsd(
    mol_a: Mol,
    mol_b: Mol,
    *,
    conf_id_a: int = 0,
    conf_id_b: int = 0,
    ignore_hs: bool = True,
    use_mcs_if_needed: bool = True
) -> Optional[float]

Pose-sensitive RMSD with NO alignment and NO centering, so the value is high when the same structure is translated or rotated. Tries full-graph mapping first; if that fails and use_mcs_if_needed=True, falls back to MCS subset mapping. Returns None if no mapping is found.
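
The fallback order described above can be sketched as follows. This is a hedged illustration of the control flow only: `select_atom_map` and the two zero-argument mapper callables are hypothetical and stand in for calls such as full_graph_map and mcs_map.

```python
from typing import Callable, Optional

AtomMap = list[tuple[int, int]]
Mapper = Callable[[], Optional[AtomMap]]


def select_atom_map(
    full_graph: Mapper,
    mcs: Mapper,
    use_mcs_if_needed: bool = True,
) -> Optional[AtomMap]:
    """Prefer the full-graph map; fall back to the MCS map only when allowed."""
    atom_map = full_graph()
    if atom_map is None and use_mcs_if_needed:
        atom_map = mcs()
    return atom_map  # None signals that no mapping was found


# The full-graph mapper fails here, so the MCS mapper supplies the result.
fallback = select_atom_map(lambda: None, lambda: [(0, 0), (1, 2)])
```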

raw_rmsd_from_map

raw_rmsd_from_map(
    mol_a: Mol,
    mol_b: Mol,
    atom_map: list[Tuple[int, int]],
    conf_id_a: int = 0,
    conf_id_b: int = 0,
) -> float

Compute RMSD directly from coordinates on a given atom mapping. NO alignment, NO centering.
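
The calculation can be illustrated without RDKit. This is a minimal sketch under stated assumptions: `raw_rmsd` is a hypothetical helper that takes plain coordinate lists in place of Mol conformers, and it treats each pair in `atom_map` as (index in A, index in B).

```python
import math

Point = tuple[float, float, float]


def raw_rmsd(
    coords_a: list[Point],
    coords_b: list[Point],
    atom_map: list[tuple[int, int]],
) -> float:
    """RMSD over mapped atom pairs, using coordinates exactly as given."""
    if not atom_map:
        raise ValueError("atom_map must contain at least one pair")
    total = 0.0
    for i, j in atom_map:
        # Squared Euclidean distance between atom i of A and atom j of B.
        total += sum((a - b) ** 2 for a, b in zip(coords_a[i], coords_b[j]))
    return math.sqrt(total / len(atom_map))


pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(x + 3.0, y, z) for x, y, z in pose]  # same geometry, moved 3 units along x
identity_map = [(0, 0), (1, 1)]

same = raw_rmsd(pose, pose, identity_map)
moved = raw_rmsd(pose, shifted, identity_map)
```

With no alignment step, the rigid 3-unit translation shows up in full: `moved` is 3.0 even though the two poses have identical internal geometry.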

read_property_values

read_property_values(sdf_file: str | Path, key: str)

Given an SDF file containing more than one molecule, return the value of the property named by key for each molecule.

Parameters:

Name      Type        Description                       Default
sdf_file  str | Path  Path to the SDF file.             required
key       str         The key of the property to read.  required
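
For orientation, SDF properties are stored as data items: a header line such as `>  <key>` followed by the value and a blank line, with `$$$$` closing each record. The sketch below parses that layout with the standard library only. It is not the library's implementation: `read_property_values_text` is a hypothetical helper, the molblocks in `sample` are empty placeholders, and returning None for a missing property is an assumption of this sketch.

```python
import re
from typing import Optional


def read_property_values_text(sdf_text: str, key: str) -> list[Optional[str]]:
    """Return one value per record; None where a record lacks the property."""
    header = re.compile(r"^>.*<" + re.escape(key) + r">")
    values: list[Optional[str]] = []
    for record in sdf_text.split("$$$$"):
        if not record.strip():
            continue  # skip the empty chunk after the final delimiter
        lines = record.splitlines()
        value: Optional[str] = None
        for idx, line in enumerate(lines):
            if header.match(line):
                # The value runs from the next line up to the first blank line.
                collected = []
                for value_line in lines[idx + 1:]:
                    if not value_line.strip():
                        break
                    collected.append(value_line)
                value = "\n".join(collected)
                break
        values.append(value)
    return values


sample = """mol_a
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <score>
-7.5

$$$$
mol_b
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <score>
-6.1

$$$$
"""

values = read_property_values_text(sample, "score")
```

Values come back as strings, so numeric properties need an explicit conversion such as `float(v)`.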

sdf_to_smiles

sdf_to_smiles(sdf_file: str | Path) -> list[str]

Extracts the SMILES strings of all valid molecules from an SDF file using RDKit.

Parameters:

Name      Type        Description            Default
sdf_file  str | Path  Path to the SDF file.  required

Returns:

Type       Description
list[str]  A list of SMILES strings for all valid molecules in the file.

smiles_to_sdf

smiles_to_sdf(smiles: str, sdf_path: str) -> None

Convert a SMILES string to an SDF file.

Parameters:

Name      Type  Description            Default
smiles    str   SMILES string.         required
sdf_path  str   Path to the SDF file.  required

split_sdf_file

split_sdf_file(
    *,
    input_sdf_path: str | Path,
    output_prefix: str = "ligand",
    output_dir: Optional[str | Path] = None,
    name_by_property: str = "_Name"
) -> list[Path]

Splits a multi-ligand SDF file into individual SDF files, optionally placing the output in a user-specified directory. Each output SDF is named using the molecule's name (if present) or a fallback prefix.

Parameters:

Name            Type                  Description                                              Default
input_sdf_path  str | Path            Path to the input SDF file containing multiple ligands.  required
output_prefix   str                   Prefix for unnamed ligands. Defaults to "ligand".        'ligand'
output_dir      Optional[str | Path]  Directory to write the output SDF files to. If None, output files are written to the same directory as input_sdf_path.  None

Returns:

Type        Description
list[Path]  A list of paths to the generated SDF files.
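
The splitting idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `split_sdf_text` is a hypothetical helper, it names each file after the record's title line (the first line of the molblock, which RDKit exposes as `_Name`), and the `{output_prefix}_{index}` fallback naming is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def split_sdf_text(sdf_text: str, output_dir: Path, output_prefix: str = "ligand") -> list[Path]:
    """Write each $$$$-terminated record to its own file and return the paths."""
    paths: list[Path] = []
    record_lines: list[str] = []
    index = 0
    for line in sdf_text.splitlines():
        record_lines.append(line)
        if line.strip() == "$$$$":
            title = record_lines[0].strip()
            # Titles are used as-is; a real tool would also sanitize unsafe characters.
            stem = title if title else f"{output_prefix}_{index}"
            path = output_dir / f"{stem}.sdf"
            path.write_text("\n".join(record_lines) + "\n")
            paths.append(path)
            record_lines = []
            index += 1
    # Trailing lines without a closing $$$$ are ignored in this sketch.
    return paths


# Two placeholder records: the first has a title, the second does not.
two_records = "aspirin\n\n\nM  END\n$$$$\n\n\n\nM  END\n$$$$\n"

with tempfile.TemporaryDirectory() as tmp:
    written = split_sdf_text(two_records, Path(tmp))
    names = [p.name for p in written]
    first_file_text = written[0].read_text()
```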