# Datamol Cheminformatics Skill
## Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
**Key capabilities**:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
## Installation and Setup
Guide users to install datamol:
```bash
uv pip install datamol
```
**Import convention**:
```python
import datamol as dm
```
## Core Workflows
### 1. Basic Molecule Handling
**Creating molecules from SMILES**:
```python
import datamol as dm
# Single molecule
mol = dm.to_mol("CCO") # Ethanol
# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
# Error handling
mol = dm.to_mol("invalid_smiles") # Returns None
if mol is None:
    print("Failed to parse SMILES")
```
**Converting molecules to SMILES**:
```python
# Canonical SMILES
smiles = dm.to_smiles(mol)
# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)
# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```
**Standardization and sanitization** (always recommend for user-provided molecules):
```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)
# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)
# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```
### 2. Reading and Writing Molecular Files
Refer to `references/io_module.md` for comprehensive I/O documentation.
**Reading files**:
```python
# SDF files (most common in chemistry)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
# SMILES files
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
# Universal reader (auto-detects format)
df = dm.open_df("file.sdf") # Works with .sdf, .csv, .xlsx, .parquet, .json
```
**Writing files**:
```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
# Save as SMILES file
dm.to_smi(mols, "output.smi")
# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
```
**Remote file support** (S3, GCS, HTTP):
```python
# Read from cloud storage
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")
# Write to cloud storage
dm.to_sdf(mols, "s3://bucket/output.sdf")
```
### 3. Molecular Descriptors and Properties
Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
**Computing descriptors for a single molecule**:
```python
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
# 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
```
**Batch descriptor computation** (recommended for datasets):
```python
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,      # Use all CPU cores
    progress=True   # Show progress bar
)
```
**Specific descriptors**:
```python
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
```
**Drug-likeness filtering (Lipinski's Rule of Five)**:
```python
# Filter compounds
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (
        desc['mw'] <= 500 and
        desc['logp'] <= 5 and
        desc['hbd'] <= 5 and
        desc['hba'] <= 10
    )

druglike_mols = [mol for mol in mols if is_druglike(mol)]
```
### 4. Chemical Reactions
Refer to `references/reactions_data.md` for detailed reaction documentation.
**Applying a reaction** (note: the reactant half of the SMARTS below is a reconstruction; only the product template survived in the original text):
```python
from rdkit.Chem import rdChemReactions
# Reaction SMARTS: carboxylic acid -> acid chloride (reactant side reconstructed)
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# Apply to molecule
reactant = dm.to_mol("CC(=O)O")  # Acetic acid
product = dm.reactions.apply_reaction(
    rxn,
    (reactant,),
    sanitize=True
)
# Convert to SMILES
product_smiles = dm.to_smiles(product)
```
**Batch reaction application**:
```python
# Apply reaction to library
products = []
for mol in reactant_mols:
    try:
        prod = dm.reactions.apply_reaction(rxn, (mol,))
        if prod is not None:
            products.append(prod)
    except Exception as e:
        print(f"Reaction failed: {e}")
```
## Parallelization
Datamol includes built-in parallelization for many operations. Use `n_jobs` parameter:
- `n_jobs=1`: Sequential (no parallelization)
- `n_jobs=-1`: Use all available CPU cores
- `n_jobs=4`: Use 4 cores
**Functions supporting parallelization**:
- `dm.read_sdf(..., n_jobs=-1)`
- `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
- `dm.cluster_mols(..., n_jobs=-1)`
- `dm.pdist(..., n_jobs=-1)`
- `dm.conformers.sasa(..., n_jobs=-1)`
**Progress bars**: Many batch operations support `progress=True` parameter.
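Under the hood, datamol delegates this fan-out (e.g. via `dm.parallelized`) to joblib. As a minimal sketch of the same `n_jobs` convention, with no chemistry dependencies, the `parallel_map` helper below is hypothetical and not a datamol API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, n_jobs=1):
    # n_jobs follows the datamol/joblib convention:
    #   1 -> sequential, -1 -> one worker per CPU core, N -> N workers
    if n_jobs == 1:
        return [fn(x) for x in items]
    workers = os.cpu_count() if n_jobs == -1 else n_jobs
    # Threads keep the sketch simple; datamol uses processes for CPU-bound work
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

# Toy usage: compute string lengths "in parallel"
lengths = parallel_map(len, ["CCO", "c1ccccc1", "CC(=O)O"], n_jobs=-1)
print(lengths)  # [3, 8, 7]
```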
## Common Workflows and Patterns
### Complete Pipeline: Data Loading → Filtering → Analysis
```python
import datamol as dm
import pandas as pd
# 1. Load molecules
df = dm.read_sdf("compounds.sdf")
# 2. Standardize
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()] # Remove failed molecules
# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(),
    n_jobs=-1,
    progress=True
)
# 4. Filter by drug-likeness
druglike = (
    (desc_df['mw'] <= 500) &
    (desc_df['logp'] <= 5) &
    (desc_df['hbd'] <= 5) &
    (desc_df['hba'] <= 10)
)
filtered_df = df[druglike.values]
```
### SAR Analysis by Scaffold
```python
# Group molecules by Murcko scaffold (assumes an 'activity' column)
filtered_df['scaffold'] = filtered_df['mol'].apply(
    lambda m: dm.to_smiles(dm.to_scaffold_murcko(m))
)
for scaffold, group in filtered_df.groupby('scaffold'):
    if len(group) >= 3:  # Need multiple examples
        print(f"\nScaffold: {scaffold}")
        print(f"Count: {len(group)}")
        print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
        # Visualize with activities as legends
        dm.viz.to_image(
            group['mol'].tolist(),
            legends=[f"Activity: {act:.2f}" for act in group['activity']],
            align=True  # Align by common substructure
        )
```
### Virtual Screening Pipeline
```python
import numpy as np
# 1. Fingerprints are generated internally by dm.cdist
# 2. Calculate Tanimoto distances between query actives and library
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
# 3. Find closest matches (min distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances # Convert distance to similarity
# 4. Rank and select top hits
top_indices = np.argsort(similarities)[::-1][:100] # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]
# 5. Visualize hits
dm.viz.to_image(
    top_hits[:20],
    legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
    outfile="screening_hits.png"
)
```
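The score in step 3 is one minus the Tanimoto distance returned by `dm.cdist`. For reference, Tanimoto (Jaccard) similarity over fingerprint on-bit sets reduces to a few lines of plain Python; the bit indices below are made up for illustration:

```python
def tanimoto_similarity(fp_a, fp_b):
    # Tanimoto = |intersection| / |union| of on-bits
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # Convention: two empty fingerprints share nothing
    return len(a & b) / len(a | b)

# Hypothetical fingerprints sharing 2 of 4 distinct on-bits
sim = tanimoto_similarity({1, 5, 9}, {1, 9, 12})
print(sim)  # 0.5
```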
## Reference Documentation
For detailed API documentation, consult these reference files:
- **`references/core_api.md`**: Core namespace functions (conversions, standardization, fingerprints, clustering)
- **`references/io_module.md`**: File I/O operations (read/write SDF, CSV, Excel, remote files)
- **`references/conformers_module.md`**: 3D conformer generation, clustering, SASA calculations
- **`references/descriptors_viz.md`**: Molecular descriptors and visualization functions
- **`references/fragments_scaffolds.md`**: Scaffold extraction, BRICS/RECAP fragmentation
- **`references/reactions_data.md`**: Chemical reactions and toy datasets
## Best Practices
1. **Always standardize molecules** from external sources:
```python
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
```
2. **Check for None values** after molecule parsing:
```python
mol = dm.to_mol(smiles)
if mol is None:
    ...  # Handle invalid SMILES
```
3. **Use parallel processing** for large datasets:
```python
result = dm.operation(..., n_jobs=-1, progress=True)
```
4. **Leverage fsspec** for cloud storage:
```python
df = dm.read_sdf("s3://bucket/compounds.sdf")
```
5. **Use appropriate fingerprints** for similarity:
- ECFP (Morgan): General purpose, structural similarity
- MACCS: Fast, smaller feature space
- Atom pairs: Considers atom pairs and distances
6. **Consider scale limitations**:
- Butina clustering: ~1,000 molecules (full distance matrix)
- For larger datasets: Use diversity selection or hierarchical methods
7. **Scaffold splitting for ML**: Ensure proper train/test separation by scaffold
8. **Align molecules** when visualizing SAR series
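Best practice 7 can be sketched without any chemistry dependencies: once each molecule has a scaffold key (e.g. a Murcko-scaffold SMILES from `dm.to_scaffold_murcko`), a scaffold split assigns whole scaffold groups to one side or the other, so no scaffold spans train and test. The `scaffold_split` helper and the scaffold strings below are illustrative, not a datamol API:

```python
from collections import defaultdict

def scaffold_split(scaffolds, test_fraction=0.2):
    # Group sample indices by scaffold key so no scaffold spans the split
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Common heuristic: large groups go to train, small groups to test
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_test = int(len(scaffolds) * test_fraction)
    train, test = [], []
    for group in ordered:
        if len(test) < n_test and len(train) + len(group) > len(scaffolds) - n_test:
            test.extend(group)
        else:
            train.extend(group)
    return train, test

# Hypothetical scaffold keys for five molecules
train_idx, test_idx = scaffold_split(
    ["c1ccccc1", "c1ccccc1", "C1CCCCC1", "C1CCCCC1", "C1CCNCC1"],
    test_fraction=0.2,
)
print(train_idx, test_idx)  # [0, 1, 2, 3] [4]
```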
## Error Handling
```python
# Safe molecule creation
def safe_to_mol(smiles):
    try:
        mol = dm.to_mol(smiles)
        if mol is not None:
            mol = dm.standardize_mol(mol)
        return mol
    except Exception as e:
        print(f"Failed to process {smiles}: {e}")
        return None
# Safe batch processing
valid_mols = []
for smiles in smiles_list:
    mol = safe_to_mol(smiles)
    if mol is not None:
        valid_mols.append(mol)
```
## Integration with Machine Learning
```python
import numpy as np

# Feature generation (fingerprint matrix)
X = np.array([dm.to_fp(mol) for mol in mols])
# Or descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values
# Train model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y_target)
# Predict
predictions = model.predict(X_test)
```
## Troubleshooting
**Issue**: Molecule parsing fails
- **Solution**: Use `dm.standardize_smiles()` first or try `dm.fix_mol()`
**Issue**: Memory errors with clustering
- **Solution**: Use `dm.pick_diverse()` instead of full clustering for large sets
**Issue**: Slow conformer generation
- **Solution**: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
**Issue**: Remote file access fails
- **Solution**: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
## Additional Resources
- **Datamol Documentation**: https://docs.datamol.io/
- **RDKit Documentation**: https://www.rdkit.org/docs/
- **GitHub Repository**: https://github.com/datamol-io/datamol
Source: claude-code-templates (MIT). See About Us for full credits.