Evaluating Molecular Representations for Property Prediction: A Computational Chemistry Approach Using Molfeat and AutoGluon

Sergey Kolchenko
7 min readOct 2, 2023

--

Generated witl Bing & DALL-E

Computational chemistry is a fascinating domain where the challenges are as intriguing as the possibilities, especially for a non-specialist in computational chemistry like me. Handling small molecules brings us to an important question: how do we represent them? Traditionally, SMILES (Simplified Molecular Input Line-Entry System) strings have been the go-to option, as they offer a text-based representation of molecules. But how do you make machine learning models understand these strings? There are various methods, such as computing Morgan fingerprints, which encode groups of atoms into a unique binary vector. But the possibilities don’t end there; emerging techniques like ChemGPT can also offer latent space representations. So, which method is optimal? The answer isn’t straightforward. In this blog post, I’ll explore molfeat, a highly versatile framework for aggregating various molecular representations, in conjunction with Autogluon, an AutoML package designed to simplify the modeling process. We’ll test these various molecular representations across 18 different benchmarks from tdcommons.ai to determine which is most effective.

Code

Admittedly, I’m inclined towards automation (in other words, lazy), so I streamlined the data processing and model building phases.

  1. Let’s focus on SMILES standardization.
import datamol as dm
import anndata as ad
import numpy as np
import pandas as pd
from typing import List, Optional, Callable

from molfeat.trans.fp import FPVecTransformer
from molfeat.trans import MoleculeTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
from molfeat.trans.pretrained import GraphormerTransformer
from molfeat.trans.pretrained import PretrainedDGLTransformer
from molfeat.calc.pharmacophore import Pharmacophore2D
import numpy as np
import anndata as ad

from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config

def preprocess_smiles_str(smiles: str) -> str:
mol = dm.to_mol(smiles, ordered=True)
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol, sanifix=True, charge_neutral=False)
mol = dm.standardize_mol(
mol,
disconnect_metals=False,
normalize=True,
reionize=True,
uncharge=False,
stereo=True,
)
return dm.standardize_smiles(dm.to_smiles(mol))

def preprocess_smiles_anndata(
anndata: ad.AnnData,
smiles_column: str = 'SMILES',
clean_smiles_columns: str = 'standard_smiles'
) -> ad.AnnData:
anndata.obs[clean_smiles_columns] = anndata.obs[smiles_column].apply(preprocess_smiles_str)
return anndata

2. Generate different representations.



def generate_representations_for_smiles(
records: np.ndarray,
transformer: Callable[[str], np.ndarray],
batch_size: int = 128,
**kwargs
) -> np.ndarray:
features_list = []
n_batches = len(records) // batch_size
batch_iter = batch_indexable(records, batch_size)
for batch in batch_iter:
try:
features = transformer.transform(batch)
except Exception as e:
print('Error in batch', batch, transformer, e)
features_list.append(features)
features_list = np.concatenate(features_list)
return features_list

def generate_representations_for_anndata(
input_anndata: ad.AnnData,
smiles_column: str = 'SMILES',
fpvec_models: Optional[List[str]] = None,
chemgpt_models: Optional[List[str]] = None,
graphormer_models: Optional[List[str]] = None,
dgl_models: Optional[List[str]] = None,
pharmacophore_models: Optional[List[str]] = None,
hft_models: Optional[List[str]] = None,
molecule_models: Optional[List[str]] = None,
batch_size: int = 128,
device: str = 'cpu',
**kwargs
) -> ad.AnnData:
dm.disable_rdkit_log()
fpvec_models = fpvec_models or DEFAULT_FPVEC_MODELS
chemgpt_models = chemgpt_models or DEFAULT_CHEMGPT_MODELS
graphormer_models = graphormer_models or DEFAULT_GRAPHORMER_MODELS
dgl_models = dgl_models or DEFAULT_DGL_MODELS
pharmacophore_models = pharmacophore_models or DEFAULT_PHARMACOPHORE_MODELS
hft_models = hft_models or DEFAULT_HFT_MODELS
molecule_models = molecule_models or DEFAULT_MOLECULE_MODELS

records = input_anndata.obs[smiles_column].values
# FPVecModels
for model_type in fpvec_models:
transformer = FPVecTransformer(kind=model_type, dtype=float)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

# PharmacophoreModels
for model_type in pharmacophore_models:
transformer = MoleculeTransformer(featurizer=Pharmacophore2D(factory=model_type), dtype=float)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

# MoleculeModels
for model_type in molecule_models:
transformer = MoleculeTransformer(featurizer=model_type, dtype=float)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)
# HFTransformers
for model_type in hft_models:
transformer = PretrainedHFTransformer(kind=model_type, notation='smiles', dtype=float, device=device, pooling='mean', random_seed=42, preload=True)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

# ChemGPTModels
for model_type in chemgpt_models:
transformer = PretrainedHFTransformer(kind=model_type, notation='selfies', dtype=float, device='gpu', pooling='mean', random_seed=42, preload=True)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

# GraphormerModels
for model_type in graphormer_models:
transformer = GraphormerTransformer(kind=model_type, dtype=float, pooling='mean', preload=True)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

# DGLModels
for model_type in dgl_models:
transformer = PretrainedDGLTransformer(kind=model_type, dtype=float, pooling='mean', preload=True)
input_anndata.obsm[model_type] = generate_representations_for_smiles(records, transformer, batch_size=batch_size)

return input_anndata

Generating representations for an input AnnData object involves storing them in its .obsm attribute. So what exactly is AnnData? Simply put, AnnData is a flexible structure that combines the best of numpy arrays and pandas DataFrames to manage annotated data. It allows for the storage of annotations related to your samples, features, and even multiple layers of data. In our case, we use the .obsm property to hold various matrices, each of shape (n_samples, n_representation). Although the sample order remains constant, the feature dimensions may vary among different representations. For a deeper dive into this versatile package, the official documentation is a rich resource.

anndata structure, from https://anndata.readthedocs.io/en/latest/

Different model frameworks — be it ChemGPT, DGL, or FPVec — come with their own classes and input parameters. Therefore, we iterate through each model type and store the respective molecular representations. Specifically, we apply feature extraction on every SMILES string from the smiles_column in the input AnnData object.

3. Run AutoML

def fit_automl(
train_ad: ad.AnnData,
test_ad: Optional[ad.AnnData],
label_col: str,
task_problem: str,
task_metric: str,
task_name: str,
val_ad: Optional[ad.AnnData] = None,
time_limit: int = 60,
presets: str = 'good_quality'
) -> pd.DataFrame:
task_results = []
hyperparameters = get_hyperparameter_config('default')
hyperparameters.pop('KNN') # KNN never worked :(
for representation_name in train_ad.obsm.keys():
print(f'Fitting {representation_name}')
representation_shape = train_ad.obsm[representation_name].shape[1]
feature_names = ['feature_' + str(x) for x in range(representation_shape)]
train_dataset = pd.DataFrame(train_ad.obsm[representation_name], columns=feature_names)
train_dataset[label_col] = train_ad.obs[label_col].values
train_dataset = TabularDataset(train_dataset)
if val_ad is not None:
val_dataset = pd.DataFrame(val_ad.obsm[representation_name], columns=feature_names)
val_dataset[label_col] = val_ad.obs[label_col].values
val_dataset = TabularDataset(val_dataset)

test_dataset = pd.DataFrame(test_ad.obsm[representation_name], columns=feature_names)
test_dataset[label_col] = test_ad.obs[label_col].values
test_dataset = TabularDataset(test_dataset)

predictor = TabularPredictor(
problem_type=task_problem,
eval_metric=task_metric,
path=f'models/{task_name}__{representation_name}',
label=label_col,
verbosity=1,
log_to_file=True,
log_file_path=f'logs/{task_name}__{representation_name}'
)
predictor.fit(
train_dataset,
tuning_data=val_dataset if val_ad is not None else None,
use_bag_holdout=True if val_ad is not None else False,
time_limit=time_limit,
hyperparameters=hyperparameters,
presets=presets
)
results = pd.DataFrame(predictor.evaluate(test_dataset, silent=True), index=[representation_name])
results['task'] = task_name
task_results.append(results)
return pd.concat(task_results)

The crux of the operation lies in this function: it fits an autoML model for each representation stored in the .obsm attribute of the input AnnData file. We use both training and validation data sets, especially if early stopping is a consideration. However, it's important to note that we only record the performance score of the model. The actual model type—whether it's XgBoost, logistic regression, or a decision tree—is irrelevant to our primary question: What is the optimal performance achievable using a specific molecular representation?

4. Example of running a single dataset from TDC

task_name = 'DILI'
data = Tox(name=task_name)
split = data.get_split()
label_col = 'Y'
train_ad = ad.AnnData(obs=split['train'])
val_ad = ad.AnnData(obs=split['valid'])
test_ad = ad.AnnData(obs=split['test'])

train_ad = preprocess_smiles_anndata(train_ad, smiles_column='Drug')
val_ad = preprocess_smiles_anndata(val_ad, smiles_column='Drug')
test_ad = preprocess_smiles_anndata(test_ad, smiles_column='Drug')

train_ad = generate_representations_for_anndata(train_ad, smiles_column='standard_smiles')
val_ad = generate_representations_for_anndata(val_ad, smiles_column='standard_smiles')
test_ad = generate_representations_for_anndata(test_ad, smiles_column='standard_smiles')

task_problem = 'binary'
task_metrics = 'average_precision'

task_results = fit_automl(
train_ad=train_ad,
test_ad=test_ad,
label_col=label_col,
task_problem=task_problem,
task_metric=task_metrics,
task_name=task_name,
val_ad=val_ad,
time_limit=time_limit,
presets=presets
)
task_results.to_csv(f'results/{task_name}.csv')

To illustrate, let’s consider the task of predicting DILI (Drug-Induced Liver Injury). Essentially, the goal is to ascertain if a particular molecule will cause liver damage. In this example, we obtain the data, divide it into training, validation, and test sets, preprocess the SMILES strings, generate all feasible molecular representations, and finally fit an autoML model. The target metric used here is average precision.

Results

In this section, we iterate over all processed datasets, ranking each molecular representation based on performance for a given problem. The top-performing representation for a particular dataset earns 10 points, the second-best earns 9 points, and so on, down to the tenth best, which earns 1 point. Representations beyond the tenth receive zero points. As I delved into this analysis, I implemented two key modifications.

First, several tasks were trivial for any molecular representation. For instance, predicting drug absorption from the human gastrointestinal system into the bloodstream posed little challenge, with average precision scores consistently above 0.9, regardless of the representation. Consequently, I chose not to award points for such tasks. Specifically, if the best representation didn’t outperform the average by at least 10%, no points were awarded.

Predicting human intestinal absorption is a difficult task, just not in this dataset — whatever you use, you get .9+ AP!

Second, I introduced a scaling factor to the points awarded. If the top-performing representation is twice as good as the mean performance (across all representations), and the second-best is 1.15 times better than the average, the raw scores are then scaled by their “performance gain over the mean,” calculated as score[i] / mean(score). While statisticians might raise eyebrows at this approach, it aligns with my intuition.

gin_supervised_edgepred performed way better than average on this task…
which resulted in extra points!

After processing 18 tasks, 9 were discarded for being too easy and therefore not suitable benchmarks for computational chemistry. Summing the points across the remaining 9 tasks, the top-performing molecular representations were identified:

It might not come as a surprise for somebody that desc2D, numerical representation of chemical and physical properties of a molecule, are the best for representation when we need to predict something that is directly related to the molecular properties (that sounds like a natural choice for me). Two other interesting models, that are worth trying when building a molecular predictors, are ECFP (Extended-connectivity fingerprints) and two graph NN models coming from the same paper. Perhaps not surprisingly, Transformer-based models didn’t perform well in this benchmark (alghough here no fine tuning was performed and I used the raw latent representation).

Conclusion

Molfeat is a robust library that enables various molecular representations with minimal changes to your existing codebase. Using Autogluon, you can bypass the arduous task of model selection, focusing instead on curating and understanding your dataset. By integrating Molfeat and Autogluon, I evaluated an array of molecular representations using datasets from the Therapeutics Data Commons (TDC). However, as pointed out by Pat Walters, the quest for accurate molecular property prediction demands more diverse and high-quality datasets to validate the efficacy of these models.

--

--