
Feature engineering for molecular SMILES data

Access this AI accelerator on GitHub

SMILES (simplified molecular input line entry system) is a textual representation of molecular structures. While it's compact and widely used in cheminformatics, SMILES strings must be transformed into numerical representations to be used effectively in machine learning models.
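
To make the text-to-numbers step concrete, here is a minimal, dependency-free sketch that splits SMILES strings into tokens and turns each one into a vector of token counts. The regular expression and the names `SMILES_TOKEN`, `tokenize`, and `count_vector` are illustrative assumptions, not part of the accelerator. The pattern is deliberately simplified and does not validate chemistry, which is the job of a toolkit such as RDKit.

```python
import re
from collections import Counter

# Simplified, illustrative token pattern (an assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, single letters,
# ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def count_vector(smiles, vocabulary):
    """Count how often each vocabulary token occurs in one SMILES string."""
    counts = Counter(tokenize(smiles))
    return [counts[token] for token in vocabulary]


# Acetic acid, benzene, and 1-bromo-2-chloroethane as example inputs.
molecules = ["CC(=O)O", "c1ccccc1", "ClCCBr"]

# The vocabulary is every token seen in the example set, in sorted order.
vocabulary = sorted({tok for smi in molecules for tok in tokenize(smi)})
vectors = [count_vector(smi, vocabulary) for smi in molecules]

for smi, vec in zip(molecules, vectors):
    print(smi, vec)
```

Each molecule becomes a fixed-length list of integers, one entry per vocabulary token. Raw counts like these discard most structural information, which is why richer features such as descriptors, fingerprints, and learned embeddings are used in practice.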

This accelerator introduces a feature engineering pipeline tailored for SMILES-formatted molecular data. It demonstrates how to convert raw SMILES strings into machine-learning-ready features using RDKit and other tools. Running the accelerator in a DataRobot codespace with a GPU environment is recommended.

This accelerator's workflow is summarized below:

  1. Preprocess and visualize SMILES strings using RDKit and py3Dmol.
  2. Extract statistical features from molecular descriptors (physicochemical properties).
  3. Extract TF-IDF features from SMILES strings, and then apply TruncatedSVD to obtain lower-dimensional embeddings.
  4. Extract fingerprint features from SMILES strings, and then apply TruncatedSVD to obtain lower-dimensional embeddings.
  5. Extract semantic representations from the pretrained molecular embedding models ChemBERTa and SMILESBERT, and then apply PCA to obtain lower-dimensional embeddings. This step is slow on CPU, so a GPU is recommended.
  6. Run Autopilot with these features to compare model performance and create benchmarks.
  7. Extract feature contributions (SHAP values).
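
The TF-IDF step in the workflow above can be sketched in plain Python, assuming character n-grams as the terms. This is not the accelerator's implementation, which is not shown in this overview. The function names `char_ngrams` and `tfidf_matrix`, the n-gram range of 1 to 3, and the example molecules are all illustrative assumptions. The weighting uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which matches scikit-learn's default conventions. The TruncatedSVD reduction that follows in the workflow is omitted here.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_matrix(documents, n_min=1, n_max=3):
    """Build an L2-normalized TF-IDF matrix over character n-grams."""
    term_counts = [Counter(char_ngrams(doc, n_min, n_max)) for doc in documents]
    vocabulary = sorted(set().union(*term_counts))
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        for term in vocabulary
    }

    rows = []
    for counts in term_counts:
        row = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows


# Ethanol, ethylamine, and benzene as example inputs.
smiles_list = ["CCO", "CCN", "c1ccccc1"]
vocabulary, rows = tfidf_matrix(smiles_list)

print(len(vocabulary), "n-gram features for", len(rows), "molecules")
```

The result is a sparse, high-dimensional matrix: most n-grams occur in only a few molecules, so most entries are zero. That sparsity is the reason the workflow follows TF-IDF with a dimensionality reduction step before modeling.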