# Feature engineering for molecular SMILES data

> Feature engineering for molecular SMILES data - Execute a feature engineering pipeline tailored for
> SMILES-formatted molecular data.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-05-06T18:17:09.576634+00:00` (UTC).

## Primary page

- [Feature engineering for molecular SMILES data](https://docs.datarobot.com/en/docs/api/dev-learning/accelerators/data-enrichment-prep/smiles.html): Full documentation for this topic (HTML).

## Related documentation

- [Developer documentation](https://docs.datarobot.com/en/docs/api/index.html): Linked from this page.
- [Developer learning](https://docs.datarobot.com/en/docs/api/dev-learning/index.html): Linked from this page.
- [AI accelerators](https://docs.datarobot.com/en/docs/api/dev-learning/accelerators/index.html): Linked from this page.
- [Data enrichment and preparation](https://docs.datarobot.com/en/docs/api/dev-learning/accelerators/data-enrichment-prep/index.html): Linked from this page.

## Documentation content

[Access this AI accelerator on GitHub](https://github.com/datarobot-community/ai-accelerators/tree/main/use_cases_and_horizontal_approaches/Feature%20Engineering%20For%20Molecular%20SMILES)

SMILES (simplified molecular input line entry system) is a textual representation of molecular structures. While it's compact and widely used in cheminformatics, SMILES strings must be transformed into numerical representations to be used effectively in machine learning models.

This accelerator introduces a feature engineering pipeline tailored for SMILES-formatted molecular data. It demonstrates how to convert raw SMILES strings into machine-learning-ready features using RDKit and other tools. It is recommended to run the accelerator in a DataRobot codespace using a GPU environment.
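
As a rough illustration of turning raw SMILES strings into numeric features with RDKit, the sketch below computes a few physicochemical descriptors and a Morgan fingerprint per molecule. The input strings, the chosen descriptors, and the fingerprint settings (`radius=2`, `fpSize=2048`) are assumptions made for this example; the accelerator's notebook may use a different selection.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# Hypothetical toy inputs: ethanol, benzene, and one deliberately invalid string.
smiles = ["CCO", "c1ccccc1", "not_a_smiles"]

# Morgan (circular) fingerprint generator; radius and size are illustrative choices.
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

rows = []
for s in smiles:
    mol = Chem.MolFromSmiles(s)  # returns None when the string cannot be parsed
    if mol is None:
        continue  # skip invalid SMILES instead of failing the whole batch
    rows.append(
        {
            "smiles": s,
            "mol_wt": Descriptors.MolWt(mol),    # molecular weight
            "logp": Descriptors.MolLogP(mol),    # estimated octanol-water partition coefficient
            "tpsa": Descriptors.TPSA(mol),       # topological polar surface area
            "fingerprint": morgan.GetFingerprintAsNumPy(mol),  # 2048-element bit vector
        }
    )
```

Each entry in `rows` holds scalar descriptors that can be used directly as tabular features, plus a high-dimensional fingerprint vector that typically benefits from dimensionality reduction before modeling.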

This accelerator's workflow is summarized below:

1. Preprocess and visualize SMILES strings using RDKit and py3Dmol.
2. Extract molecular descriptors as statistical features (physicochemical properties).
3. Extract TF-IDF features from SMILES strings, and then apply TruncatedSVD to obtain lower-dimensional embeddings.
4. Extract fingerprint features from SMILES strings, and then apply TruncatedSVD to obtain lower-dimensional embeddings.
5. Extract semantic representations from the pretrained molecular embedding models ChemBERTa and SMILESBERT, and then apply PCA to obtain lower-dimensional embeddings. A GPU is recommended because this step is slow on a CPU.
6. Run Autopilot with these features to compare model performance and create benchmarks.
7. Extract feature contributions (SHAP values).
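
The TF-IDF step in the workflow above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: the molecules are hypothetical toy inputs, SMILES strings are tokenized as character n-grams, and `ngram_range` and `n_components` are illustrative values rather than the accelerator's actual settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy inputs; the accelerator works on its own dataset.
smiles = [
    "CCO",                           # ethanol
    "CC(=O)O",                       # acetic acid
    "c1ccccc1",                      # benzene
    "CC(=O)Oc1ccccc1C(=O)O",         # aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
]

# Treat each SMILES string as text and build character 1- to 3-gram TF-IDF features.
# lowercase=False keeps aromatic atoms (c) distinct from aliphatic ones (C).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False)
tfidf = vectorizer.fit_transform(smiles)  # sparse matrix: one row per molecule

# Reduce the sparse TF-IDF matrix to a small dense embedding.
svd = TruncatedSVD(n_components=3, random_state=0)
embeddings = svd.fit_transform(tfidf)  # dense array of shape (5, 3)
```

TruncatedSVD operates directly on sparse matrices, which makes it a natural fit for TF-IDF output; the same reduction pattern applies to sparse fingerprint vectors.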
