SELFormer: molecular representation learning via SELFIES language models

Abstract

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data, for efficient use in subsequent prediction tasks. One approach to efficiently learning molecular representations is to process string-based notations of chemicals with natural language processing algorithms. Most of the methods proposed so far use the SMILES notation for this purpose, as it is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent a model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that takes a 100% valid, compact and expressive notation, SELFIES, as input in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, on predicting the aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
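To make the idea of processing string-based notations with language-model tooling more concrete, the following is a minimal sketch of how a SELFIES string might be split into symbols and mapped to integer ids before being fed to a model. It relies only on the fact that SELFIES strings are sequences of bracket-delimited symbols (with "." separating disconnected fragments). The function names `tokenize_selfies`, `build_vocab` and `encode`, the special tokens `<pad>` and `<unk>`, and the toy corpus are all hypothetical and chosen for illustration; this is not necessarily the tokenizer used by SELFormer.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# A SELFIES symbol is a bracketed unit such as "[C]" or "[=Branch1]";
# "." separates disconnected fragments.
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]|\.")


def tokenize_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its symbols (hypothetical helper)."""
    tokens = SELFIES_SYMBOL.findall(selfies)
    # Reject input containing anything other than well-formed symbols.
    if "".join(tokens) != selfies:
        raise ValueError("not a well-formed SELFIES string: %r" % selfies)
    return tokens


def build_vocab(corpus: Iterable[str]) -> Dict[str, int]:
    """Map each symbol to an id, most frequent first (illustrative scheme)."""
    counts = Counter(tok for s in corpus for tok in tokenize_selfies(s))
    vocab = {"<pad>": 0, "<unk>": 1}  # assumed special tokens
    # Sort by descending frequency, then alphabetically for determinism.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab


def encode(selfies: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a SELFIES string to ids; unseen symbols map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize_selfies(selfies)]


if __name__ == "__main__":
    toy_corpus = ["[C][C][O]", "[C][=O]"]  # toy strings, not real training data
    vocab = build_vocab(toy_corpus)
    print(tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]"))
    print(encode("[C][N][O]", vocab))
```

Symbol-level splitting is only one possible design; a subword tokenizer trained on the SELFIES corpus would be an equally plausible choice, and the sketch does not check that a string decodes to a chemically meaningful molecule.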

Author and article information

Contributors

Tunca Doğan

Journal

Title: Machine Learning: Science and Technology

Abbreviated Title: Mach. Learn.: Sci. Technol.

Publisher: IOP Publishing

ISSN (Electronic): 2632-2153

Publication date (Electronic): June 29 2023

Publication date (Print): June 01 2023

Volume: 4

Issue: 2

Page: 025035

Article

DOI: 10.1088/2632-2153/acdb30

SO-VID: 523abf5a-d8f8-498a-926a-dde79f656645

License: http://creativecommons.org/licenses/by/4.0
