Replication data for: Towards the improvement of thermodynamic solubility prediction – a review

Document record

Date

5 January 2023

Cite this document

Pierre Llompart et al., "Replication data for: Towards the improvement of thermodynamic solubility prediction – a review", Recherche Data Gouv, ID: 10.57745/CZVZIA



Abstract

Evaluating thermodynamic solubility is crucial for designing successful drug candidates, yet predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models based on molecular descriptors. Recently, powerful solubility prediction models using feature- and graph-based neural networks have been published. These models often display attractive performance, yet their reliability may be deceptive when they are used for prospective prediction. This review investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public usage because they lack a well-defined applicability domain and overlook some historical data sources. On the basis of a carefully reviewed dataset, we illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of a pharmaceutical context, and they performed poorly. We observed the impact of several factors on model performance: interlaboratory standard deviation, ionic state of the solute, and source of the solubility data. As a consequence, we outline a general workflow for curating aqueous solubility data with the aim of producing predictive models. Our results show how the data quality and applicability domain of public models affect their utility in a real pharmaceutical industry context. We found that some data sources, for instance the eChem dataset, may be less reliable than initially expected. This exhaustive analysis of aqueous solubility data led to the development of a curation workflow; the resulting models and datasets are publicly available.
Data are available as CSV files.

File AqSolDBc.csv
Curated data from the AqSolDB. The available columns are:
    ID: Compound ID (string)
    InChI: InChI code of the chemical structure (string)
    Solubility: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) at ~300 K (float)
    SMILEScurated: Curated SMILES code of the chemical structure (string)
    SD: Standard laboratory deviation, default value: -1 (float)
    Group: Data quality label imported from AqSolDB (string)
    Dataset: Source of the data point (string)
    Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
    Error: Identified error on the data point, default value: None (string)
    Charge: Estimated formal charge of the compound at pH 7: Positive, Negative, Zwiterion, Uncharged (categorical)

File OChemUnseen.csv
Solubility data from OChem, curated and orthogonal to AqSolDB. The available columns are:
    SMILES: Curated SMILES code of the chemical structure (string)
    LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)

File OChemOverlapping.csv
Solubility data from OChem, curated; the chemical structures are also present in AqSolDB. The available columns are:
    SMILES: Curated SMILES code of the chemical structure (string)
    LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)

File OChemCurated.csv
Solubility data from OChem, curated. The available columns are:
    ID: Compound ID (string)
    Name: Compound name (string)
    SMILES: Curated SMILES code of the chemical structure (string)
    SDi: Standard laboratory deviation, default value: -1 (float)
    Reference: Unformatted bibliographic reference from which the data point originates (string)
    LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)
    EXTERNALID: Compound ID as it appears in its data source, default value: None (string)
    CASRN: CAS number of the compound, default value: None (string)
    ARTICLEID: Source ID linked to the column Reference (string)
    Temperature: Temperature of the measurement, in K (float)
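
As an illustration of how the AqSolDBc.csv columns described above might be consumed, here is a minimal Python sketch using only the standard library. It rests on several assumptions that the record does not confirm: that the file is comma-delimited with a header row using exactly the listed column names, that the default values are stored literally as `-1` (SD) and `None` (Error), and that filtering on SD is a sensible way to pick higher-confidence points. The names `read_aqsoldbc` and `reliable_subset`, the `max_sd` threshold, and the demo rows are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, Union

Record = Dict[str, Union[str, float, None]]


def read_aqsoldbc(lines: Iterable[str]) -> List[Record]:
    """Parse AqSolDBc-style rows, mapping sentinel defaults to None.

    Assumes a comma-delimited file with a header row (not confirmed
    by the dataset description).
    """
    records: List[Record] = []
    for row in csv.DictReader(lines):
        sd = float(row["SD"])
        error = row["Error"].strip()
        records.append({
            "id": row["ID"],
            "smiles": row["SMILEScurated"],
            "log_s": float(row["Solubility"]),
            # SD defaults to -1 when no deviation is available.
            "sd": None if sd == -1 else sd,
            # Error defaults to the literal string "None" (assumed).
            "error": None if error in ("", "None") else error,
            "charge": row["Charge"],
        })
    return records


def reliable_subset(records: List[Record], max_sd: float) -> List[Record]:
    """Keep rows with no flagged error and a known SD <= max_sd.

    The max_sd cutoff is a hypothetical choice for this sketch.
    """
    return [
        r for r in records
        if r["error"] is None and r["sd"] is not None and r["sd"] <= max_sd
    ]


if __name__ == "__main__":
    # Synthetic placeholder rows for demonstration only; not real data.
    demo = io.StringIO(
        "ID,InChI,Solubility,SMILEScurated,SD,Group,Dataset,"
        "Composition,Error,Charge\n"
        "X-1,InChI=1S/CH4/h1H4,-2.5,C,-1,G1,A,mono-constituent,None,Uncharged\n"
        "X-2,InChI=1S/C2H6/c1-2/h1-2H3,-3.0,CC,0.2,G3,B,"
        "mono-constituent,None,Uncharged\n"
    )
    for rec in reliable_subset(read_aqsoldbc(demo), max_sd=0.5):
        print(rec["id"], rec["log_s"], rec["sd"])
```

To read the real file, pass an open file handle instead of the in-memory demo, for example `open("AqSolDBc.csv", newline="", encoding="utf-8")`; the encoding is likewise an assumption.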
