Data curation workflow
One of the most time-consuming parts of any data analysis is curating the input data prior to any model building. This Knime workflow is fully documented and described and as such is an invaluable starting point.
A semi-automated procedure is made available to support scientists in data preparation for modelling purposes. The procedure address:
- Automatic chemical data retrieval (i.e., SMILES) from different, orthogonal web based databases, by using two different identifiers, i.e. chemical name and CAS registration number. Records were scored based on the coherence of information retrieved from different web sources.
- Data curation procedure performed to top scored records. The procedure includes removal of inorganic and organometallic compounds and mixtures, neutralization of salts, removal of duplicates, checking of tautomeric forms.
- Standardization of chemical structures yielding to ready-to-use data for the development of QSARs.