How much does missing data affect the performance of multitask machine learning algorithms?
Presenting author: Antonio de la Vega de León
Abstract: Multitask machine learning algorithms predict several different properties at the same time using a single model. As molecules are now tested in more assays, multitask models are an attractive option to reduce training time and improve performance. The most common algorithms for multitask prediction are deep neural networks, which showed their promise for toxicology prediction in the Tox21 data challenge. However, most public multitarget data sources are incomplete: not all compounds have been tested in all possible assays. As a result, the datasets available to train multitask machine learning models are sparse. In this study, we analysed how the performance of different machine learning algorithms was affected by the amount of missing data. We compared deep neural networks to Macau, a multitask model based on Bayesian probabilistic matrix factorisation. We tested the two algorithms on two complete datasets: the PKIS dataset as a regression task and a set of PubChem assays as a classification task. For these datasets, we simulated the removal of data through different models to create increasingly sparse training sets. As expected, the performance of the models decreased as the amount of data decreased. However, the decrease was not linear. Performance declined very gradually from the full datasets until 60% of the data had been removed. Beyond 60%, performance decreased much more quickly. This effect was consistently observed across the datasets and algorithms tested. This work provides a first approximation of how much data is “good enough” to produce well-performing multitask models.
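To make the data-removal step concrete, the following minimal sketch shows one possible removal model: entries of a complete compound-by-assay matrix are masked uniformly at random. This is an assumption for illustration only; the abstract does not specify which removal models were used, and the helper names mask_at_random and missing_fraction, as well as the synthetic toy matrix, are hypothetical and not taken from the study.

```python
import math
import random


def mask_at_random(matrix, fraction_removed, seed=0):
    """Return a copy of `matrix` with a fraction of its entries set to NaN.

    Assumption: entries are removed uniformly at random. This is only one
    possible removal model, chosen here for simplicity.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_remove = round(fraction_removed * len(cells))
    sparse = [list(row) for row in matrix]  # copy; the input stays complete
    for i, j in rng.sample(cells, n_remove):
        sparse[i][j] = math.nan
    return sparse


def missing_fraction(matrix):
    """Fraction of entries in `matrix` that are NaN."""
    total = sum(len(row) for row in matrix)
    missing = sum(math.isnan(value) for row in matrix for value in row)
    return missing / total


# Synthetic toy matrix (100 compounds x 8 assays); not PKIS or PubChem data.
toy_rng = random.Random(1)
y_full = [[toy_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]

# Build increasingly sparse training matrices from the same complete matrix.
for fraction in (0.2, 0.6, 0.8):
    y_train = mask_at_random(y_full, fraction)
    print(fraction, missing_fraction(y_train))
```

In a setup like this, a model would be trained on each sparse matrix while evaluation uses held-out values from the complete data, so that performance can be compared across sparsity levels.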