CS Forum: Sebastian Böcker, Friedrich-Schiller Universität Jena
The CS department's public guest lecture on 'Classes for the Masses: Comprehensive categorization of unknowns using tandem mass spectrometry'. The lecture is open to everyone, free of charge.
Prof Sebastian Böcker
Friedrich-Schiller Universität Jena, Germany
Host: Prof Juho Rousu
Time: 15:15 (coffee at 15:00)
Venue: T3, CS building
Classes for the Masses: Comprehensive categorization of unknowns using tandem mass spectrometry
Abstract
Mass spectrometry is a predominant technology for the analysis and elucidation of metabolites and other small molecules. For many years, automated methods for interpreting tandem mass spectra (MS/MS) were limited to spectral library searching. Recently, novel computational methods such as CSI:FingerID and CFM-ID were introduced, which allow us to search MS/MS data in molecular structure databases. However, such a search is naturally restricted to structures present in some database.
Here, we present CANOPUS, a tool for predicting compound categories (such as "cyclic ketones" or "oxosteroids") directly from spectral data. Our tool addresses three fundamental pitfalls for this task: Our MS/MS training data is highly restricted; can we classify categories for which we have no or insufficient positive training data? Can we classify categories if we have positive training data only for a subcategory (for example, "steroids" and "oxosteroids")? And can we classify compounds that are true unknowns, meaning they are presently not contained in any structure database?
For CANOPUS, we can answer all three questions in the affirmative. In addition, it has excellent predictive power: CANOPUS currently predicts 1,143 categories from the ClassyFire ChemOnt ontology. Evaluated on independent data, predictions have an average accuracy of 99.2%; for half of the categories, we reach an F1 score (harmonic mean of precision and recall) greater than 0.75.
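To make the two figures above easier to compare, here is a minimal Python sketch of how accuracy and the F1 score are computed from a confusion matrix. The counts are invented for illustration and are not CANOPUS results; the function names are our own. The point is that a rare category can show very high accuracy while its F1 score stays more modest, because accuracy is dominated by true negatives.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one category."""
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so the F1 score is taken to be 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions (positive and negative) that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts for one rare category among 10,000 compounds.
tp, fp, fn, tn = 30, 10, 10, 9950

print(f"F1 score: {f1_score(tp, fp, fn):.3f}")      # F1 score: 0.750
print(f"Accuracy: {accuracy(tp, fp, fn, tn):.3f}")  # Accuracy: 0.998
```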
CANOPUS is able to predict categories without positive MS/MS training data by using a two-step approach: first, predict a molecular fingerprint from MS/MS using an array of support vector machines; second, predict the compound categories from these fingerprints using a deep neural network trained on millions of molecular structures. Category predictions can help an expert in the structural elucidation of individual compounds. Applied to a complete dataset, they also give an idea of which compound categories are present, considering all compounds in the dataset, both identified and unidentified.
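The two-step structure described above can be sketched as a composition of two functions. This is a toy illustration only, not the CANOPUS implementation: the per-bit classifiers are hand-set linear decision rules standing in for trained support vector machines, the category step is a simple rule lookup standing in for the deep neural network, and all names, weights, and bit assignments are hypothetical. What it shows is that the second step only ever sees fingerprints, so it can be built from molecular structures alone, without MS/MS data for a given category.

```python
from typing import Callable, Dict, List, Sequence

BitClassifier = Callable[[Sequence[float]], int]


def make_bit_classifier(weights: Sequence[float], bias: float) -> BitClassifier:
    """Hypothetical stand-in for one trained classifier: a linear decision rule."""
    def classify(spectrum: Sequence[float]) -> int:
        score = sum(w * x for w, x in zip(weights, spectrum)) + bias
        return 1 if score > 0 else 0
    return classify


def predict_fingerprint(spectrum: Sequence[float],
                        bit_classifiers: Sequence[BitClassifier]) -> List[int]:
    """Step 1: one classifier per fingerprint bit, applied to the spectrum."""
    return [clf(spectrum) for clf in bit_classifiers]


def predict_categories(fingerprint: Sequence[int],
                       category_rules: Dict[str, List[int]]) -> List[str]:
    """Step 2 (toy): a category is predicted when all its required bits are set."""
    return sorted(name for name, bits in category_rules.items()
                  if all(fingerprint[i] == 1 for i in bits))


# Toy spectrum: three binned peak intensities (invented values).
spectrum = [1.0, 0.0, 0.5]

# Three hand-set classifiers, one per fingerprint bit (assumed weights).
bit_classifiers = [
    make_bit_classifier([1.0, 0.0, 0.0], -0.5),
    make_bit_classifier([0.0, 1.0, 0.0], -0.5),
    make_bit_classifier([0.0, 0.0, 1.0], -0.25),
]

# Category rules defined on fingerprint bits only; no spectra are involved.
category_rules = {"steroids": [0], "oxosteroids": [0, 2]}

fingerprint = predict_fingerprint(spectrum, bit_classifiers)
print(fingerprint)                                      # [1, 0, 1]
print(predict_categories(fingerprint, category_rules))  # ['oxosteroids', 'steroids']
```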
Bio
Sebastian Böcker is a full professor at the Friedrich Schiller University Jena, and leader of the Chair for Bioinformatics at the Institute for Computer Science. Prof. Böcker has long-standing expertise in bioinformatics, algorithmics, and combinatorial optimization. He is particularly interested in computational methods for mass spectrometry, phylogenetics and supertree methods, gene cluster analysis, and algorithm development for computationally hard problems in bioinformatics. Böcker is one of the authors of the award-winning CSI:FingerID metabolite identification system, developed in collaboration with the KEPACO research group at Aalto University.
Prof. Böcker is visiting the KEPACO research group at the Aalto CS Department 23-25.8. Contact Juho Rousu if you wish to set up a meeting.