Machine Learning Coffee Seminar: Assistant Professor Tapio Pahikkala, University of Turku
Weekly seminars held jointly by Aalto University and the University of Helsinki.
Machine learning researchers in the Helsinki region start the week with an exciting machine learning talk. The aim is to bring together people from different fields of science who share an interest in machine learning. Porridge and coffee are served at 9:00, and the talk begins at 9:15. The venue for this talk is Exactum D123, Kumpula.
Subscribe to the mailing list where seminar topics are announced beforehand.
Small Data AUC Estimation of Machine Learning Methods: Pitfalls and Remedies
Tapio Pahikkala
Assistant Professor, University of Turku
Abstract:
Asking whether two populations can be distinguished from each other is one of the most fundamental questions in data analysis, and the area under the ROC curve (AUC) is one of the simplest and most practical tools for answering it. Also known as the Wilcoxon-Mann-Whitney U statistic, it can be associated with a p-value indicating how likely one would be to obtain an equally good AUC value if the two populations were not stochastically different. Estimating the AUC of a predictive model and its statistical significance has huge practical importance in fields like medicine, where one often has access to only small amounts of labeled data but a large number of features. Leave-pair-out cross-validation (LPOCV) is an almost unbiased AUC estimator for machine learning methods and has also been shown empirically to be the most reliable of the cross-validation (CV) based estimators. We further study the properties of LPOCV and show some serious pitfalls one can encounter when estimating AUC with CV, and how to avoid them. In particular, we show how one can produce very promising results with high AUC values even if there is no signal in the data. Finally, we show how to counter these risks with new Wilcoxon-Mann-Whitney U type permutation tests adjusted for LPOCV, thus upgrading one of the classical statistical tools for CV estimates.
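For readers new to these terms, the following is a minimal Python sketch of the two quantities the abstract names: AUC computed as a normalized Wilcoxon-Mann-Whitney statistic, and a leave-pair-out estimate of it. It is an illustration under stated assumptions, not the speaker's implementation. The function names, the `centroid_scorer` stand-in learner, and the plain label-permutation p-value are all hypothetical choices made for this example; in particular, the permutation test here is the generic one, not the LPOCV-adjusted test presented in the talk. The sketch assumes binary labels coded 0/1 and at least two examples per class.

```python
import numpy as np


def auc_wmw(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, which matches the normalized
    Wilcoxon-Mann-Whitney U statistic.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


def lpocv_auc(X, y, fit_predict):
    """Leave-pair-out CV estimate of AUC.

    For every (positive, negative) pair, train on all other examples and
    check whether the held-out positive receives the higher score.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    total = 0.0
    for i in pos_idx:
        for j in neg_idx:
            train = np.ones(len(y), dtype=bool)
            train[[i, j]] = False  # hold out exactly this pair
            s_pos, s_neg = fit_predict(X[train], y[train], X[[i, j]])
            total += 1.0 if s_pos > s_neg else 0.5 if s_pos == s_neg else 0.0
    return total / (len(pos_idx) * len(neg_idx))


def centroid_scorer(X_train, y_train, X_test):
    """Hypothetical stand-in learner: project onto the class-mean difference."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w


def permutation_p_value(X, y, fit_predict, n_perm=200, seed=0):
    """Generic label-permutation p-value for the LPOCV AUC (an assumption,
    not the adjusted test from the talk). Every permutation reruns LPOCV,
    so this is only practical for small data sets.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = lpocv_auc(X, y, fit_predict)
    hits = sum(
        lpocv_auc(X, rng.permutation(y), fit_predict) >= observed
        for _ in range(n_perm)
    )
    return observed, (1 + hits) / (1 + n_perm)
```

A typical call would be `permutation_p_value(X, y, centroid_scorer)`, where `X` is an array of shape (n_samples, n_features) and `y` holds the 0/1 labels. Any preprocessing that looks at the labels, such as feature selection, would need to happen inside `fit_predict` so that the held-out pair never influences training.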
See the upcoming talks on the seminar webpage.
Please spread the news and join us for our weekly habit of beginning the week with an interesting machine learning talk!
Welcome!