Machine Learning Coffee seminar: "Machine Learning using Unreliable Components: From Matrix Operations to Neural Networks and Stochastic Gradient Descent" Sanghamitra Dutta
Weekly seminars held jointly by Aalto University and the University of Helsinki.
Machine learning researchers in the Helsinki region start the week with an exciting machine learning talk. The aim is to bring together people from different fields of science who share an interest in machine learning. Porridge and coffee are served at 9:00, and the talk begins at 9:15.
Subscribe to the mailing list where seminar topics are announced beforehand.
Machine Learning using Unreliable Components: From Matrix Operations to Neural Networks and Stochastic Gradient Descent
Sanghamitra Dutta
PhD Candidate, Carnegie Mellon University
Abstract:
Reliable computation at scale is a key challenge in large-scale machine learning today. Unreliability in computation can manifest itself in many forms, e.g. (i) "straggling" of a few slow processing nodes, which can delay the entire computation, as in synchronous gradient descent; (ii) processor failures; and (iii) "soft errors," undetected errors in which nodes produce garbage outputs. My focus is on the problem of training using unreliable nodes.
First, I will introduce the problem of training model-parallel neural networks in the presence of soft errors. This problem was in fact the motivation of von Neumann's 1956 study, which started the field of computing using unreliable components. We propose "CodeNet", a unified, error-correction-coding-based strategy that is woven into the linear algebraic operations of neural network training to provide resilience to errors in every operation during training. I will also survey some of the notable results in the emerging area of "coded computing," including my own work on matrix-vector and matrix-matrix products, which outperform classical results in fault-tolerant computing by arbitrarily large factors in expected time. Next, I will discuss the error-runtime trade-offs of various data-parallel approaches to training machine learning models in the presence of stragglers, in particular synchronous and asynchronous variants of SGD. Finally, I will discuss some open problems in this exciting and interdisciplinary area.
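To give a flavour of what "coded computing" means here, below is a minimal, self-contained sketch (in NumPy) of a classic coded matrix-vector multiplication: the matrix is split into two row-blocks handled by two workers, and a third worker computes a coded "parity" task, so the full product can be recovered from any two of the three results. This is an illustrative toy example under our own assumptions, not the speaker's exact scheme; the worker names and the `decode` helper are hypothetical.

```python
import numpy as np

# Illustrative sketch of coded matrix-vector multiplication (not the
# speaker's method). A is split into row-blocks A1 and A2; three "workers"
# compute A1 @ x, A2 @ x, and (A1 + A2) @ x. The product A @ x can be
# recovered from ANY two of the three results, so one straggling or
# failed worker can simply be ignored.

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

A1, A2 = A[:2], A[2:]              # two row-blocks of A
tasks = {                          # what each (hypothetical) worker computes
    "w1": A1 @ x,
    "w2": A2 @ x,
    "w3": (A1 + A2) @ x,           # coded (parity) task
}

def decode(results):
    """Recover A @ x from any two worker outputs."""
    if "w1" in results and "w2" in results:
        return np.concatenate([results["w1"], results["w2"]])
    if "w1" in results and "w3" in results:
        return np.concatenate([results["w1"], results["w3"] - results["w1"]])
    if "w2" in results and "w3" in results:
        return np.concatenate([results["w3"] - results["w2"], results["w2"]])
    raise ValueError("need at least two worker results")

# Pretend worker w2 straggles: decode from the remaining two results.
partial = {k: v for k, v in tasks.items() if k != "w2"}
assert np.allclose(decode(partial), A @ x)
```

With this redundancy, the runtime is determined by the second-fastest worker rather than the slowest, which is the basic trade-off the talk explores at much larger scale and for more general operations.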
Parts of this work have been accepted at AISTATS 2018 and ISIT 2018.
See the next talks at the seminar webpage.
Please spread the news and join us for our weekly habit of beginning the week with an interesting machine learning talk!
Welcome!