Data augmentation (DA) algorithms are widely used for Bayesian inference due to their simplicity. In massive data settings, however, DA algorithms are prohibitively slow because they pass through the full data in every iteration, which severely restricts their usage despite their advantages. Addressing this problem, we develop a framework that extends any DA algorithm by exploiting asynchronous and distributed computing. The extended DA algorithm is indexed by a parameter 0 < r < 1 and is called Asynchronous and Distributed (AD) DA, with the original DA as its parent. Any ADDA starts by dividing the full data into k smaller disjoint subsets and storing them on k processes, which could be machines or processors. Every iteration of ADDA augments only an r-fraction of the k data subsets with some positive probability and leaves the remaining (1-r)-fraction of the augmented data unchanged. The parameter draws are obtained using the r-fraction of new and the (1-r)-fraction of old augmented data. For many choices of k and r, the fractional updates of ADDA lead to a significant speed-up over the parent DA in massive data settings, and ADDA reduces to the distributed version of its parent DA when r=1. The ADDA Markov chain is Harris ergodic with the desired stationary distribution under mild conditions on the parent DA algorithm. We demonstrate the numerical advantages of ADDA in three representative examples corresponding to different kinds of massive data settings encountered in applications. In all these examples, our DA generalization is significantly faster than its parent DA algorithm for all the choices of k and r.
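As a rough illustration of the fractional-update idea described above, the following is a minimal serial sketch, not the authors' implementation. It assumes a toy parent DA (a Student-t location model written as a normal scale mixture with a flat prior on the location), and it assumes that each of the k subsets is refreshed independently with probability r in every iteration. The function names `draw_weights` and `adda_t_location` are hypothetical, and no real asynchronous or distributed execution takes place; the k "processes" are simulated in a loop.

```python
import numpy as np


def draw_weights(rng, ys, mu, nu):
    """Augmentation step for one subset: latent precisions w_i given mu.

    Toy model (an assumption for this sketch): y_i | w_i ~ N(mu, 1 / w_i),
    w_i ~ Gamma(nu / 2, rate = nu / 2), so that
    w_i | mu, y_i ~ Gamma((nu + 1) / 2, rate = (nu + (y_i - mu)^2) / 2).
    """
    shape = (nu + 1.0) / 2.0
    rate = (nu + (ys - mu) ** 2) / 2.0
    return rng.gamma(shape, 1.0 / rate)  # NumPy parameterizes by scale = 1 / rate


def adda_t_location(y, k=4, r=0.5, nu=4.0, n_iter=500, seed=0):
    """Serial simulation of fractional augmentation updates over k subsets."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(np.asarray(y, dtype=float), k)  # k disjoint subsets
    mu = float(np.median(y))

    # Initial augmentation on every subset so that "old" augmented data exist.
    w = [draw_weights(rng, ys, mu, nu) for ys in subsets]

    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Assumption: each subset is refreshed independently with probability r.
        refresh = rng.random(k) < r
        for j in np.flatnonzero(refresh):
            w[j] = draw_weights(rng, subsets[j], mu, nu)

        # Parameter draw pools new and old augmented data from all subsets.
        sum_w = sum(wj.sum() for wj in w)
        sum_wy = sum((wj * ys).sum() for wj, ys in zip(w, subsets))
        mu = rng.normal(sum_wy / sum_w, 1.0 / np.sqrt(sum_w))
        draws[t] = mu
    return draws


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    y = 3.0 + data_rng.standard_t(4, size=2000)
    draws = adda_t_location(y, k=8, r=0.25)
    print("posterior mean of mu (after burn-in):", draws[100:].mean())
```

With r = 1 every subset is refreshed in every iteration, so the sketch behaves like the toy parent DA run over k subsets; smaller r means fewer augmentation passes per iteration, at the cost of reusing older augmented data.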
Sanvesh is an Associate Professor in the Department of Statistics and Actuarial Science at the University of Iowa. He arrived at Iowa in Fall 2015 as an Assistant Professor after pursuing post-doctoral research at Duke University, where David Dunson and Barbara Engelhardt were his mentors. He received his PhD in Statistics in 2013 from Purdue University under the guidance of Rebecca Doerge. His academic journey in Statistics started at IIT Kanpur, where he received a five-year integrated M.Sc. in Mathematics and Scientific Computing in 2007.
His current research interests focus on scalable Bayesian computations, EM-type algorithms, and dimension reduction. He has recently been interested in applications of array-variate and Gaussian process models to biomedical data analysis, including data from multiple technologies and sources.