
Fault Resilient Computing with MPI

Published: August 28, 2011

Time: 10:00-11:30 am, August 1th, 2011 (Tuesday)
Place: Room 446, ICT, CAS
Speaker: Dr. Pavan Balaji


Abstract: 
With the current push towards massive-scale computing systems that comprise millions or billions of processing elements, faults are quickly becoming commonplace in scientific simulations. In this talk, I'll describe some of the recent fault resilience efforts within MPI, the most commonly used model for parallel programming. Specifically, I'll discuss some traditional transparent fault resilience techniques, upcoming features in MPI-3 that deal with fault tolerance, application-level fault tolerance techniques, and fault coordination techniques between multiple software components. I'll also briefly talk about "silent faults", which are an increasing concern in massive-scale systems and a fundamental deviation from the current model, where we assume that faults are identifiable errors.

Bio:
Dr. Pavan Balaji holds a joint appointment as an Assistant Computer Scientist at Argonne National Laboratory and as a research fellow of the Computation Institute at the University of Chicago. He received his Ph.D. from the Computer Science and Engineering department at the Ohio State University. He has more than 80 publications in these areas and has delivered nearly 120 talks and tutorials at various conferences and research institutes. He has received several awards for his research activities, including an Outstanding Researcher award at the Ohio State University, the Director's Technical Achievement award at Los Alamos National Laboratory, and several best paper and other awards. Dr. Balaji has also chaired or edited more than 30 journals, conferences, and workshops, and has served as a technical program committee member for numerous conferences and workshops. He is also a member of the IEEE and ACM.
