Predictive reliability and fault management in exascale systems: State of the art and perspectives

Canal Corretger, Ramon; Hernández Luz, Carles; Tornero Gavilá, Rafael; Cilardo, Alessandro; Massari, Giuseppe; Reghenzani, Federico; Fornaciari, William; Zapater Sancho, Marina; Atienza, David; Oleksiak, Ariel; Piatek, Wojciech; Abella Ferrer, Jaume

oai:upcommons.upc.edu:2117/330352

Publication date: 1 September 2020

Abstract

Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.

This work has received funding from the European Union’s Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.

Peer Reviewed. Postprint (author's final draft).

Last time updated on 19/11/2020

This paper was published in UPCommons. Portal del coneixement obert de la UPC.
