
Transparent Fault Tolerance for Scalable Functional Computation

Abstract

Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state.

We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker.

HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism.

HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring.

Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and an HPC platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1400 cores on the HPC; reliability and recovery overheads are consistently low even at scale.
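
The "distributed fork/join-style programming model" mentioned in the abstract can be pictured with a small sketch. The Haskell code below is a minimal, single-node sketch only: the names Par, IVar, spawn and get mirror the shape of the HdpH/HdpH-RS interface, but the types and the implementation here are hypothetical simplifications (real HdpH tasks are serialisable Closures scheduled by a distributed runtime, not local IO threads).

    -- Minimal fork/join sketch; the Par/IVar/spawn/get stand-ins below are
    -- hypothetical simplifications, not the actual HdpH or HdpH-RS API.
    module Main where

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)

    -- Stand-ins for HdpH's Par monad and write-once IVars.
    type Par a = IO a
    newtype IVar a = IVar (MVar a)

    -- Fork a stateless task; its result is delivered through a fresh IVar.
    spawn :: Par a -> Par (IVar a)
    spawn task = do
      v <- newEmptyMVar
      _ <- forkIO (task >>= putMVar v)
      return (IVar v)

    -- Block until the spawned task has written its result (join).
    get :: IVar a -> Par a
    get (IVar v) = readMVar v

    -- Divide-and-conquer Fibonacci in fork/join style. Because tasks are
    -- stateless, a lost task could simply be re-executed without changing
    -- the result, which is the property HdpH-RS exploits for transparent
    -- recovery.
    fib :: Integer -> Par Integer
    fib n
      | n < 2     = return n
      | otherwise = do
          v <- spawn (fib (n - 1))   -- fork a subtask
          y <- fib (n - 2)
          x <- get v                 -- join on its result
          return (x + y)

    main :: IO ()
    main = fib 20 >>= print

As the abstract states, the HdpH-RS DSL is exactly the same as the HdpH DSL, so a program written in this style opts in or out of fault tolerant execution through the runtime's supervised scheduling rather than through any change to the source code.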


This paper was published in Heriot-Watt Pure.
