
Transparent Fault Tolerance for Scalable Functional Computation

Abstract

Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state.

We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker.

HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism.

HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring.

Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and an HPC platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1400 cores on the HPC; reliability and recovery overheads are consistently low even at scale.
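
The "distributed fork/join-style programming model" mentioned in the abstract can be pictured with a small sketch. The Haskell code below is a minimal, single-node sketch only: the names Par, IVar, spawn and get mirror the shape of the HdpH/HdpH-RS interface, but the types and the implementation here are hypothetical simplifications (real HdpH tasks are serialisable Closures scheduled by a distributed runtime, not local IO threads).

    -- Minimal fork/join sketch; the Par/IVar/spawn/get stand-ins below are
    -- hypothetical simplifications, not the actual HdpH or HdpH-RS API.
    module Main where

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)

    -- Stand-ins for HdpH's Par monad and write-once IVars.
    type Par a = IO a
    newtype IVar a = IVar (MVar a)

    -- Fork a stateless task; its result is delivered through a fresh IVar.
    spawn :: Par a -> Par (IVar a)
    spawn task = do
      v <- newEmptyMVar
      _ <- forkIO (task >>= putMVar v)
      return (IVar v)

    -- Block until the spawned task has written its result (join).
    get :: IVar a -> Par a
    get (IVar v) = readMVar v

    -- Divide-and-conquer Fibonacci in fork/join style. Because tasks are
    -- stateless, a lost task could simply be re-executed without changing
    -- the result, which is the property HdpH-RS exploits for transparent
    -- recovery.
    fib :: Integer -> Par Integer
    fib n
      | n < 2     = return n
      | otherwise = do
          v <- spawn (fib (n - 1))   -- fork a subtask
          y <- fib (n - 2)
          x <- get v                 -- join on its result
          return (x + y)

    main :: IO ()
    main = fib 20 >>= print

As the abstract states, the HdpH-RS DSL is exactly the same as the HdpH DSL, so a program written in this style opts in or out of fault tolerant execution through the runtime's supervised scheduling rather than through any change to the source code.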


This paper was published in Heriot-Watt Pure.
