Checkpointing as a Service in Heterogeneous Cloud Environments

Cao, Jiajun; Simonin, Matthieu; Cooperman, Gene; Morin, Christine

Authors: Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin
Publication date: 7 November 2014
Publisher: HAL CCSD

Abstract

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.

Résumé (translated from French)

The paper presents an approach that provides application checkpoint-restart support for cloud platforms. The approach is agnostic to the infrastructure provider. In general, a checkpoint-restart mechanism enables (a) uniform fault tolerance for long-running applications, where fault tolerance is usually delegated to mechanisms specific to each application; and (b) easier scheduling of the jobs submitted to the cloud, since jobs can be suspended while higher-priority jobs run. The proposed approach also supports parallel and distributed applications over both TCP and InfiniBand, which allows traditional HPC applications to integrate easily into cloud architectures. In addition, an application-monitoring mechanism is proposed that assesses the state of an application and, if necessary, restarts it from a previously saved state. This mechanism is likewise agnostic to the infrastructure. The validity of these mechanisms is demonstrated by implementing and evaluating a service on two different cloud platforms: Snooze and OpenStack. The agnosticism of the implementation also enables, for the first time, migration of applications from one cloud platform to another.

INRIA a CCSD electronic archive server

oai:HAL:hal-01086834v2

Last time updated on 09/11/2016

This paper was published in INRIA a CCSD electronic archive server.
