Designing Large-Scale Distributed Systems for Realistic Failure Models


Project Overview

There is a need for a better understanding of how large-scale distributed systems fail, especially because many such systems have already been built or deployed, and many more are on the way. The goal of this project is twofold:

Our investigations are in the context of enterprise desktop computing environments (i.e., running parallel applications on volatile resources), data grids (i.e., running scientific applications and data on high-end resources over the wide area), and survivable systems (i.e., Internet systems that must be resilient to widespread outbreaks of Internet pathogens).

Collaboration

This project is funded by the National Science Foundation and is a collaboration between the Computer Science and Engineering Department at the University of California, San Diego and the Information and Computer Sciences Department at the University of Hawai`i at Mānoa.