Designing Large-Scale Distributed Systems for Realistic Failure Models
Large-scale distributed systems are already being built and deployed,
with many more on the way, yet how such systems fail is still not
well understood. The goal of this project is twofold:
- Develop better failure models for real-world large-scale distributed systems, and make the datasets and analyses available.
- Investigate how these failure models can be used to engineer systems with better dependability, availability, or performance.
We carry out these investigations in the context of enterprise desktop computing environments (i.e., running parallel applications on volatile resources), data grids (i.e., running scientific applications and data on high-end resources over the wide area), and survivable systems (i.e., Internet systems that must be resilient to widespread outbreaks of Internet pathogens).
This project is funded by the National Science
Foundation and is a collaboration between the Computer Science and Engineering
Department at the University of
California, San Diego and the Information and Computer Sciences
Department at the University of
Hawai`i at Mānoa.