SC Conference - Activity Details
Scalable Fault-Tolerant HPC Supercomputers
Primary Session Leader:
Maria McLaughlin
(Appro International Inc.)
Secondary Session Leader:
High-performance computing systems consist of thousands of nodes and tens of thousands of cores, all connected via high-speed networking such as 40 Gb/s InfiniBand. Future systems will include even more nodes and cores, and keeping all of them available for the duration of long scientific simulation runs will become increasingly difficult. One solution to this challenge is to make scalable fault tolerance an essential part of the HPC system architecture. The session will review scalable fault-tolerant architectures and examples of energy-efficient, scalable supercomputing clusters that use dual quad data rate (QDR) InfiniBand to combine capacity computing with network failover capabilities, with the help of programming models such as MPI and a robust Linux cluster management package. The session will also discuss the role fault tolerance plays in multicore systems and the modifications required to sustain long scientific and engineering simulations on those systems.