North American Systems International
Where Innovation Meets Solutions
Improving reliability, availability, & serviceability with IBM’s Power Systems
IBM POWER6-based systems are designed to maximize reliability, availability, and serviceability (RAS) through a variety of unique features. Their new checkpoint system includes a completely redesigned recovery unit, error logger, and restart mechanism that transparently catch errors.
IBM’s POWER6 technology employs mainframe-class error detection that continually monitors the system for potential problems. Dedicated service processors watch for faults; once a fault is detected, the system automatically isolates the problem using First Failure Data Capture (FFDC) and initiates one of a wide range of recovery mechanisms to correct it before it affects operations.
When a component is flagged for deallocation, the faulty hardware can be taken out of service while the system is running or at boot time. POWER6 systems feature dynamic deallocation of PCI-X and PCIe slots, as well as extended error handling. On systems with two or more cores, dynamic deallocation of processors is also supported.
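The detect, capture, and deallocate sequence described above can be pictured with a small software analogy. This is only a sketch under stated assumptions: POWER6 performs these steps in hardware and service-processor firmware, and the class `FaultMonitor`, its method `report_fault`, and the component names are hypothetical, invented for illustration. The key FFDC idea it shows is that diagnostics are kept from the first occurrence of a fault, so the failure does not have to be reproduced later.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FaultMonitor:
    """Hypothetical service-processor-style monitor (illustrative only)."""
    active: Set[str]                                   # components available for work
    ffdc_log: Dict[str, dict] = field(default_factory=dict)
    deallocated: List[str] = field(default_factory=list)

    def report_fault(self, component: str, error: str, context: dict) -> None:
        # First Failure Data Capture: record diagnostics only for the FIRST
        # fault seen on a component; later faults do not overwrite it.
        if component not in self.ffdc_log:
            self.ffdc_log[component] = {
                "error": error,
                "context": dict(context),
                "time": time.time(),
            }
        # Isolation: flag the component and stop allocating work to it,
        # while the rest of the system keeps running.
        if component in self.active:
            self.active.discard(component)
            self.deallocated.append(component)


# Usage: two faults on the same slot; only the first is captured.
mon = FaultMonitor(active={"cpu0", "cpu1", "pcie-slot3"})
mon.report_fault("pcie-slot3", "parity error", {"addr": 0x1F00})
mon.report_fault("pcie-slot3", "timeout", {"addr": 0x1F08})
print(sorted(mon.active))                    # ['cpu0', 'cpu1']
print(mon.ffdc_log["pcie-slot3"]["error"])   # parity error
```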
IBM’s POWER6 systems are designed to automatically retry failed instructions so that execution can continue. The chip records the data it is storing; if an error is detected, it reverts to its previous state and retries the processing step.
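The checkpoint-and-retry idea can be sketched in software as follows. This is an analogy, not IBM’s implementation: the processor does this in hardware at instruction granularity, while here a "step" is an ordinary Python function, and the names `run_with_retry` and `TransientFault` are hypothetical. The point it illustrates is that every retry restarts from the saved checkpoint, so a partially completed, faulty attempt leaves no trace.

```python
import copy
from typing import Callable


class TransientFault(Exception):
    """Hypothetical signal that a step hit a detected (e.g. soft) error."""


def run_with_retry(state: dict, step: Callable[[dict], None],
                   max_retries: int = 1) -> dict:
    # Checkpoint: snapshot the state before the step runs.
    checkpoint = copy.deepcopy(state)
    attempts = 0
    while True:
        # Each attempt works on a fresh copy of the checkpoint (rollback).
        working = copy.deepcopy(checkpoint)
        try:
            step(working)
            return working            # success: commit the new state
        except TransientFault:
            attempts += 1
            if attempts > max_retries:
                raise                 # retry exhausted: escalate to the caller


# Usage: a step that fails once after partially updating state, then succeeds.
calls = {"n": 0}

def flaky_increment(s: dict) -> None:
    calls["n"] += 1
    s["acc"] += 1                     # partial update before the fault
    if calls["n"] == 1:
        raise TransientFault("soft error on first attempt")

result = run_with_retry({"acc": 41}, flaky_increment)
print(result["acc"])                  # 42, not 43: the faulty attempt was rolled back
```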
If the retry fails, the work is moved to another processor, chosen proactively according to a specific set of criteria. The system first looks for a spare, unlicensed Capacity on Demand (CoD) processor; if none is available, it attempts to move the work to an unused processor. If all processors are in use, POWER6 dynamically allocates resources on a processor that is in use but has additional compute capacity available.
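The selection order above can be expressed as a simple priority function. This is a minimal sketch assuming a hypothetical `Core` record and a `pick_alternate` helper, neither of which is an IBM interface. The tie-break in the last step (choosing the busy processor with the most free capacity) is also an assumption, since the text only says a processor with additional capacity is used.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Core:
    core_id: int
    licensed: bool          # False = unlicensed Capacity on Demand (CoD) spare
    in_use: bool            # currently running a workload
    spare_capacity: float   # fraction of compute capacity still free (0.0 to 1.0)


def pick_alternate(cores: Sequence[Core], failed_id: int) -> Optional[Core]:
    candidates = [c for c in cores if c.core_id != failed_id]
    # 1. Prefer a spare, unlicensed CoD processor.
    for c in candidates:
        if not c.licensed:
            return c
    # 2. Otherwise use an unused (idle) processor.
    for c in candidates:
        if not c.in_use:
            return c
    # 3. Otherwise share a busy processor that still has capacity
    #    (assumed tie-break: the one with the most free capacity).
    busy = [c for c in candidates if c.spare_capacity > 0]
    if busy:
        return max(busy, key=lambda c: c.spare_capacity)
    return None             # no alternate processor available


# Usage: no CoD spare and no idle core, so the least-loaded busy core wins.
cores = [
    Core(0, licensed=True, in_use=True, spare_capacity=0.0),   # failed core
    Core(1, licensed=True, in_use=True, spare_capacity=0.25),
    Core(2, licensed=True, in_use=True, spare_capacity=0.5),
]
print(pick_alternate(cores, failed_id=0).core_id)   # 2
```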
Redundancy is built into much of the POWER6 architecture; even the power supplies to the processor are redundant.
Memory and cache arrays are protected by an error-resilient architecture. ECC (error-correcting code) enables systems to detect errors and correct them automatically, and IBM’s enhanced ChipKill™ technology enables systems to survive the failure of an entire DRAM chip, with features such as Redundant Bit Steering. In addition, the POWER6 architecture supports ECC for the L2 and L3 caches.
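To make "detect errors and correct them automatically" concrete, the sketch below implements a textbook extended Hamming code (SEC-DED: single-error correct, double-error detect) over 4 data bits. It is an illustration of the general ECC principle only, not the code IBM uses: real memory ECC protects much wider words, and ChipKill-class protection goes further by tolerating the loss of a whole DRAM chip, which this sketch does not model. The functions `encode` and `decode` are hypothetical helpers.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    assert len(data) == 4
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data       # data bits at non-power-of-2 positions
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2                     # overall parity over positions 1..7
    return code


def decode(code: List[int]) -> Tuple[List[int], str]:
    """Return (data bits, status), status being 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)                               # do not mutate the caller's word
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        status = "uncorrectable"
    return [code[3], code[5], code[6], code[7]], status


# Usage
word = encode([1, 0, 1, 1])
one_flip = word.copy(); one_flip[6] ^= 1
print(decode(one_flip))              # ([1, 0, 1, 1], 'corrected')
two_flips = word.copy(); two_flips[3] ^= 1; two_flips[5] ^= 1
print(decode(two_flips)[1])          # uncorrectable
```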