Research Terms
Computer Architecture; Computer Software; Computer Security; Parallel Processing Systems; Integrated Hardware Software Systems
Industries
Military IT Software & Computer Systems Design Specialized Logistics IT
ACM, Member; 2017 - present
IEEE, Fellow; 2003 - present
Program Chair, HPCA; 2020 - 2020
Program Director, National Science Foundation; 2015 - 2018
Researchers at the University of Central Florida and North Carolina State University have developed a way to reduce the execution time and write amplification associated with restoring data from non-volatile main memory (NVMM). Current crash recovery solutions use logging or checkpointing to provide failure safety to applications. However, these solutions were designed for systems with volatile main memory and non-volatile disks, not for NVMM-based systems. As a result, they incur much higher execution time and write endurance overheads when applied to NVMM. Existing technologies also require specific hardware or instruction set architecture (ISA) support to recover NVMM-stored data, and such support is not readily available in most machines today.
In comparison, the UCF technology provides unique data writing and backup methods that use persistent main memory effectively, so that data recovery after a crash is faster and more accurate. The approach combines two main methods: recomputation and lazy persistency (LP). This combination avoids the need to expend large amounts of energy to rewrite lost data. Because the approach is implemented in software, companies can rewrite their software and run it on any hardware platform to obtain the system recovery benefits.
Technical Details
The UCF invention comprises methods for accelerating program execution on NVMM while at the same time reducing the number of writes. The methods include steps for organizing a set of instructions into multiple regions. At least one of the regions is a recovery unit, and another is an error checking unit. The recovery unit includes the written data to be transferred to NVMM, while the error checking unit summarizes the written data into a single value (for example, a checksum).
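The region structure described above can be illustrated with a minimal Python sketch. It is a simulation under stated assumptions, not the invention's actual implementation: the names `REGION_SIZE`, `compute`, `checksum`, and `run_region` are hypothetical, ordinary Python lists stand in for NVMM, and CRC-32 is just one possible way to summarize written data into a value.

```python
import struct
import zlib

REGION_SIZE = 4  # hypothetical number of results written per region


def compute(i):
    """Stand-in for the application's real computation (hypothetical)."""
    return i * i


def checksum(values):
    """Summarize a region's written data into a single value (CRC-32 here)."""
    payload = struct.pack(f"<{len(values)}q", *values)
    return zlib.crc32(payload)


def run_region(region_id, data, checksums):
    """Run one region: write its results, then record their checksum."""
    start = region_id * REGION_SIZE
    written = []
    # Recovery unit: the data this region writes toward NVMM.
    for i in range(start, start + REGION_SIZE):
        data[i] = compute(i)  # plain store, no explicit flush or fence
        written.append(data[i])
    # Error checking unit: one value that summarizes the written data.
    checksums[region_id] = checksum(written)


data = [0] * 8          # simulated NVMM-resident output array
checksums = [None] * 2  # simulated NVMM-resident checksum table
for region_id in range(2):
    run_region(region_id, data, checksums)
```

Keeping the checksum per region rather than per value keeps the extra writes small relative to the data the region produces, which is the point of summarizing the written data into one value.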
One key aspect of the invention relaxes the data consistency requirements of logging and checkpointing schemes. Instead of keeping data consistent at all times, it allows data to be in an inconsistent state during some phases of a program's lifetime and logs only enough state to enable recomputation. When a failure occurs, the approach recovers to a consistent state by determining which parts of the computation were incomplete and then recomputing them.
Another aspect is the use of LP, a software persistency method. LP exploits natural cache evictions to provide persistency without the need to eagerly flush cache blocks from the cache to the NVMM. Thus, the technique allows caches to send dirty blocks (that is, modified and unsaved data) to the NVMM gradually through natural evictions. Software error detection mechanisms (checksums) enable the system to discover persistency failures. Compared to the state-of-the-art Eager Persistency technique, LP reduces the execution time overhead from 9 percent to 1 percent and the write amplification overhead from 21 percent to 3 percent.
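The recovery step can be sketched in the same spirit. The following Python simulation is an illustration under stated assumptions rather than the inventors' code: `recover`, `compute`, `checksum`, and `REGION_SIZE` are hypothetical names, the post-crash NVMM image is hand-built, and CRC-32 stands in for whatever checksum a real system would use. It shows a persistency failure being detected by a checksum mismatch and repaired by recomputing only the affected region.

```python
import struct
import zlib

REGION_SIZE = 4  # hypothetical number of results per region


def compute(i):
    """Stand-in for the application's real computation (hypothetical)."""
    return i * i


def checksum(values):
    """Summarize a region's data into a single value (CRC-32 here)."""
    return zlib.crc32(struct.pack(f"<{len(values)}q", *values))


def recover(nvmm_data, nvmm_checksums):
    """Recompute each region whose stored checksum does not match its data."""
    recomputed = []
    for region_id, stored in enumerate(nvmm_checksums):
        start = region_id * REGION_SIZE
        end = start + REGION_SIZE
        if stored is None or checksum(nvmm_data[start:end]) != stored:
            # Persistency failure detected: redo this region's computation.
            for i in range(start, end):
                nvmm_data[i] = compute(i)
            nvmm_checksums[region_id] = checksum(nvmm_data[start:end])
            recomputed.append(region_id)
    return recomputed


# Simulated NVMM image after a crash. Region 0 was fully evicted to NVMM.
# In region 1 the checksum reached NVMM, but one dirty block (index 6)
# was still in the cache when the crash happened, so its value was lost.
nvmm_data = [0, 1, 4, 9, 16, 25, 0, 49]
nvmm_checksums = [checksum([0, 1, 4, 9]), checksum([16, 25, 36, 49])]

recomputed = recover(nvmm_data, nvmm_checksums)
print(recomputed)  # prints [1]
print(nvmm_data)   # prints [0, 1, 4, 9, 16, 25, 36, 49]
```

In this sketch only region 1 is recomputed. Region 0 passes its checksum and is left untouched, which reflects how the approach avoids rewriting data that already persisted.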
Stage of Development
Prototype available.