ACM Distinguished Speaker

ACM ▪ 2019
ISCA Hall of Fame

IEEE TCCA and ACM SIGARCH ▪ 2018
IEEE Fellow

IEEE ▪ 2017
MICRO Best Paper Runner-Up Award

ACM International Symposium on Microarchitecture ▪ 2017
Outstanding mentoring of young faculty

College of Engineering, NCSU ▪ 2016
ISPASS Best Paper Award Nomination

ISPASS ▪ 2013
IPDPS Best Paper Award Nomination

IPDPS ▪ 2012
IEEE MICRO Top Picks

IEEE ▪ 2011
HPCA Hall of Fame

IEEE International Symposium on High-Performance Computer Architecture ▪ 2011
Faculty Partnership Award

IBM ▪ 2010
Global Center of Excellence Visiting Professor

Waseda University ▪ 2008
Faculty Partnership Award

IBM ▪ 2005
Best Paper Nomination

HPCA ▪ 2005
NSF CAREER Award

National Science Foundation ▪ 2004
AT&T Leadership Award

AT&T ▪ 1997

Publications

Search Google Scholar for Yan Solihin

Memberships

ACM, Member; 2017 - present

IEEE, Fellow; 2003 - present

Peer Review Positions

Program Chair, HPCA; 2020

Program Director, National Science Foundation; 2015 - 2018

Technologies

Crash Recovery Improvements in Non-Volatile Main Memory (NVMM)

Abstract

Researchers at the University of Central Florida and North Carolina State University have developed a way to reduce the execution time and write amplification associated with restoring data from non-volatile main memory (NVMM). Current crash recovery solutions use logging or checkpointing to provide failure safety to applications. However, these solutions were designed for systems with volatile main memory and non-volatile disks, not for NVM-based systems. As a result, they incur much higher execution time and write endurance overheads when applied to NVMM. Existing technologies also require specific hardware support or instruction set architecture (ISA) support to recover NVMM-stored data, and these forms of support are not readily available in most machines today.

In comparison, the UCF technology provides unique data writing/backup methods to effectively use persistent main memory so that data recovery after a crash is faster and more accurate. The approach uses two main methods: recomputation and lazy persistency (LP). This combination avoids the need to expend large amounts of energy to rewrite lost data. Companies can rewrite their software and run it on any hardware platform to obtain the system recovery benefits.

Technical Details

The UCF invention comprises methods for accelerating program execution on NVM while at the same time reducing the number of writes. Included are steps for organizing a set of instructions into multiple regions. At least one of the regions is a recovery unit, and another is an error checking unit. The recovery unit includes written data to be transferred to NVMM, while the error checking unit summarizes the written data into a value.
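The region structure described above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the function names, the squaring workload, and the choice of CRC32 as the summarizing value are all assumptions made for illustration, and plain dictionaries stand in for NVMM.

```python
import struct
import zlib


def checksum(values):
    """Error checking unit: summarize written data into one 32-bit value.

    CRC32 is an assumed choice; the source only says the data is
    summarized into a value.
    """
    payload = b"".join(struct.pack("<q", v) for v in values)
    return zlib.crc32(payload)


def run_region(region_id, inputs, data_store, checksum_store):
    """Execute one region of the program.

    The recovery unit computes results and writes them to the
    (simulated) NVMM data store; the error checking unit records a
    checksum of exactly those written values.
    """
    results = [x * x for x in inputs]            # hypothetical workload
    data_store[region_id] = results              # writes destined for NVMM
    checksum_store[region_id] = checksum(results)
    return results
```

Calling `run_region(0, [1, 2, 3], data, sums)` with two empty dictionaries leaves the written values in `data[0]` and their summary in `sums[0]`, which is the pairing a later consistency check would rely on.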

One key aspect of the invention relaxes the data consistency requirements of logging and checkpointing schemes. Instead, it allows data to be in an inconsistent state during some phases of a program's lifetime by logging only enough state to enable recomputation. When a failure occurs, the approach recovers to a consistent state by determining which parts of the computation were incomplete and then recomputing them. Another aspect is the use of LP, a software persistency method. LP exploits natural cache evictions to provide persistency without the need to eagerly flush cache blocks from the cache to the NVMM. Thus, the technique allows caches to write dirty blocks (that is, modified and unsaved data) back to the NVMM gradually through natural evictions. Software error detection mechanisms (checksums) enable the system to discover persistency failures. Compared to the state-of-the-art Eager Persistency technique, LP reduces the execution time and write amplification overheads from 9 percent and 21 percent to only 1 percent and 3 percent, respectively.
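The recovery flow described above (detect regions whose data did not persist, then recompute only those) can be sketched as follows. This is a simplified simulation under stated assumptions, not the patented method: dictionaries model NVMM, `simulate_crash` is a hypothetical helper that zeroes the last value of a region to mimic a dirty block that was never evicted, CRC32 is an assumed checksum, and the squaring workload is a placeholder.

```python
import struct
import zlib

REGION_SIZE = 4  # assumed number of inputs per region


def checksum(values):
    """Summarize a region's written values (CRC32 is an assumed choice)."""
    return zlib.crc32(b"".join(struct.pack("<q", v) for v in values))


def compute(region_inputs):
    """Hypothetical deterministic workload, so it can be re-executed."""
    return [x * x for x in region_inputs]


def split_regions(inputs):
    """Split the input list into fixed-size regions."""
    return [inputs[i:i + REGION_SIZE] for i in range(0, len(inputs), REGION_SIZE)]


def execute(inputs):
    """Normal run: write results and checksums, with no eager flushes."""
    data, sums = {}, {}
    for rid, region_inputs in enumerate(split_regions(inputs)):
        data[rid] = compute(region_inputs)
        sums[rid] = checksum(data[rid])
    return data, sums


def simulate_crash(data, lost):
    """Return the NVMM image after a crash: in each lost region the last
    value was still in cache, so NVMM holds a stale zero instead."""
    image = {rid: list(values) for rid, values in data.items()}
    for rid in lost:
        image[rid][-1] = 0
    return image


def recover(inputs, image, sums):
    """Recompute every region whose data is missing or fails its checksum."""
    recomputed = []
    for rid, region_inputs in enumerate(split_regions(inputs)):
        intact = (rid in image and rid in sums
                  and checksum(image[rid]) == sums[rid])
        if not intact:
            image[rid] = compute(region_inputs)
            sums[rid] = checksum(image[rid])
            recomputed.append(rid)
    return recomputed
```

For example, after `data, sums = execute(list(range(10)))` and `image = simulate_crash(data, lost={1})`, calling `recover(list(range(10)), image, sums)` re-executes only region 1 and leaves `image` equal to `data`. The sketch assumes the checksums themselves survived the crash; a region whose checksum is missing is simply treated as incomplete and recomputed.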

Stage of Development

Prototype available.