cuda_memtest is a tool for testing GPU memory for errors, whether hardware- or software-related.
The software runs kernels that write specific patterns to GPU memory, then read the memory back and verify that the patterns are intact. Any mismatch is recorded as an error. Although initially developed for CUDA, it also supports AMD GPUs via HIP.
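The write-then-verify idea can be illustrated with a small host-side sketch. This is not cuda_memtest's code: the real tool runs its checks in GPU kernels, while the snippet below uses a plain `bytearray` as a stand-in for device memory. The function names `write_pattern` and `verify_pattern`, the buffer size, and the injected fault are all assumptions made for illustration.

```python
def write_pattern(buf: bytearray, pattern: int) -> None:
    """Fill every byte of the buffer with the given 8-bit pattern."""
    for i in range(len(buf)):
        buf[i] = pattern


def verify_pattern(buf: bytearray, pattern: int) -> list:
    """Read the buffer back; return (offset, expected, actual) per mismatch."""
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]


# Host-side stand-in for a GPU allocation (hypothetical size).
memory = bytearray(1024)
write_pattern(memory, 0xAA)
clean_errors = verify_pattern(memory, 0xAA)

# Flip one bit by hand to mimic a memory fault (illustrative only).
memory[100] ^= 0x01
fault_errors = verify_pattern(memory, 0xAA)

print("errors on clean memory:", clean_errors)
print("errors after injected fault:", fault_errors)
```

Each reported tuple holds the offset, the expected value, and the value actually read, which is roughly the information needed to locate a faulty cell.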
cuda_memtest is particularly useful for verifying the stability of GPUs in high-performance computing environments where memory reliability is critical, such as scientific simulations and machine learning tasks. It features 11 distinct tests, including walking bit tests, random pattern checks, and memory stress tests, which help in identifying different types of memory faults.
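As a rough illustration of what a walking bit test does, the sketch below writes a word with a single set bit to every location, verifies it, and then moves the bit one position to the left. It is a simplified host-side model under stated assumptions, not the tool's implementation: `StuckBitMemory`, `walking_ones_test`, the 32-bit word width, and the stuck-at-0 fault are all hypothetical.

```python
class StuckBitMemory:
    """Toy memory in which one bit of one word is stuck at 0 (hypothetical fault)."""

    def __init__(self, size: int, bad_index: int, bad_bit: int) -> None:
        self.size = size
        self.words = [0] * size
        self.bad_index = bad_index
        self.clear_mask = ~(1 << bad_bit)

    def write(self, index: int, value: int) -> None:
        if index == self.bad_index:
            value &= self.clear_mask  # the faulty bit can never be set
        self.words[index] = value

    def read(self, index: int) -> int:
        return self.words[index]


def walking_ones_test(mem: StuckBitMemory, width: int = 32) -> list:
    """Walk a single 1 bit through each word; return (index, expected, actual)."""
    errors = []
    for bit in range(width):
        pattern = 1 << bit
        for i in range(mem.size):
            mem.write(i, pattern)
        for i in range(mem.size):
            actual = mem.read(i)
            if actual != pattern:
                errors.append((i, pattern, actual))
    return errors


mem = StuckBitMemory(size=16, bad_index=5, bad_bit=3)
errors = walking_ones_test(mem)
print(errors)
```

Because only one bit is set at a time, a failure points directly at the affected bit position, which is why this kind of test is useful for isolating stuck bits.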
Related projects
PIConGPU
PIConGPU is an extremely scalable and platform-portable application for particle-in-cell simulations. While we mainly use it to study laser-plasma interactions, it has also found utility in astrophysical studies and simulations of matter under extreme conditions.
cuda_memtest is executed before every simulation to ensure that the accelerators work without any memory failures.