cuda_memtest

Tests GPU memory for hardware errors and soft errors using NVIDIA's CUDA or AMD's HIP.

1 contributor | 60 commits | Last commit 1 month ago

What cuda_memtest can do for you

cuda_memtest is a tool for testing GPU memory for errors, both hardware errors and soft errors.
It runs kernels that write specific patterns to GPU memory, reads the memory back, and verifies that the patterns are intact. Any mismatch is recorded as an error. Although initially developed for CUDA, it also supports AMD GPUs via HIP.
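
The write-then-verify idea described above can be illustrated with a minimal sketch. This is not cuda_memtest's actual code: it runs on ordinary host memory in plain C++ rather than in GPU kernels, and the names `MemError`, `write_pattern`, and `verify_pattern` are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One recorded mismatch: where it occurred, what was written, what was read.
// (Hypothetical type for illustration, not part of cuda_memtest.)
struct MemError {
    std::size_t index;
    std::uint32_t expected;
    std::uint32_t actual;
};

// Write phase: fill every word of the buffer with the test pattern.
void write_pattern(std::vector<std::uint32_t>& mem, std::uint32_t pattern) {
    for (auto& word : mem) {
        word = pattern;
    }
}

// Verify phase: read every word back and record each one that differs
// from the pattern that was written.
std::vector<MemError> verify_pattern(const std::vector<std::uint32_t>& mem,
                                     std::uint32_t pattern) {
    std::vector<MemError> errors;
    for (std::size_t i = 0; i < mem.size(); ++i) {
        if (mem[i] != pattern) {
            errors.push_back({i, pattern, mem[i]});
        }
    }
    return errors;
}
```

On healthy memory `verify_pattern` returns an empty list. A word that changed between the write and the read-back shows up with its index and both values, which is the kind of information needed to locate a faulty region.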

cuda_memtest is particularly useful for verifying the stability of GPUs in high-performance computing environments where memory reliability is critical, such as scientific simulations and machine learning tasks. It features 11 distinct tests, including walking bit tests, random pattern checks, and memory stress tests, which help identify different types of memory faults.
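
As an illustration of what a walking bit test looks for, the sketch below walks a single set bit through all 32 positions of a word and reports which positions fail to read back. It is a host-side C++ sketch under stated assumptions, not the tool's implementation: `FaultyWord` is a hypothetical fault model (bits that always read as 0), and `walking_ones` is a name invented for this example.

```cpp
#include <cstdint>

// Hypothetical fault model: a memory word in which the bits set in
// `stuck_low_mask` can never hold a 1. A mask of 0 models healthy memory.
struct FaultyWord {
    std::uint32_t stuck_low_mask = 0;
    std::uint32_t value = 0;

    void write(std::uint32_t v) { value = v & ~stuck_low_mask; }
    std::uint32_t read() const { return value; }
};

// Walking-ones test: write a pattern with exactly one bit set, read it back,
// and repeat for each of the 32 bit positions. Returns a mask of the bit
// positions whose pattern did not survive the round trip.
std::uint32_t walking_ones(FaultyWord& word) {
    std::uint32_t failed = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t pattern = std::uint32_t{1} << bit;
        word.write(pattern);
        if (word.read() != pattern) {
            failed |= pattern;
        }
    }
    return failed;
}
```

Because each pattern has only one bit set, a failure points directly at the affected bit position. A single fixed pattern cannot do this if it happens to write a 0 at that position.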

Related projects

PIConGPU

PIConGPU is an extremely scalable and platform-portable application for particle-in-cell simulations. While we mainly use it to study laser-plasma interactions, it has also found utility in astrophysical studies and simulations of matter under extreme conditions.
cuda_memtest is executed before each simulation to ensure that the accelerators work without memory failures.

Keywords
Programming languages
  • C++ 91%
  • CMake 7%
  • C 2%
License

Contributors

Related software

PIConGPU

PIConGPU is a relativistic particle-in-cell code that runs on graphics processing units as well as regular multi-core processors. It is open source and freely available for download. It can be used to study plasmas with relativistic dynamics by solving the Maxwell-Vlasov system of equations.
