In CoE RAISE, innovative AI methods on heterogeneous HPC architectures capable of scaling towards Exascale are developed and generalized for selected representative simulation codes and data-driven workflows. This repository demonstrates some of our recent work: training a Convolutional AutoEncoder (CAE) to reconstruct the actuated Turbulent Boundary Layer (TBL) dataset.
The TBL that develops on an airplane wing contributes about half of the total drag and strongly influences trailing-edge noise [1]. A promising drag reduction strategy is to actuate the TBL, i.e., to introduce a motion to the surface that covers the wing [2]. However, to achieve a sufficient drag reduction, well-adjusted actuation parameters such as amplitude and frequency need to be identified. Wrong parameters lead to unwanted drag increases or too much energy spent on the actuation itself. These parameters are usually scanned by performing a large number of computationally costly CFD runs. Here, a small parameter scan was performed using the high-fidelity Large-Eddy Simulation (LES) method, creating a dataset of 8.3 TB.
The dataset is available with more information on the setup at https://www.coe-raise.eu/od-tbl.
The reconstruction of the actuated TBL relies on CAEs. CAEs are unsupervised neural network models that compress the general properties of the input dataset into fewer parameters while learning to reconstruct the input from this compressed representation, i.e., decompression [3]. Due to their simple implementation, CAEs are widely used for reducing the dimensionality of large datasets. The principle of the CAE model is illustrated below:
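The encoder-decoder principle can be sketched in PyTorch as follows. This is a minimal illustration only; the layer sizes, kernel sizes, and input resolution are assumptions for demonstration and do not reflect the architecture in DDP_pytorch_AT_CAE_LD.py:

```python
import torch
import torch.nn as nn

class MiniCAE(nn.Module):
    """Minimal convolutional autoencoder sketch (illustrative sizes only)."""
    def __init__(self):
        super().__init__()
        # Encoder: compress a single-channel 2D flow-field slice to a
        # lower-dimensional feature map (H/4 x W/4 with 32 channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder: reconstruct the input from the compressed representation.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MiniCAE()
x = torch.randn(4, 1, 64, 64)  # batch of 4 single-channel 64x64 fields
recon = model(x)
print(recon.shape)             # reconstruction has the same shape as the input
```

Training then minimizes a reconstruction loss (e.g., the mean squared error between `x` and `recon`), forcing the bottleneck to retain the dominant flow features.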
Training a CAE with large datasets is computationally challenging and can only be performed efficiently when parallelization strategies are exploited. A common parallelization strategy is to distribute the input dataset across separate GPUs, where the gradients of the trainable parameters are synchronized between the GPUs during training. This method is called distributed data parallelism (DDP) and greatly reduces the training time. Depending on the size of the training dataset and the data exchange rate between the CPUs and/or GPUs, this type of parallelized training can scale to very large systems (even exascale).
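The essentials of this strategy can be sketched with PyTorch's Torch-DDP. The snippet below is a minimal single-process illustration (gloo backend on CPU, a toy linear model, and random data are all assumptions for demonstration); it is not the training setup from DDP_pytorch_AT_CAE_LD.py, where one process per GPU is launched and NCCL is typically used:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# In a real SLURM job, rank/world size come from the launcher; here we
# fake a one-process "cluster" so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(8, 8)   # stand-in for the CAE
ddp_model = DDP(model)    # gradients are all-reduced across workers

# Each worker sees a disjoint shard of the dataset via DistributedSampler.
data = TensorDataset(torch.randn(32, 8))
sampler = DistributedSampler(data, num_replicas=1, rank=0)
loader = DataLoader(data, batch_size=4, sampler=sampler)

opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for (x,) in loader:
    opt.zero_grad()
    loss = loss_fn(ddp_model(x), x)  # autoencoder-style reconstruction loss
    loss.backward()                  # DDP synchronizes gradients here
    opt.step()

dist.destroy_process_group()
print(f"final loss: {loss.item():.4f}")
```

With more than one process, only `rank`, `world_size`, and the sampler arguments change; the training loop itself stays identical, which is what makes DDP attractive for scaling existing scripts.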
The main drawback of the data parallelization strategy is the loss in training accuracy caused by the increased total batch size. As the input dataset is distributed to separate workers, the total batch size increases linearly with the number of workers, even though the batch size per worker remains fixed. That is, in data-parallel training with a large number of workers, the batch size inevitably becomes large, which leads to reduced training accuracy. This limits the linear scaling performance of the CAE training and makes the training accuracy an important factor when investigating the scaling performance of a CAE training. The loss of training accuracy and how to cope with it have been intensively addressed in the literature (e.g., [4]): other hyperparameters such as the learning rate, the batch size per worker, and the number of epochs can be adjusted to keep the training accuracy at an acceptable level.
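One common adjustment from [4] is the linear learning-rate scaling rule: scale the learning rate with the total batch size. The helper below is a hypothetical illustration of that rule; the base learning rate and batch sizes are made-up example values, not the hyperparameters used in this repository:

```python
def scaled_lr(base_lr: float, base_batch: int,
              per_worker_batch: int, num_workers: int) -> float:
    """Linear learning-rate scaling rule [4]:
    multiply the base LR by (total batch size / base batch size)."""
    total_batch = per_worker_batch * num_workers
    return base_lr * total_batch / base_batch

# Illustrative values: a base LR tuned for a total batch size of 32.
print(scaled_lr(1e-4, 32, 32, 1))    # 1 worker: LR unchanged
print(scaled_lr(1e-4, 32, 32, 128))  # 128 workers: LR scaled up 128x
```

In practice such a scaled learning rate is usually combined with a warm-up phase over the first few epochs, as large initial steps can destabilize training.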
We demonstrate here the use of a common open-source DDP framework library that comes with PyTorch, named Torch-DDP [5], to train a CAE on the actuated TBL dataset. Only two items are required: the training scripts DDP_pytorch_AT_CAE_LD.py and DDP_pytorch_AT_Reg_LD.py, and the start script DDP_startscript.sh. Detailed information on the code is given either through comments in the training scripts or in html/index.html.
Note that the start script DDP_startscript.sh is given as an example for running the training script on the Jülich Wizard for European Leadership Science (JUWELS) Booster system at the Jülich Supercomputing Centre (JSC) of Forschungszentrum Jülich (FZJ) [6], which uses the SLURM workload manager [7].
In our training over 1000 epochs, the reconstruction of the vertical velocity qualitatively looks like:
where input denotes the original flow field and output denotes its reconstruction. A total of 128 GPUs were used to train this network on the JUWELS Booster system. A scalability analysis was also performed for this training on the same system:
Our tests with larger numbers of GPUs showed that the actuated TBL dataset size of 8.3 TB is not enough to fill more than 256 GPUs, each having 40 GB of memory (NVIDIA A100 [8]). Beyond that point, node-to-node communication becomes the bottleneck for parallel efficiency.
Do not hesitate to contact us for any comments and/or questions!
[1] https://doi.org/10.1017/S0022112004001855
[2] https://doi.org/10.1017/jfm.2011.507
[3] https://proceedings.mlr.press/v27/baldi12a.html
[4] https://arxiv.org/abs/1706.02677
[5] https://pytorch.org/docs/stable/notes/ddp.html
[6] https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels
[7] https://slurm.schedmd.com
[8] https://www.nvidia.com/en-us/data-center/a100/