Abaqus 2018: Parallel Processing & GPGPU

A set of interesting benchmarks on the scalability of Abaqus/Standard 2018, including parallelization and GPU acceleration
Continuing our discussions and testing of the scalability of the Abaqus solver (Abaqus 2017 & NVIDIA Quadro GP100, Tesla K40 gives second life to ancient server, GPU Computing with Abaqus 2016), we now look at the current release, Abaqus 2018.HF2. As always, the goal of our benchmarking is to evaluate what can realistically be expected from fairly powerful workstation configurations running Abaqus.

Our series of tests involves two technologies at which Abaqus excels: parallel processing and GPU computing. All tests use a single model of around 4M degrees of freedom (dof). The analysis is a 2-step non-linear solution with pre-stress and surface-to-surface contact, simulating the insertion and extraction of a tight-fitting defense electronics RF connector in its socket, solved with Abaqus/Standard.

RF Connector Analysis

Although seemingly trivial, these types of analyses are complex and compute-intensive as friction, many local phenomena and multiple types of materials are involved.

The hardware used in this article is a Dell Precision T7910 workstation running Windows 10 Pro for Workstations Edition sporting 128GB RAM, dual liquid-cooled 6-core Xeon E5-2643 v4 @ 3.4GHz and an NVIDIA Quadro GP100 combined Graphics & Compute card. A nice engineering workstation indeed!

Dell T7910 NVIDIA Quadro GP100

Parallel Processing: cores, threads & hyperthreading

It is well known that Abaqus, be it Abaqus/Standard or Abaqus/Explicit, scales very well with the number of cores assigned to a solution. This scalability holds on a single workstation and extends to HPC environments (read our success story with Abaqus/Explicit on Rescale Cloud HPC). The common wisdom among Abaqus users is to avoid Hyperthreading at all costs for fear of performance degradation, to the point of turning it off in the BIOS. That advice doesn't necessarily hold for all CAE applications, and software vendors (Siemens, ANSYS, MSC,...) give differing recommendations. Does it still hold true for Abaqus? Does it apply to the higher-end Xeon (we know it doesn't apply to the AMD EPYC processor, which will be the subject of another in-depth article)?

We put this to the test on our Dell workstation: leaving hyperthreading on, we ran our model with different Abaqus/Standard parallelization configurations, from no parallelization (1 core) all the way to 12 cores (the maximum number of physical cores on this workstation), then 16 and 24 threads via hyperthreading. The overall scalability results can be seen in the graph below, showing wallclock time (in seconds) vs. parallelization of the Abaqus/Standard solver.
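For readers who want to reproduce a sweep like this, here is a minimal sketch that generates one Abaqus command line per configuration. It assumes the `abaqus` launcher is on the PATH and relies on the standard `job`, `input`, `cpus`, `gpus` and `interactive` execution options; the job name `rf_connector` is hypothetical, since we don't name our input file in this article.

```python
# Hypothetical job/input name; the article does not name its input file.
JOB_NAME = "rf_connector"

# Core/thread counts used in the sweep (16 and 24 rely on hyperthreading).
CPU_COUNTS = [1, 2, 4, 8, 12, 16, 24]


def build_command(job, cpus, gpus=0):
    """Return the Abaqus command line for one run of the sweep."""
    parts = ["abaqus", f"job={job}_{cpus}cpu", f"input={job}", f"cpus={cpus}"]
    if gpus:
        # Only request GPGPU acceleration when asked for.
        parts.append(f"gpus={gpus}")
    # Run in the foreground so each job finishes before the next one starts.
    parts.append("interactive")
    return " ".join(parts)


commands = [build_command(JOB_NAME, n) for n in CPU_COUNTS]
for cmd in commands:
    print(cmd)
```

Each printed line can be pasted into a command prompt or collected in a batch file. Check the execution options against the documentation for your own Abaqus release before relying on them.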

Abaqus CPU Benchmark

The initial run, on one core, took over 9h to complete, a serious candidate for solver acceleration! The same job running on two cores finished in less than 5h: 4h shaved off the solution time just by using 2 cores! Doubling the core count again (now running on 4 cores) gives us a solution time under 3h. With 8 cores, our solution time drops under 2h, while 12 cores take us to 1h30. Abaqus/Standard scales very well on our workstation, allowing us to run 3x to 4x as many simulations per day with a 4-way parallelized solve as with a single-core solve. The jump in cost from 5 tokens to 8 pays for itself in a day or two, unbeatable ROI!
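The scaling arithmetic behind these numbers can be sketched in a few lines. The times below are rounded from the bounds quoted in the text ("over 9h", "less than 5h", and so on), not the measured values from the chart, so treat the output as indicative only.

```python
# Approximate wall-clock times in hours, rounded from the bounds quoted in the
# text ("over 9h", "less than 5h", "under 3h", "under 2h", "1h30").
# These are assumptions for illustration, not the measured values.
wallclock_h = {1: 9.0, 2: 5.0, 4: 3.0, 8: 2.0, 12: 1.5}

baseline = wallclock_h[1]

# Speedup relative to the single-core run, and efficiency per core.
speedup = {n: baseline / t for n, t in wallclock_h.items()}
efficiency = {n: speedup[n] / n for n in wallclock_h}

for n in wallclock_h:
    print(f"{n:>2} cores: speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")

# Gain for each doubling of the core count (1->2, 2->4, 4->8).
doubling_gain = [wallclock_h[n] / wallclock_h[2 * n] for n in (1, 2, 4)]

# Geometric mean of the three doubling gains, which is the cube root of the
# overall 1 -> 8 core speedup.
mean_gain = (wallclock_h[1] / wallclock_h[8]) ** (1 / len(doubling_gain))
print(f"geometric mean gain per doubling: {mean_gain:.2f}x")
```

With these rounded inputs the gain per doubling comes out at roughly 1.65x, tapering from 1.8x for the first doubling to 1.5x for the third.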

Where it gets interesting is once we started wandering into the Hyperthreading zone... With our dual 6-core Xeon, Hyperthreading essentially gives us 24 logical processors to play with. We ran our model with 16 and 24 threads, fully expecting significant performance degradation as seen with other CAE applications. To our surprise, the solution time continued decreasing, scraping off 24 minutes with 16 threads, and another 4 minutes with 24 threads. The final solution, with 24 threads using hyperthreading, is 1.4x faster than with 12 cores. Hyperthreading on Xeon processors isn't quite as penalizing as thought, at least in our case.
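As a quick sanity check, the 1.4x figure follows from the numbers quoted above. This sketch assumes the 12-core run took exactly 90 minutes (the "1h30" quoted earlier is a rounded value).

```python
# Assumption: the 12-core run took 90 minutes ("1h30" in the text, rounded).
t_12_cores = 90.0

# 16 threads scraped off 24 minutes, 24 threads another 4 minutes.
t_16_threads = t_12_cores - 24
t_24_threads = t_16_threads - 4

# Gain of 24 hyperthreaded threads over 12 physical cores.
ht_gain = t_12_cores / t_24_threads

print(f"16 threads: {t_16_threads:.0f} min")
print(f"24 threads: {t_24_threads:.0f} min")
print(f"gain from 12 cores to 24 threads: {ht_gain:.2f}x")
```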

GPGPU: Abaqus/Standard acceleration using the GPU

The best way to understand the basic principle of GPGPU is to watch the MythBusters demonstration in which Adam and Jamie explain parallel processing on GPUs.

Essentially, Abaqus is able to re-route calculation instructions that are typically processed by the CPU to the GPU. The main advantage is the massively parallel nature of current GPUs. In our case, the NVIDIA Quadro GP100 offers 3584 cores to process the Abaqus instructions, while our workstation's Xeon CPUs offer a total of 12 cores (24 threads with Hyperthreading). The same set of tests presented earlier was repeated with GPU acceleration added to evaluate the benefits. A complete set of results, along with acceleration multipliers, is presented below.

Abaqus 2018 GPGPU Benchmark

Our initial solve of 9h on 1 core sees a 3x speedup with the GP100, allowing us to solve this model in less than 3h on 1 core and 1 GPU. This is barely slower (10 minutes) than our solve on 4 cores! Similarly, our 2 cores + 1 GPU run finishes in a little over 1h30, versus around 5h without the GPU! The GPGPU acceleration continues steadily, still providing a 2x speedup over a 24-thread parallelized solution. Our 9h solution time on 1 core drops to 37min using 24 threads and the GPU!
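The multipliers quoted above can be checked with a little arithmetic. Only the 37-minute figure is stated exactly; the other times are rounded from the text ("9h", "around 5h", "a little over 1h30"), so this is a sketch on approximate inputs, not a reproduction of the chart.

```python
# Times in minutes. Only the 37 min figure is quoted exactly in the text;
# the others are rounded assumptions ("9h", "around 5h", "a little over 1h30").
cpu_only = {1: 540.0, 2: 300.0}
two_cores_with_gpu = 90.0
best_with_gpu = 37.0  # 24 threads + 1 GPU

# 1 core + GPU: the quoted 3x speedup applied to the 9h baseline.
one_core_with_gpu = cpu_only[1] / 3

# 2 cores + GPU versus 2 cores alone.
gpu_gain_2_cores = cpu_only[2] / two_cores_with_gpu

# End-to-end gain: 1 core without GPU versus 24 threads with GPU.
overall_gain = cpu_only[1] / best_with_gpu

print(f"1 core + GPU: about {one_core_with_gpu / 60:.1f} h")
print(f"GPU gain on 2 cores: {gpu_gain_2_cores:.1f}x")
print(f"1 core -> 24 threads + GPU: {overall_gain:.1f}x")
```

On these rounded inputs, going from a single core to 24 threads plus the GPU is a speedup of between 14x and 15x.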

Key Takeaways

This set of experiments, conducted over several days, confirmed a lot of what we knew about Abaqus, but also brought some surprising new developments (or at least confirmed our suspicion):

  • Abaqus/Standard scales very well with the number of cores, at an average of 1.6x performance enhancement per 2x core count increase,

  • Hyperthreading doesn't seem to be penalizing, as we observed a 1.4x performance gain from 12 cores to 24 threads (12 cores with hyperthreading),

  • GPGPU brings a performance boost of up to 3x, and complements CPU parallelization very well, still delivering a 2x performance gain at high core counts.
