
GIGABYTE Powering the Next Generation of HPC Talent at ISC

Jul 02, 2019

GIGABYTE's booth at ISC 2019

Apart from the vendor exhibition and conference, one of the highlights of ISC every year is the SCC (Student Cluster Competition). Now in its eighth year, the SCC is a real-time computing performance contest held over the three days of ISC. It features fourteen student teams from universities around the world and focuses on advancing STEM disciplines and developing HPC skills. Each team must build an HPC cluster of its own design and then run a series of standard HPC benchmarks and applications while adhering to a strict power constraint: total power consumption must stay under 3000W.

GIGABYTE was honored to have our computing systems chosen this year by four different student teams to build their HPC clusters. Even more interesting, these four teams built their clusters on four different CPU platforms, illustrating the diversity of system architectures that GIGABYTE can provide. The four teams were:

1. Tartu Team (The University of Tartu, Estonia)
The Tartu Team built their HPC cluster with AMD EPYC platforms, using 4 x R281-Z94 and 2 x G291-Z20 GIGABYTE servers.
2. HPC Team RACKlette (ETH Zürich, Switzerland)
Team RACKlette built their HPC cluster on an Intel Xeon Scalable platform with 4 x G291-280 GIGABYTE servers. RACKlette achieved a great result, taking the top score in the LINPACK benchmark and placing 3rd overall in the competition.
3. National Cheng Kung University Team (Taiwan)
The team with one of the most unorthodox clusters built their server platforms themselves using GIGABYTE X299 motherboards, complete with RGB lighting. Perhaps that's why they won the fan favorite award this year!
4. UPC Les Maduixes (Universitat Politècnica de Catalunya, Catalonia, Spain)
Les Maduixes chose to build their cluster on a Marvell ThunderX2 platform with 8 x R281-T94 servers, illustrating that an Arm CPU architecture is also a viable choice for an HPC cluster.

National Cheng Kung University team working hard

UPC Les Maduixes with their GIGABYTE / Marvell ThunderX2 Arm HPC cluster

Interview with Team RACKlette (ETH Zürich, Switzerland)

1. Can you let us know in more detail the hardware and software stack you used for the cluster?
Inside each of our 4 x GIGABYTE G291-280 servers we ran dual Intel Xeon Platinum 8180 CPUs, making 8 CPUs in total. Two of our nodes were additionally equipped with 4 x NVIDIA Tesla V100 GPU accelerators each for optimal performance on GPU-accelerated applications, giving a total of 8 GPUs in the system. All nodes were connected with a 100Gbit/s InfiniBand EDR interconnect from Mellanox using an InfiniBand network switch. An additional Ethernet network was used for deployment, running jobs and monitoring. To save power, we kept our storage setup to a minimum: boot SSDs plus a single shared SSD storage drive on our head node. The whole system was assembled and set up with the help of our advisors from CSCS (Swiss National Supercomputing Centre) and our system integrator and sponsor Dalco from Switzerland.

ISC19 Meet The Teams interview: ETH

On top of this hardware stack we deployed the Bright Cluster Management system with CentOS 7 and configured one of our CPU-only nodes as the head node, while the other three nodes served as compute nodes. The Bright Cluster Management system comes with an extensive toolset for deployment, maintenance, job scheduling, monitoring, backups and much more. This allowed us to establish an efficient workflow, monitor our hardware accurately and tune for optimal performance at very little cost in power and performance. To further increase our productivity we made extensive use of HPC-oriented package managers such as Spack. This allowed us to easily deploy dependencies in many different versions and with various levels of optimization enabled, so that we could tune for power efficiency and performance while maintaining accuracy for all computations.

2. What were the reasons why you chose this hardware (and software) stack design?
We were trying to design a system that would be competitive across a wide range of applications. After some rough initial testing, and considering the 3000W power limit, we settled on a four-node system at an early stage. To achieve top performance we opted for Intel CPUs paired with NVIDIA accelerators, which we expected to put us in a very competitive position. To balance the system across all applications, and based on power measurements taken during benchmarking, we decided on a final count of 8 CPUs and 8 GPUs. This hardware setup allowed us to use our full 3000W power budget on GPU-based applications as well as on CPU-only applications.
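The balancing argument above can be illustrated with a rough power-budget estimate. The sketch below is not the team's actual model: apart from the 3000W cap and the 8 CPU / 8 GPU / 4 node counts taken from the interview, every wattage is an assumed, illustrative figure, and the `Component` type and `estimate_draw` helper are hypothetical names. It shows how a GPU-heavy run and a CPU-heavy run can each come close to the cap, while running every component flat out at once would exceed it.

```python
from dataclasses import dataclass

POWER_CAP_W = 3000.0  # competition limit from the interview


@dataclass(frozen=True)
class Component:
    name: str
    count: int
    busy_w: float  # assumed per-unit draw under full load (illustrative only)
    idle_w: float  # assumed per-unit draw when idle (illustrative only)


# Counts match the ETH cluster; all wattages are hypothetical placeholders.
COMPONENTS = [
    Component("cpu", 8, busy_w=205.0, idle_w=40.0),
    Component("gpu", 8, busy_w=250.0, idle_w=35.0),
    Component("node_base", 4, busy_w=120.0, idle_w=120.0),  # fans, memory, NICs, SSDs
]


def estimate_draw(busy: set) -> float:
    """Sum estimated system power when the named component types are fully loaded."""
    return sum(c.count * (c.busy_w if c.name in busy else c.idle_w) for c in COMPONENTS)


for label, busy in [("GPU run", {"gpu"}), ("CPU run", {"cpu"}), ("everything", {"cpu", "gpu"})]:
    watts = estimate_draw(busy)
    status = "within cap" if watts <= POWER_CAP_W else "OVER cap"
    print(f"{label:>10}: {watts:6.0f} W ({status})")
```

With these placeholder numbers the GPU run comes to 2800 W, the CPU run to 2400 W and everything at full load to 4120 W. A real budget would be built from measured power during benchmarking, as the team describes, rather than from assumed per-component figures.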

ETH team photo

3. What were some of the challenges you met during the competition?
Our team prepared well and we had all the applications up and running at the beginning of the competition. However, we still ran into a couple of hurdles during the competition itself, where we had to deal with new input sets. We faced some issues with the Swift and OpenFOAM simulation software: the MPI implementations we used threw various error messages, and initially we were not able to run these applications on the competition day. After intensive debugging and impatiently waiting for recompilation with fresh and adapted dependencies, we were able to run and submit a decent result for every application.

Another challenge at the competition was the 3000W limit and the penalties we would receive if we crossed it. Even though we had trained and optimized our hardware and software for this upper bound, we ran into complications on applications to which we had added last-minute performance optimizations. These optimizations changed the power consumption behavior of the applications, and we experienced an unexpected spike over our 3000W limit during the first day of the competition. However, we quickly learned from this and adapted our job scheduling to account for potential spikes after such last-minute optimizations.
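One way to express such an adapted scheduling rule is to admit a job only if the current draw plus the job's measured peak, inflated by a safety margin, stays under the cap, and to use a larger margin for jobs whose code changed since their power was last measured. This is a minimal sketch under stated assumptions: the `Job` fields, the `can_start` helper and the 5% / 15% margins are hypothetical, not the team's actual scheduler configuration.

```python
from dataclasses import dataclass

POWER_CAP_W = 3000.0  # competition limit


@dataclass
class Job:
    name: str
    measured_peak_w: float   # peak power seen in earlier test runs
    recently_modified: bool  # True if optimizations changed after that measurement


def safety_margin(job: Job) -> float:
    # Hypothetical margins: a modest buffer normally, a larger one for jobs whose
    # power behavior may have shifted due to last-minute optimizations.
    return 0.15 if job.recently_modified else 0.05


def can_start(job: Job, current_draw_w: float, cap_w: float = POWER_CAP_W) -> bool:
    """Admit the job only if its padded peak fits in the remaining power headroom."""
    projected_w = current_draw_w + job.measured_peak_w * (1.0 + safety_margin(job))
    return projected_w <= cap_w


tested = Job("hpl", measured_peak_w=2500.0, recently_modified=False)
tweaked = Job("hpl", measured_peak_w=2500.0, recently_modified=True)
print(can_start(tested, current_draw_w=300.0))   # about 300 + 2625 = 2925 W -> True
print(can_start(tweaked, current_draw_w=300.0))  # about 300 + 2875 = 3175 W -> False
```

The same measured peak can be admitted or held back depending on whether it is still trustworthy, which mirrors the lesson the team drew from its first-day spike.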

4. Congratulations for achieving the top score of the LINPACK benchmark! What were some of the key factors that allowed you to reach the best result?

Our actual intent was to build a balanced system that performs well on all the applications; we did not want a single-purpose cluster built for LINPACK alone. However, as LINPACK is a very important benchmark, we were nonetheless determined to achieve a competitive result using only our 8 NVIDIA Tesla V100 GPUs. The GPU-friendly layout of the GIGABYTE servers allowed us to do extensive testing before the competition to determine the optimal GPU layout for the highest LINPACK performance. We started with four nodes with two GPUs each (4x2), also tested a single node with all eight GPUs (1x8), and ended up with two nodes with four GPUs each (2x4). Besides extensive testing to find the optimal input parameters for the LINPACK benchmark, we turned every knob in the system: on one hand saving power on components not needed for the run itself, and on the other hand pushing the crucial parts (like the GPUs) as high as possible, just below the power cap. So we spent a lot of time looking for optimal CPU frequencies, fan speeds, GPU frequencies, sleep modes, power states of other system components and much more. In the end our hard work paid off, and we achieved the highest LINPACK score with only 8 GPUs, competing against teams with much more compute power in their systems. Besides our rigorous testing and tuning, a key to this achievement was our choice to bring a smaller cluster of 4 nodes, with compute power densely packed into those nodes. This kept our idle consumption low and allowed us to run the GPU nodes at a higher performance level.
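Among the LINPACK input parameters the team mentions tuning, HPL reads the problem size N, the block size NB and the P x Q process grid from its input file. A common rule of thumb sizes N so that the N x N matrix of 8-byte doubles fills most of the available memory, rounded down to a multiple of NB, and prefers near-square grids with P <= Q. The sketch below applies that heuristic; the 80% memory fraction, the NB value and the 8 x 32 GiB memory figure are assumptions for illustration, since the interview does not state the team's values. Note that the P x Q grid counts MPI ranks (here assumed to be one per GPU) and is separate from the 4x2 / 2x4 / 1x8 node layouts described in the answer.

```python
import math


def hpl_problem_size(mem_bytes: int, nb: int, fraction: float = 0.8) -> int:
    """Largest N, a multiple of nb, whose N x N double-precision matrix
    fits in `fraction` of mem_bytes (a rule of thumb, not a guarantee)."""
    n = int(math.sqrt(fraction * mem_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb


def process_grids(ranks: int) -> list:
    """All P x Q factorizations of the rank count with P <= Q."""
    return [(p, ranks // p) for p in range(1, math.isqrt(ranks) + 1) if ranks % p == 0]


# Hypothetical example: 8 GPUs, one MPI rank each, 32 GiB usable per rank.
mem = 8 * 32 * 2**30
print(hpl_problem_size(mem, nb=384))  # 165504
print(process_grids(8))               # [(1, 8), (2, 4)]
```

These values are only starting points for a search; the best N, NB and grid still have to be found by measurement, which is the kind of testing the team describes.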

ETH HPC cluster using 4 x GIGABYTE G291-280

5. What were some things you learned or how will you change your strategy to achieve a better result at SC19?
We have already started discussing our potential hardware setup for SC19. Our goal will again be to build a balanced system that performs well on all the applications we need to run. We learned to use many tools during our preparation for ISC19, such as the Bright Cluster Management system and the package manager Spack, and we will continue to use them. To achieve better results at SC19 we will try to further improve our monitoring setup. Our team did not have much time to work with advanced power consumption measurement on the cluster in the configuration used at the competition, so we have made it a priority to train on a comparable setup with more accurate power measurements. We will also try to add more automated testing to our workflow, to squeeze the last bit of performance out of our cluster and to try out more configurations than would be possible manually. Finally, our experience with HPC software has grown a lot during our preparation for ISC19, and we will try to be more adventurous with optimizations, for example by looking into hand-tuned source code optimizations, which we have so far avoided for stability reasons.
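The automated testing the team plans could take the form of an exhaustive sweep over tuning knobs such as CPU and GPU frequencies, keeping the best-scoring configuration whose measured peak power stays under the cap. The sketch below assumes a hypothetical `run` callback that applies a configuration, runs a benchmark and returns its score and peak power; the knob names and the toy model used in the example are illustrative, not real measurements.

```python
import itertools
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Result(NamedTuple):
    score: float         # benchmark figure of merit, higher is better
    peak_power_w: float  # peak system power measured during the run


def sweep(
    run: Callable[[Dict], Result],
    space: Dict[str, List],
    cap_w: float = 3000.0,
) -> Tuple[Optional[Dict], Optional[Result]]:
    """Try every combination of knob values; keep the best run under the power cap."""
    best_cfg, best = None, None
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        res = run(cfg)
        if res.peak_power_w > cap_w:
            continue  # would exceed the limit, so it is not a usable configuration
        if best is None or res.score > best.score:
            best_cfg, best = cfg, res
    return best_cfg, best


# Toy stand-in for a real benchmark run (purely illustrative numbers).
def fake_run(cfg: Dict) -> Result:
    power = cfg["gpu_mhz"] * 8 / 5 + cfg["cpu_ghz"] * 400
    score = cfg["gpu_mhz"] + 10 * cfg["cpu_ghz"]
    return Result(score, power)


print(sweep(fake_run, {"cpu_ghz": [1.0, 2.0], "gpu_mhz": [1200, 1500]}))
```

An exhaustive grid grows multiplicatively with every knob added, so in practice one would restrict the ranges or refine a coarse search around the best candidates.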

Interview with Team Tartu (University of Tartu, Estonia)

1. Can you let us know in more detail the hardware and software stack you used for the cluster?
At first, we had 4 x CPU nodes and 2 x GPU nodes with 4 NVIDIA V100 GPUs each, but later we took the 4 x V100 GPUs out of one GPU node and added them to the other. Here is the final configuration:

4 x GIGABYTE R281-Z94 CPU nodes
Each node featured: 2 x AMD EPYC 7601 @ 2.2GHz (32 cores) / 128 GB RAM / 1 x SATA SSD 240GB (2 nodes) or 1 x NVMe M.2 460GB SSD (other 2 nodes) / EDR InfiniBand
1 x GIGABYTE G291-Z20 CPU node
Featuring: 1 x AMD EPYC 7601 @ 2.2GHz (32 cores) / 256 GB RAM / 1 x NVMe M.2 460GB SSD / EDR InfiniBand
1 x GIGABYTE G291-Z20 GPU node
Featuring: 1 x AMD EPYC 7601 @ 2.2GHz (32 cores) / 256 GB RAM / 1 x NVMe M.2 460GB SSD / EDR InfiniBand / 8 x NVIDIA Tesla V100 with 32GB of VRAM
Switch:
IB-2 SB7800, 36 ports, EDR
Software:
OS: CentOS 7.6 / MPI: Open MPI 3.1.0, Open MPI 3.1.2 / CUDA: 10
We also used NVIDIA's HPL and HPCG binaries.
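Adding up the final configuration gives the cluster's aggregate resources. The short sketch below does that arithmetic, assuming the listed "32 cores" is per EPYC 7601 socket (which matches that processor) and counting only the six server nodes; the `Node` type is a hypothetical helper for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    count: int
    sockets: int
    cores_per_socket: int
    ram_gb: int
    gpus: int = 0


# Final Tartu configuration as listed in the interview.
NODES = [
    Node(count=4, sockets=2, cores_per_socket=32, ram_gb=128),          # R281-Z94
    Node(count=1, sockets=1, cores_per_socket=32, ram_gb=256),          # G291-Z20, CPU only
    Node(count=1, sockets=1, cores_per_socket=32, ram_gb=256, gpus=8),  # G291-Z20 with V100s
]

total_cores = sum(n.count * n.sockets * n.cores_per_socket for n in NODES)
total_ram_gb = sum(n.count * n.ram_gb for n in NODES)
total_gpus = sum(n.count * n.gpus for n in NODES)
print(total_cores, total_ram_gb, total_gpus)  # 320 1024 8
```

Under these assumptions the cluster totals 320 CPU cores, 1024 GB of RAM and 8 GPUs, with all of the GPUs concentrated in a single node.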

ISC19 Meet The Teams interview: Tartu

2. What were the reasons why you chose this hardware (and software) stack design?
AMD was our first choice for CPUs because AMD processors offer high memory bandwidth, which benefits the many applications that move large chunks of data. We hoped this would give us an edge over the other competing teams. Our team also wanted to challenge ourselves with optimization tasks for AMD CPUs and gain more knowledge and experience in the parallel computing world. AMD also offers open source libraries that are interesting to look at.

Furthermore, we are interested not only in computing performance but also in the CPU market. In our opinion, AMD has a strong advantage since its processors are much cheaper than competing processors with comparable performance. In other words, AMD can offer the computational power of supercomputers at lower prices, which can reduce a computing center's expenditure.

Since coming to the competition with AMD was already an adventure, NVIDIA GPUs were a safe choice for us on this journey. NVIDIA has a substantial community from which we could receive the support we needed, and many available resources (libraries/compilers) for nearly all major applications. As we did not know which features of the simulation software would be required at the competition, we judged that any given feature was more likely to be supported on NVIDIA GPUs (CUDA, OpenACC, OpenCL) than on the alternatives.

University of Tartu team photo

3. What were some challenges you met during the competition?
In a competition, unexpected problems can come up. Reacting to them and adapting quickly can be quite a challenge, but that is part of the fun of the competition! Dealing with these problems with limited time and resources while accommodating all tasks can be quite tricky, especially when trying to balance different tasks to achieve a strong position on the leaderboard. For some of these challenges, more preparation before the competition would have allowed us to use our time better during the event. This was especially true for managing power consumption while sharing the system. The part of our team working on the AI workloads was always hungry to soak up all unused resources for training, and hoped to dedicate a few GPUs to this while still giving the rest of the team the capacity to do their part of the work. Checkpointing for this can be quite costly, as we learned.
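The checkpointing trade-off mentioned here can be reasoned about with Young's approximation, a standard rule of thumb that the team does not report using: if writing a checkpoint costs C seconds and a training job is interrupted on average every M seconds (for example, when shared GPUs are reclaimed for other tasks), checkpointing roughly every sqrt(2CM) seconds balances time spent writing checkpoints against work lost on interruption. The cost and interruption values below are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mean_time_between_interrupts_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_interrupts_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float,
                      mean_time_between_interrupts_s: float) -> float:
    """First-order fraction of time lost: writing checkpoints (C / T)
    plus, on average, half an interval of recomputation per interrupt (T / 2M)."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mean_time_between_interrupts_s)


# Hypothetical numbers: a 60 s checkpoint, with GPUs reclaimed about once an hour.
C, M = 60.0, 3600.0
T = young_interval(C, M)
print(round(T))                                   # 657
print(round(expected_overhead(T, C, M), 3))       # 0.183
print(round(expected_overhead(120.0, C, M), 3))   # checkpointing every 2 minutes: 0.517
```

With an expensive checkpoint, checkpointing too often quickly dominates the lost time, which is consistent with the team's observation that checkpointing proved costly.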

4. What were some highlights / achievements for your team in competition?
We managed to assemble the cluster quickly and were able to compile and run every application, which gave us great confidence and time to tune for performance. The teamwork required for this event is simply amazing, and it taught us a very positive way of collaborative problem solving. A great thing about the competition in general is getting to meet fellow students interested in similar areas but with diverse backgrounds. In addition, we had opportunities to build connections with industry. This is not only interesting now, but might be very beneficial in the future as well.

University of Tartu HPC cluster using 4 x GIGABYTE R281-Z94 & 2 x GIGABYTE G291-Z20

5. What were some things you learned or how will you change your strategy to achieve a better result at SC19?
After each competition you gain a bit more knowledge about which problems you might encounter, so you can better prepare for these situations next time. It is best to have a known working solution and several backup plans for potential problems.

Preparation is key. We have great hardware, but evaluating some alterations and finding the best configuration for each benchmark before the next competition will help us bring a better design and setup, so that we can spend our time at the competition tuning for performance. Last-minute changes often hurt performance a lot, so we should have known working solutions ready.

Of course, knowing the tools of the trade is really important. This competition taught us quite a bit about TensorFlow, which was not in our direct focus before. We will be able to improve, since we now have an increasingly complete picture of what we are dealing with in all parts of the competition, from setup to hardware to software and applications.

Conclusion

Having our hardware platforms used by four different teams at this year's ISC Student Cluster Competition (SCC) illustrates that GIGABYTE is well on the way to becoming one of the premier server hardware brands for HPC, offering systems with dense CPU and GPU configurations that deliver maximum performance on a variety of platforms: x86 (Intel, AMD) and Arm (ThunderX2). Our sponsorship of these teams also demonstrates GIGABYTE's deepening commitment to nurturing the next generation of HPC talent.

Congratulations to all 14 teams at this year's ISC competition for your hard work and to the overall winner - Team CHPC (Center for High Performance Computing, South Africa). We are looking forward to seeing you all again soon at SC19 in Denver!