How to Pick the Right Server for AI? Part Two: Memory, Storage, and More
The proliferation of tools and services empowered by artificial intelligence has made the procurement of “AI servers” a priority for organizations big and small. In Part Two of GIGABYTE Technology’s Tech Guide on choosing an AI server, we look at six other vital components besides the CPU and GPU that can transform your server into a supercomputing powerhouse.
In Part One of our Tech Guide, we shared tips for choosing the right central processing units (CPUs) and graphics processing units (GPUs) for your AI server. While processing prowess is paramount, there is more to an AI computing platform than these two components. In this section, we look at how memory, storage, power supply units (PSUs), thermal management, expansion slots, and I/O ports may affect the performance of your server, and how you may choose the right ones for working with AI.
How to Pick the Right Memory for Your AI Server?
Also known as RAM, memory is used in a server to store programs and data for the processors’ immediate use. Since the most powerful AI chips can compute a lot of data very quickly, it wouldn’t do to hamstring their performance with inadequate memory. The server’s memory must always have enough throughput and capacity to support the processors.
Currently, the most advanced type of memory is DDR5 SDRAM, which is the fifth generation of Double Data Rate Synchronous Dynamic Random-Access Memory; we’ll call it DDR5 for short. It offers higher data transfer rates, higher bandwidth, lower voltage requirements, and more capacity than previous generations, making it the memory component of choice for topline AI servers.
Obviously, one RAM stick (more correctly called a DIMM) will not be enough. Make sure that your AI server has enough DIMM slots to satisfy the requirements of your workload. For example, GIGABYTE’s G493-ZB3, a G-Series GPU Server designed for AI training and inference, has an impressive forty-eight DIMM slots. The DIMMs themselves also come in variants designed to optimize speed, stability, and capacity, such as RDIMMs (registered DIMMs) and LRDIMMs (load-reduced DIMMs).
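As a rough sizing aid, the sketch below estimates total memory capacity and theoretical peak bandwidth for a fully populated server. Only the forty-eight DIMM slots come from the example above; the 64 GB DIMM size, the DDR5-4800 speed, and the 24 memory channels are assumptions chosen for illustration, and real-world bandwidth will be lower than the theoretical peak.

```python
def total_capacity_gb(dimm_slots: int, dimm_size_gb: int) -> int:
    """Total memory when every slot holds a DIMM of the same size."""
    return dimm_slots * dimm_size_gb


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth: a DDR5 channel moves 8 bytes of data per transfer."""
    # MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# 48 slots is from the text; 64 GB DIMMs are an assumed configuration.
capacity = total_capacity_gb(dimm_slots=48, dimm_size_gb=64)

# Assumed: 24 channels in total, running at DDR5-4800.
bandwidth = peak_bandwidth_gb_s(channels=24, mega_transfers_per_s=4800)

print(f"Capacity: {capacity} GB, peak bandwidth: {bandwidth:.1f} GB/s")
```

Comparing these two numbers against the size of your models and datasets is a quick way to check whether a given slot count is sufficient.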
Last but not least, the server processors may have ways to streamline memory use. GIGABYTE’s AI training powerhouse, the G593-SD0, supports the Intel® Xeon® CPU Max Series of processors, which feature High Bandwidth Memory (HBM) for improved memory use in HPC and AI workloads. AMD’s XDNA™ architecture, found in its AI Engine-based accelerators, boasts an adaptive dataflow architecture that allows data to pass through the layers of an AI model without relying on external memory.
How to Pick the Right Storage for Your AI Server?
While memory stores data for immediate use, storage retains all of the server’s data permanently, until the user deletes it. The three criteria for you to consider are speed (i.e., data transfer rates and bandwidth), storage capacity, and whether the device is compatible with the “third pillar of modern data centers” (in addition to the CPU and GPU), the DPU.
There are a lot of acronyms to remember, so stay with us. First of all, solid-state drives (SSDs) have long outpaced hard disk drives (HDDs) in speed and should certainly be used in your AI server. There are three types of storage interfaces: SATA, SAS, and NVMe. SATA is the most established technology and was initially designed to be used with HDDs. SAS is faster than SATA, but the champion is NVMe, which can only be used with SSDs. NVMe connects storage devices to the processors over PCIe, which results in higher bandwidth, faster data transfer rates, and lower latency. Hence, SSDs using the latest PCIe Gen5 NVMe interface are the number one choice for storage devices in AI servers.
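To put the ranking of SATA, SAS, and NVMe into numbers, here is a minimal sketch that derives the approximate interface ceiling of each link from its signaling rate and line encoding, then estimates how long a sequential read would take. The specific generations listed (SATA III, SAS-3, PCIe Gen4 and Gen5 x4) and the 1 TB dataset are assumptions for illustration; actual drives deliver less than the interface ceiling.

```python
def link_mb_per_s(gigabits_per_s: float, payload_bits: int,
                  coded_bits: int, lanes: int = 1) -> float:
    """Usable MB/s of a serial link after line-encoding overhead."""
    return gigabits_per_s * 1000 / 8 * payload_bits / coded_bits * lanes


# SATA and SAS-3 use 8b/10b encoding; PCIe Gen3 and later use 128b/130b.
INTERFACES = {
    "SATA III (6 Gb/s)": link_mb_per_s(6, 8, 10),
    "SAS-3 (12 Gb/s)": link_mb_per_s(12, 8, 10),
    "NVMe, PCIe Gen4 x4": link_mb_per_s(16, 128, 130, lanes=4),
    "NVMe, PCIe Gen5 x4": link_mb_per_s(32, 128, 130, lanes=4),
}


def seconds_to_read(dataset_gb: float, mb_per_s: float) -> float:
    """Best-case time to stream a dataset at the interface ceiling."""
    return dataset_gb * 1000 / mb_per_s


DATASET_GB = 1000  # hypothetical 1 TB training set
for name, speed in INTERFACES.items():
    print(f"{name:20s} {speed:9.0f} MB/s  "
          f"{seconds_to_read(DATASET_GB, speed):7.0f} s")
```

Under these assumptions, the same dataset that takes close to half an hour to stream over SATA takes roughly a minute over a Gen5 NVMe link.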
The next attribute to consider is capacity. Broadly speaking, an NVMe SSD either adopts the smaller M.2 form factor or the more ubiquitous enterprise-grade 2.5-inch form factor. GIGABYTE’s comprehensive line of AI Servers primarily utilizes 2.5” storage bays due to their larger capacity and hot-swappable design, which allows drives to be conveniently removed or replaced without having to power down the server. Additional M.2 slots are also available on many server models.
Last but not least, some of GIGABYTE’s AI Servers, such as the H223-V10 H-Series High Density Server powered by the NVIDIA Grace Hopper™ Superchip, can support supplementary 2.5" Gen5 NVMe hot-swappable storage bays by adding NVIDIA BlueField-3 DPUs to the expansion slots. This is an exciting new feature that should be taken into account when you are comparing options for your AI server’s storage bays.
While memory and storage serve different functions, there are comparable rules of thumb when it comes to choosing the right ones for your AI supercomputing platform.
How to Pick the Right Power Supply Unit for Your AI Server?
The server’s power supply unit (PSU) provides a safe, stable source of electricity for the server to run on. Because AI workloads tend to be compute-intensive, it is imperative to choose a configuration of PSUs that offers exceptional power efficiency and redundancy.
The best way to check the PSU’s power efficiency is through the certification program known as 80 PLUS. This program separates PSUs into six different levels based on energy efficiency, with 80 PLUS Titanium being the most efficient. At this level, conversion efficiency (that is, the share of input energy that is converted into useful output) ranges from 90% to 96%, depending on the load. The second highest level is 80 PLUS Platinum, where conversion efficiency ranges from 89% to 94%. GIGABYTE’s AI Servers predominantly use 80 PLUS Titanium-certified PSUs.
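To see what a few percentage points of conversion efficiency mean in practice, the following sketch computes the power drawn from the wall and the heat lost in conversion for a given server load. The 96% and 94% figures are the upper bounds quoted above; the 3,000 W load is a hypothetical value chosen for illustration.

```python
def input_power_w(dc_load_w: float, efficiency: float) -> float:
    """Wall power needed to deliver dc_load_w at the given efficiency (0 to 1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return dc_load_w / efficiency


def waste_heat_w(dc_load_w: float, efficiency: float) -> float:
    """Power lost as heat inside the PSU during conversion."""
    return input_power_w(dc_load_w, efficiency) - dc_load_w


LOAD_W = 3000  # hypothetical server load
for label, eff in [("Titanium, best case", 0.96), ("Platinum, best case", 0.94)]:
    print(f"{label}: {input_power_w(LOAD_W, eff):.0f} W from the wall, "
          f"{waste_heat_w(LOAD_W, eff):.0f} W lost as heat")
```

With these assumed numbers the Titanium unit draws about 3,125 W and the Platinum unit about 3,191 W for the same load, a difference that adds up across a rack running around the clock.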
Another thing to remember is that redundancy is essential. The server should remain operational even if one or more of the PSUs go down. GIGABYTE’s AI Servers are designed with the appropriate number of redundant power supplies. Some servers can continue normal operations even if half of their PSUs go offline.
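A simple way to reason about redundancy is to check whether the remaining PSUs can still carry the server's peak load after some of them fail. The sketch below does exactly that; the PSU count, the 3,000 W rating, and the 5,500 W peak load are hypothetical values, not the specification of any particular GIGABYTE model.

```python
import math


def survives_failures(psu_count: int, psu_rating_w: float,
                      peak_load_w: float, failures: int) -> bool:
    """True if the PSUs left after `failures` can still supply the peak load."""
    remaining = psu_count - failures
    return remaining > 0 and remaining * psu_rating_w >= peak_load_w


def max_tolerated_failures(psu_count: int, psu_rating_w: float,
                           peak_load_w: float) -> int:
    """Largest number of PSUs that can go offline without dropping the load."""
    needed = math.ceil(peak_load_w / psu_rating_w)
    return max(psu_count - needed, 0)


# Hypothetical configuration: four 3,000 W PSUs feeding a 5,500 W peak load.
print(max_tolerated_failures(4, 3000, 5500))        # two PSUs are needed, two are spare
print(survives_failures(4, 3000, 5500, failures=2))
print(survives_failures(4, 3000, 5500, failures=3))
```

In this assumed configuration the server keeps running with half of its PSUs offline, which corresponds to the 2+2 style of redundancy described above.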
How to Pick the Right Thermal Management for Your AI Server?
It goes without saying that all the components inside the server produce a lot of heat. Choosing the correct thermal management or heat dissipation tools is important if you want to get the best performance out of your server without hiking up your electricity bill.
The traditional method of keeping the server cool is air cooling. In other words, fans are installed in the server to pump the hot air out into the data center’s aisles. All of GIGABYTE’s AI Servers adopt a proprietary airflow-friendly hardware design. The direction of the airflow in the chassis has been evaluated with simulation software to optimize ventilation. High-performance fans and heat sinks are installed to further enhance heat dissipation. An automatic fan speed control program monitors the temperature at critical points in the chassis and adjusts the speed of the corresponding fan(s) accordingly. Fan speed profiles can also be manually tweaked to achieve the right balance between thermal management and power efficiency.
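The idea behind automatic fan speed control and adjustable fan speed profiles can be sketched as a lookup on a temperature-to-speed curve. This is an illustrative model only, not GIGABYTE's actual control program: the profile points and the function name `fan_duty` are hypothetical, and the code simply interpolates linearly between the points.

```python
from bisect import bisect_right

# Hypothetical profile: (temperature in degrees Celsius, fan duty cycle in percent).
BALANCED_PROFILE = [(30, 30), (50, 45), (70, 75), (85, 100)]


def fan_duty(temp_c: float, profile=BALANCED_PROFILE) -> float:
    """Return the fan duty cycle for a sensor reading by linear interpolation."""
    temps = [t for t, _ in profile]
    if temp_c <= temps[0]:
        return float(profile[0][1])   # below the curve: hold the minimum speed
    if temp_c >= temps[-1]:
        return float(profile[-1][1])  # above the curve: run at full speed
    i = bisect_right(temps, temp_c)   # first profile point hotter than temp_c
    (t0, d0), (t1, d1) = profile[i - 1], profile[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


for reading in (25, 50, 60, 90):
    print(f"{reading} C -> {fan_duty(reading):.0f}% duty")
```

Tweaking a profile then amounts to moving the points: a flatter curve saves fan power and noise, while a steeper one favors thermal headroom.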
Certain AI servers, like GIGABYTE’s G363-SR0, which is a GPU Server that is integrated with the NVIDIA HGX™ H100 4-GPU module, also support liquid cooling. This is an innovative new method of thermal management that pumps liquid coolant through cold loops that coil around key components in the server and absorb the heat. Liquid cooling can unleash the full potential of processors while improving the data center’s overall power usage effectiveness (PUE).
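PUE, or power usage effectiveness, is the ratio of a facility's total power draw to the power consumed by the IT equipment alone, so a value closer to 1.0 is better. The sketch below shows how a reduction in cooling power moves the number; all of the kilowatt figures are hypothetical and are not measurements of any cooling product.

```python
def pue(it_power_kw: float, cooling_kw: float,
        other_overhead_kw: float = 0.0) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_power_kw <= 0:
        raise ValueError("IT power must be positive")
    total_kw = it_power_kw + cooling_kw + other_overhead_kw
    return total_kw / it_power_kw


# Hypothetical data center with 1,000 kW of IT load and 100 kW of other overhead.
air_cooled = pue(1000, cooling_kw=500, other_overhead_kw=100)
liquid_cooled = pue(1000, cooling_kw=200, other_overhead_kw=100)

print(f"Air cooling:    PUE {air_cooled:.2f}")
print(f"Liquid cooling: PUE {liquid_cooled:.2f}")
```

Because the IT load sits in the denominator, every kilowatt saved on cooling lowers PUE directly, which is why more efficient heat removal shows up so clearly in this metric.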
The pinnacle of liquid cooling is immersion cooling, which submerges the server directly in a bath of nonconductive, dielectric fluid. GIGABYTE provides both single-phase and two-phase immersion cooling solutions. For example, the A1P0-EB0 is a one-stop immersion cooling solution designed for standard 19-inch EIA servers, while the A1O3-CC0 is designed for OCP servers. GIGABYTE’s AI Servers may be modified to work with these advanced cooling methods, which let processors run at their full thermal design power (TDP) while further improving overall PUE.
Here are some simple guidelines on how to choose power supply units, thermal management, expansion slots, and I/O ports for your AI server.
How to Pick the Right Expansion Slots for Your AI Server?
Since scalability—the wiggle room to expand your computing tool kit when the need arises—is important, you shouldn’t forget to pay attention to your AI server’s expansion slots. There are no wrong choices, but it’s helpful to keep a couple of pointers in mind.
First, look for PCIe Gen5 slots; the more the merrier. A PCIe Gen5 x16 slot offers up to 128 GB/s of bidirectional bandwidth at a data transfer rate of 32 GT/s per lane; both figures are double those of the previous generation. These slots will allow you to add graphics cards, RAID cards, and even the aforementioned DPUs, which can handle data transfers, data compression, data storage, data security, and data analytics for the CPU, further improving the server’s performance.
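The 128 GB/s figure corresponds to a x16 slot counting both directions, before encoding overhead. The sketch below derives the usable bandwidth from the per-lane transfer rate and the 128b/130b line encoding used since PCIe Gen3; the function name `pcie_gb_per_s` is a hypothetical helper for illustration.

```python
# Per-lane transfer rates in GT/s for recent PCIe generations.
TRANSFER_RATE_GT_S = {3: 8, 4: 16, 5: 32}


def pcie_gb_per_s(gen: int, lanes: int = 16, bidirectional: bool = False) -> float:
    """Approximate usable PCIe bandwidth in GB/s after 128b/130b encoding."""
    raw = TRANSFER_RATE_GT_S[gen] * lanes / 8  # one bit per transfer, 8 bits per byte
    usable = raw * 128 / 130                   # 2 of every 130 bits are encoding overhead
    return usable * (2 if bidirectional else 1)


for gen in (4, 5):
    print(f"Gen{gen} x16: {pcie_gb_per_s(gen):.1f} GB/s per direction, "
          f"{pcie_gb_per_s(gen, bidirectional=True):.1f} GB/s both directions")
```

The same function can be called with `lanes=4` or `lanes=8` to estimate what a narrower slot provides, which matters when you plan which cards go into which slots.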
In addition to the bus standard, there’s also the physical space that’s available in the chassis. You’ll see acronyms like FHFL (full-height, full-length) and HHHL (half-height, half-length), which is the same as LP (low-profile). These descriptions denote the dimensions of the cards that the expansion slots are designed to work with. It’s worth noting that while a smaller card may fit in a slot designed for something larger, the opposite is obviously impossible. Therefore, you must make a choice between the versatility of the slots and the compute density you want to achieve. OCP mezzanine slots, which are necessary for OCP networking and storage cards, should also be available in your AI server if there’s a chance that you’ll work with these add-ons.
How to Pick the Right I/O Ports for Your AI Server?
The last thing to contemplate in your AI server is how it will connect to external devices, such as switches, displays, and other servers. As always, the guiding principle is to try to have more of the most advanced tech. Aim for LAN ports that support 1Gb/s or even 10Gb/s transfer rates, USB 3.0 or higher (such as USB 3.2), the works.
Also check whether your server has dedicated management ports, also known as MLAN. These provide secure access to the server’s baseboard management controller (BMC), which can come in handy if you want a more convenient way to manage your server. Once everything is in place, you will have a supercomputing platform that is ideally suited to your AI workload.
Thank you for reading GIGABYTE’s Tech Guide on “How to Pick the Right Server for AI? Part Two: Memory, Storage, and More”. We hope this article has been helpful and informative. For further consultation on how you can benefit from AI in your business strategy, academic research, or public policy, we welcome you to reach out to our sales representatives at email@example.com.