
Micron & GIGABYTE CXL Workload Demo

Deep dive into CXL technology and its advantages for modern AI/HPC applications
Introduction
As technology advances, high-performance computing (HPC) and artificial intelligence (AI)-based services are increasingly integrated into everyday life. When discussing the enhancement of computational performance, attention is frequently directed toward CPUs and GPUs due to their significant processing capabilities. However, memory is equally essential to overall system performance, despite often receiving less recognition for its critical contributions.

Memory is where your computer temporarily stores the data it’s actively working on. The more memory you have—and the faster it is—the better your system can handle large, complex tasks. But here’s the problem: traditional memory technologies like DRAM (the kind found in most computers) are hitting limits. You can only fit so much of it on a motherboard, and high-capacity modules are expensive.

CXL is a new technology that enables computers to increase memory capacity by using PCIe connections, which are also used for devices like graphics cards and SSDs. While CXL memory may not match the speed of primary system memory, it can provide additional resources that support overall system performance.

This allows a system’s memory to be expanded beyond the physical limitations of the motherboard, providing a more cost-effective and adaptable solution.
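On Linux, a CXL Type 3 memory expander typically appears to the operating system as a CPU-less NUMA node alongside the regular DRAM nodes. As a rough illustration (the sysfs paths below are the standard kernel layout; which node is the CXL one varies by platform), a short Python sketch can list the nodes:

```python
# Quick look at NUMA nodes on Linux: a CXL Type 3 expander typically
# shows up as a CPU-less node next to the regular DRAM nodes.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    mem_line = next(line for line in (node / "meminfo").read_text().splitlines()
                    if "MemTotal" in line)
    mem = mem_line.split(":")[1].strip()
    tag = "(no CPUs: likely CXL/expansion memory)" if not cpus else ""
    print(f"{node.name}: cpus=[{cpus or 'none'}] MemTotal={mem} {tag}")
```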

Micron and GIGABYTE teamed up to run a series of real-world tests using GIGABYTE’s R284-A91-AAL3 CXL server, Micron’s CZ122 CXL memory expansion modules, DDR5 RDIMMs, and NVMe SSDs. The demo components are as follows:
GIGABYTE R284-A91-AAL3

  • 2x Intel® Xeon® 6 CPU
  • 12-Channel DDR5 RDIMM
  • 16x E3.S 2T CXL Expansion
  • 4x E3.S Gen5 NVMe SSD
Micron CXL CZ122 Module

  • 128GB/256GB Capacity
  • Fully Supports CXL 2.0
  • Type 3 Memory Expansion
  • E3.S 2T Form Factor
Micron DDR5 RDIMM

  • 128GB Module Capacity
  • 6400 MT/s Data Rate
  • Innovative 1β technology
Micron 9550 NVMe SSD

  • 15TB Storage Capacity
  • NVMe 2.0/OCP 2.0 Support
  • G8 TLC NAND
  • E3.S 1T Form Factor

For clarity, we divided the tests into three categories, each highlighting a distinct benefit of CXL:
- CXL Memory Bandwidth Expansion
- CXL Memory Capacity Expansion
- CXL Cost Effectiveness
Software-Based Weighted Interleaving with CXL
Before evaluating the benchmarks, it is essential to consider a significant challenge of CXL: its lower performance relative to direct-attached memory. Unlike conventional DRAM, which interfaces directly with the CPU via dedicated memory channels, CXL memory operates across the PCIe interface. This indirect connectivity results in increased latency, so the CPU requires more time to access data stored in CXL memory than in DRAM.

To capitalize on the additional CXL memory, a technique known as Software-Based Weighted Interleaving has been employed. This method efficiently balances data allocation between DRAM and CXL memory. To evaluate its effectiveness, the team utilized Intel’s Memory Latency Checker (MLC)—a tool designed to assess memory bandwidth and latency across varying workloads—and conducted microbenchmark tests using different read/write patterns and memory distribution ratios between DRAM and CXL.

Each test used a weighted interleaving approach, where memory pages were split between DRAM and CXL based on user-defined weights. For example:
  • A weight of 3:1 (DRAM:CXL) means 75% of memory traffic goes to DRAM, and 25% to CXL.
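In general, the share of pages each tier receives is simply its weight divided by the sum of the weights. A trivial illustration of that arithmetic:

```python
# Map a DRAM:CXL interleave weight ratio to traffic shares: pages are
# dealt out round-robin in proportion to the weights.
def shares(dram_weight: int, cxl_weight: int) -> tuple[float, float]:
    total = dram_weight + cxl_weight
    return dram_weight / total, cxl_weight / total

print(shares(3, 1))  # (0.75, 0.25): 75% of pages to DRAM, 25% to CXL
```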

Imagine you’re driving from one city to another. There are four fast highway lanes (DRAM), but they’re getting crowded. Now you add a few slower side roads (CXL). Interleaving decides how to divide the traffic between them, so everything flows better. “Weighted” decides how much traffic takes the highway and how much traffic takes the side roads. It’s not just about speed—it’s about smart traffic control.

The weighted-interleaving feature, introduced in Linux kernel 6.9, allows fine-grained control over memory allocation between DRAM and CXL memory. This enables optimized bandwidth utilization by assigning memory pages based on workload characteristics.

With this setup, data flow can be optimized even though DRAM and CXL memory differ in latency and bandwidth.
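For reference, here is a minimal sketch of how the per-node weights can be set through the kernel's sysfs interface. The node numbers are assumptions (node 0 for local DRAM, node 2 for the CXL expander); on a real system they depend on the platform's NUMA topology, and the script must run as root:

```python
# Minimal sketch: configure Linux weighted interleaving (kernel 6.9+)
# by writing per-node weights to sysfs. Node IDs below are assumptions:
# node 0 = local DRAM, node 2 = CXL memory expander. Requires root.
from pathlib import Path

SYSFS = Path("/sys/kernel/mm/mempolicy/weighted_interleave")
WEIGHTS = {0: 3, 2: 1}  # 3:1 DRAM:CXL, i.e. roughly 75% of pages to DRAM

for node, weight in WEIGHTS.items():
    node_file = SYSFS / f"node{node}"
    node_file.write_text(str(weight))
    print(f"node{node} weight set to {node_file.read_text().strip()}")
```

Workloads can then be launched under the weighted-interleave memory policy (for example, with a numactl build that supports it). The microbenchmark results below show the normalized bandwidth achieved at each DRAM:CXL weight ratio: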
Workload: R (Read-Only)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 2 | 1 | 1.12 |
| 5 | 2 | 1.25 |
| 3 | 1 | 1.28 |

Workload: W2 (1R, 2W)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.22 |
| 2 | 1 | 1.34 |
| 7 | 3 | 1.38 |

Workload: W5 (1R, 1W)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.25 |
| 5 | 3 | 1.35 |
| 2 | 1 | 1.44 |

Workload: W10 (2R, 1W) NT

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.18 |
| 2 | 1 | 1.33 |
| 9 | 4 | 1.34 |
CXL Memory Bandwidth Expansion – Boosting Performance with More Bandwidth
This section examines the effect of introducing CXL memory into practical workloads.

In memory-intensive applications such as high-performance computing (HPC) and artificial intelligence (AI), increasing memory bandwidth can significantly enhance performance. This was clearly observed in our results.

Performance across all four tested workloads increased by 22% to 33%, with a geometric mean increase of 28% across all HPC and AI workloads.
HPCG

| DRAM Weight | CXL Weight | Performance (GFLOPS) | Increase |
|---|---|---|---|
| 1 | 0 | 94.32 | 1.00 |
| 3 | 1 | 120.75 | 1.28 |

Pot3D

| DRAM Weight | CXL Weight | Execution Time (s) | Speedup |
|---|---|---|---|
| 1 | 0 | 706 | 1.00 |
| 5 | 2 | 539 | 1.31 |

CloverLeaf

| DRAM Weight | CXL Weight | Execution Time (s) | Speedup |
|---|---|---|---|
| 1 | 0 | 116.74 | 1.00 |
| 9 | 4 | 87.53 | 1.33 |

FAISS

| DRAM Weight | CXL Weight | Output Token Latency (ms) | Speedup |
|---|---|---|---|
| 1 | 0 | 2.28 | 1.00 |
| 2 | 1 | 1.87 | 1.22 |

Below is an overview of the workloads we evaluated:

HPC Workloads
  • High-Performance Conjugate Gradients (HPCG): solves large, sparse linear systems using a multigrid-preconditioned conjugate gradient algorithm. Representative of scientific and engineering workloads with heavy memory access.
  • Pot3D: solves the 3D Poisson equation. Representative of molecular dynamics and physics problems involving 3D electrostatic potentials.
  • CloverLeaf: solves the compressible Euler equations on a grid. Used in astrophysics, nuclear simulations, and industrial shockwave modeling.
AI Workload
  • FAISS: performs Approximate Nearest Neighbor (ANN) search. Representative of AI workloads such as recommendation systems, vector search, and NLP embeddings.
CXL Memory Capacity Expansion – Scaling Up for Big Data
One advantage of CXL is its ability to expand memory capacity beyond the limits of the motherboard’s DIMM slots or the cost constraints of high-capacity DIMMs.

DuckDB, an analytical database engine, was used to test two benchmark suites: TPC-H and TPC-DS. TPC-H evaluates analytical queries on a simplified schema, while TPC-DS is a more complex benchmark designed to represent real-world retail database workloads with mixed query types.

Testing with DRAM + CXL and weighted interleaving resulted in:
- 2.93x improvement on TPC-H
- 2.01x improvement on TPC-DS

These outcomes indicate that CXL may contribute to enhanced performance in decision support systems and big data applications. This shows that CXL isn’t just about speed; it’s about enabling bigger, more complex workloads that otherwise wouldn’t fit in memory.
TPC-H

| Policy | Instances | Queries Per Minute | Max Memory (TB) | Perf. Gain |
|---|---|---|---|---|
| DRAM Only | 2 | 2.52 | 1.44 | 1.00 |
| DRAM+CXL Default | 4 | 2.83 | 1.87 | 1.12 |
| DRAM+CXL TPP | 4 | 3.02 | 1.93 | 1.20 |
| DRAM+CXL Interleave | 8 | 7.38 | 2.86 | 2.93 |

TPC-DS

| Policy | Instances | Queries Per Minute | Max Memory (TB) | Perf. Gain |
|---|---|---|---|---|
| DRAM Only | 4 | 1.84 | 1.56 | 1.00 |
| DRAM+CXL Default | 4 | 2.21 | 2.61 | 1.20 |
| DRAM+CXL TPP | 4 | 2.55 | 3.64 | 1.39 |
| DRAM+CXL Interleave | 6 | 3.70 | 3.77 | 2.01 |
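For readers who want to approximate the benchmark driver, a minimal DuckDB sketch using its built-in TPC-H extension is shown below. The scale factor and memory limit are illustrative placeholders, not the demo’s actual settings, and spreading an instance’s memory across DRAM and CXL is handled outside DuckDB (for example, via the kernel weighted-interleave policy shown earlier):

```python
# Minimal sketch of a single DuckDB TPC-H instance. sf and memory_limit
# are illustrative placeholders; memory tiering across DRAM and CXL is
# configured at the OS level, not inside DuckDB.
import duckdb

con = duckdb.connect()
con.execute("INSTALL tpch")             # fetch the TPC-H extension
con.execute("LOAD tpch")
con.execute("SET memory_limit='32GB'")  # cap this instance's memory use
con.execute("CALL dbgen(sf=10)")        # generate scale-factor-10 data
rows = con.execute("PRAGMA tpch(1)").fetchall()  # run TPC-H query 1
print(rows[:1])
```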
CXL Cost Effectiveness – Saving Money Without Sacrificing Speed
Finally, it is important to consider cost efficiency.

We evaluated the performance of CXL memory by running the Deep Learning Recommendation Model (DLRM), which serves as a rigorous benchmark due to its high memory requirements and sensitivity to latency, particularly with large embedding tables. If CXL performs well under these demanding conditions, it suggests suitability for a wide range of workloads.

The observed performance impact was minimal:
- Approximately 2% degradation when 50% of memory was allocated via CXL
- Around 9% reduction when 67% of memory utilized CXL

In practical terms, this suggests organizations can significantly reduce spending on large-capacity 128 GB RDIMMs, which may cost up to three times as much as 64 GB modules, while still retaining most of the system performance. That’s a small trade-off for a big cost saving, especially at scale.
| Configuration | DLRM Benchmark Score | Policy | Normalized Performance |
|---|---|---|---|
| 1.5TB DRAM Only | 17899 | - | 1.00 |
| 768GB DRAM + CXL | 17555 | SW Interleaving | 0.98 |
| 512GB DRAM + CXL | 16250 | SW Interleaving | 0.91 |
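The normalized column follows directly from the raw benchmark scores:

```python
# Derive the normalized-performance column from the raw DLRM scores in
# the table above (baseline: 1.5TB DRAM only).
baseline = 17899
for label, score in [("768GB DRAM + CXL", 17555), ("512GB DRAM + CXL", 16250)]:
    print(f"{label}: {score / baseline:.2f}")  # prints 0.98 and 0.91
```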

*Potential cost savings referenced above are subject to market changes at any time. For more information, please reach out to our sales contact.

Conclusion
CXL is still a relatively new technology, but this demo proves it’s ready for real-world use, especially when paired with smart memory-interleaving techniques. Whether you’re building AI models, analyzing massive datasets, or running simulations, CXL offers:
- More memory bandwidth for faster performance
- More capacity to handle larger workloads
- Lower costs without major trade-offs

And with companies like Micron and GIGABYTE leading the charge, the future of computing is looking a lot more scalable—and a lot more efficient.

For any questions regarding the technology or products featured in our demonstration, please contact us.