
Micron & GIGABYTE CXL Workload Demo

Deep dive into CXL technology and its advantages for modern AI/HPC applications
Introduction
As technology advances, high-performance computing (HPC) and artificial intelligence (AI)-based services are increasingly integrated into everyday life. When discussing the enhancement of computational performance, attention is frequently directed toward CPUs and GPUs due to their significant processing capabilities. However, memory is equally essential to overall system performance, despite often receiving less recognition for its critical contributions.

Memory is where your computer temporarily stores the data it’s actively working on. The more memory you have—and the faster it is—the better your system can handle large, complex tasks. But here’s the problem: traditional memory technologies like DRAM (the kind found in most computers) are hitting limits. You can only fit so much of it on a motherboard, and high-capacity modules are expensive.

CXL is a new technology that enables computers to increase memory capacity by using PCIe connections, which are also used for devices like graphics cards and SSDs. While CXL memory may not match the speed of primary system memory, it can provide additional resources that support overall system performance.

This allows a system’s memory to be expanded beyond the physical limitations of the motherboard, providing a more cost-effective and adaptable solution.
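On Linux, a CXL Type 3 memory expander typically appears to the operating system as a CPU-less NUMA node alongside the regular DRAM nodes. As a rough illustration (the sysfs paths below are the standard kernel layout; which node is the CXL one varies by platform), a short Python sketch can list the nodes:

```python
# Quick look at NUMA nodes on Linux: a CXL Type 3 expander typically
# shows up as a CPU-less node next to the regular DRAM nodes.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    mem_line = next(line for line in (node / "meminfo").read_text().splitlines()
                    if "MemTotal" in line)
    mem = mem_line.split(":")[1].strip()
    tag = "(no CPUs: likely CXL/expansion memory)" if not cpus else ""
    print(f"{node.name}: cpus=[{cpus or 'none'}] MemTotal={mem} {tag}")
```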

Micron and GIGABYTE teamed up to run a series of real-world tests using GIGABYTE’s R284-A91-AAL3 CXL server, Micron’s CZ122 CXL memory expansion modules, DDR5 RDIMMs, and NVMe SSDs. The demo components are as follows:
GIGABYTE R284-A91-AAL3

  • 2x Intel® Xeon® 6 CPU
  • 12-Channel DDR5 RDIMM
  • 16x E3.S 2T CXL Expansion
  • 4x E3.S Gen5 NVMe SSD
Micron CXL CZ122 Module

  • 128GB/256GB Capacity
  • Fully Supports CXL 2.0
  • Type 3 Memory Expansion
  • E3.S 2T Form Factor
Micron DDR5 RDIMM

  • 128GB Module Capacity
  • 6400 MT/s Data Rate
  • Innovative 1β technology
Micron 9550 NVMe SSD

  • 15TB Storage Capacity
  • NVMe 2.0/OCP 2.0 Support
  • G8 TLC NAND
  • E3.S 1T Form Factor

For clarity, we divided the tests into three categories, each highlighting a distinct benefit of CXL:
- CXL Memory Bandwidth Expansion
- CXL Memory Capacity Expansion
- CXL Cost Effectiveness
Software-Based Weighted Interleaving with CXL
Before evaluating the benchmarks, it is essential to consider a significant challenge of CXL: its lower performance relative to direct-attached memory. Unlike conventional DRAM, which interfaces directly with the CPU via dedicated memory channels, CXL memory operates across the PCIe interface. This indirect connectivity results in increased latency, so the CPU requires more time to access data stored in CXL memory than in DRAM.

To capitalize on the additional CXL memory, a technique known as Software-Based Weighted Interleaving has been employed. This method efficiently balances data allocation between DRAM and CXL memory. To evaluate its effectiveness, the team utilized Intel’s Memory Latency Checker (MLC)—a tool designed to assess memory bandwidth and latency across varying workloads—and conducted microbenchmark tests using different read/write patterns and memory distribution ratios between DRAM and CXL.

Each test used a weighted interleaving approach, where memory pages were split between DRAM and CXL based on user-defined weights. For example:
  • A weight of 3:1 (DRAM:CXL) means 75% of memory traffic goes to DRAM, and 25% to CXL.
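In general, the share of pages each tier receives is simply its weight divided by the sum of the weights. A trivial illustration of that arithmetic:

```python
# Map a DRAM:CXL interleave weight ratio to traffic shares: pages are
# dealt out round-robin in proportion to the weights.
def shares(dram_weight: int, cxl_weight: int) -> tuple[float, float]:
    total = dram_weight + cxl_weight
    return dram_weight / total, cxl_weight / total

print(shares(3, 1))  # (0.75, 0.25): 75% of pages to DRAM, 25% to CXL
```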

Imagine you’re driving from one city to another. There are four fast highway lanes (DRAM), but they’re getting crowded. Now you add a few slower side roads (CXL). Interleaving decides how to divide the traffic between them, so everything flows better. “Weighted” decides how much traffic takes the highway and how much traffic takes the side roads. It’s not just about speed—it’s about smart traffic control.

The weighted-interleaving feature, introduced in Linux kernel 6.9, allows fine-grained control over memory allocation between DRAM and CXL memory. This enables optimized bandwidth utilization by assigning memory pages based on workload characteristics.

With this setup, data flow can be optimized even though DRAM and CXL memory differ in latency and bandwidth.
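For reference, here is a minimal sketch of how the per-node weights can be set through the kernel's sysfs interface. The node numbers are assumptions (node 0 for local DRAM, node 2 for the CXL expander); on a real system they depend on the platform's NUMA topology, and the script must run as root:

```python
# Minimal sketch: configure Linux weighted interleaving (kernel 6.9+)
# by writing per-node weights to sysfs. Node IDs below are assumptions:
# node 0 = local DRAM, node 2 = CXL memory expander. Requires root.
from pathlib import Path

SYSFS = Path("/sys/kernel/mm/mempolicy/weighted_interleave")
WEIGHTS = {0: 3, 2: 1}  # 3:1 DRAM:CXL, i.e. roughly 75% of pages to DRAM

for node, weight in WEIGHTS.items():
    node_file = SYSFS / f"node{node}"
    node_file.write_text(str(weight))
    print(f"node{node} weight set to {node_file.read_text().strip()}")
```

Workloads can then be launched under the weighted-interleave memory policy (for example, with a numactl build that supports it). The microbenchmark results below show the normalized bandwidth achieved at each DRAM:CXL weight ratio: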
Workload: R (Read-Only)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 2 | 1 | 1.12 |
| 5 | 2 | 1.25 |
| 3 | 1 | 1.28 |

Workload: W2 (1R, 2W)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.22 |
| 2 | 1 | 1.34 |
| 7 | 3 | 1.38 |

Workload: W5 (1R, 1W)

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.25 |
| 5 | 3 | 1.35 |
| 2 | 1 | 1.44 |

Workload: W10 (2R, 1W) NT

| DRAM Weight | CXL Weight | Normalized BW |
|---|---|---|
| 1 | 0 | 1.00 |
| 3 | 2 | 1.18 |
| 2 | 1 | 1.33 |
| 9 | 4 | 1.34 |
CXL Memory Bandwidth Expansion – Boosting Performance with More Bandwidth
This section examines the effect of introducing CXL memory into practical workloads.

In memory-intensive applications such as high-performance computing (HPC) and artificial intelligence (AI), increasing memory bandwidth can significantly enhance performance. This was clearly observed in our results.

Performance across all four tested workloads increased by 22% to 33%, with a geometric mean increase of 28% across all HPC and AI workloads.
HPCG

| DRAM Weight | CXL Weight | Performance (GFLOPS) | Increase |
|---|---|---|---|
| 1 | 0 | 94.32 | 1.00 |
| 3 | 1 | 120.75 | 1.28 |

Pot3D

| DRAM Weight | CXL Weight | Execution Time (s) | Speedup |
|---|---|---|---|
| 1 | 0 | 706 | 1.00 |
| 5 | 2 | 539 | 1.31 |

CloverLeaf

| DRAM Weight | CXL Weight | Execution Time (s) | Speedup |
|---|---|---|---|
| 1 | 0 | 116.74 | 1.00 |
| 9 | 4 | 87.53 | 1.33 |

FAISS

| DRAM Weight | CXL Weight | Output Token Latency (ms) | Speedup |
|---|---|---|---|
| 1 | 0 | 2.28 | 1.00 |
| 2 | 1 | 1.87 | 1.22 |

Below is an overview of the workloads we evaluated:

HPC Workloads
  • High-Performance Conjugate Gradients (HPCG): solves large, sparse linear systems using a multigrid-preconditioned conjugate gradient algorithm. Representative of scientific and engineering workloads with heavy memory access.
  • Pot3D: solves the 3D Poisson equation. Representative of molecular dynamics and physics problems involving 3D electrostatic potentials.
  • CloverLeaf: solves the compressible Euler equations on a grid. Used in astrophysics, nuclear simulations, and industrial shockwave modeling.
AI Workload
  • FAISS: performs Approximate Nearest Neighbor (ANN) search. Representative of AI workloads such as recommendation systems, vector search, and NLP embeddings.
CXL Memory Capacity Expansion – Scaling Up for Big Data
One advantage of CXL is its ability to expand memory capacity beyond the limits of the motherboard’s DIMM slots or the cost constraints of high-capacity DIMMs.

DuckDB, an analytical database engine, was used to test two benchmark suites: TPC-H and TPC-DS. TPC-H evaluates analytical queries on a simplified schema, while TPC-DS is a more complex benchmark designed to represent real-world retail database workloads with mixed query types.

Testing with DRAM + CXL and weighted interleaving resulted in:
- 2.93x improvement on TPC-H
- 2.01x improvement on TPC-DS

These outcomes indicate that CXL may contribute to enhanced performance in decision support systems and big data applications. This shows that CXL isn’t just about speed; it’s about enabling bigger, more complex workloads that otherwise wouldn’t fit in memory.
TPC-H

| Policy | Instances | Queries Per Minute | Max Memory (TB) | Perf. Gain |
|---|---|---|---|---|
| DRAM Only | 2 | 2.52 | 1.44 | 1.00 |
| DRAM+CXL Default | 4 | 2.83 | 1.87 | 1.12 |
| DRAM+CXL TPP | 4 | 3.02 | 1.93 | 1.20 |
| DRAM+CXL Interleave | 8 | 7.38 | 2.86 | 2.93 |

TPC-DS

| Policy | Instances | Queries Per Minute | Max Memory (TB) | Perf. Gain |
|---|---|---|---|---|
| DRAM Only | 4 | 1.84 | 1.56 | 1.00 |
| DRAM+CXL Default | 4 | 2.21 | 2.61 | 1.20 |
| DRAM+CXL TPP | 4 | 2.55 | 3.64 | 1.39 |
| DRAM+CXL Interleave | 6 | 3.70 | 3.77 | 2.01 |
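For readers who want to approximate the benchmark driver, a minimal DuckDB sketch using its built-in TPC-H extension is shown below. The scale factor and memory limit are illustrative placeholders, not the demo’s actual settings, and spreading an instance’s memory across DRAM and CXL is handled outside DuckDB (for example, via the kernel weighted-interleave policy shown earlier):

```python
# Minimal sketch of a single DuckDB TPC-H instance. sf and memory_limit
# are illustrative placeholders; memory tiering across DRAM and CXL is
# configured at the OS level, not inside DuckDB.
import duckdb

con = duckdb.connect()
con.execute("INSTALL tpch")             # fetch the TPC-H extension
con.execute("LOAD tpch")
con.execute("SET memory_limit='32GB'")  # cap this instance's memory use
con.execute("CALL dbgen(sf=10)")        # generate scale-factor-10 data
rows = con.execute("PRAGMA tpch(1)").fetchall()  # run TPC-H query 1
print(rows[:1])
```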
CXL Cost Effectiveness – Saving Money Without Sacrificing Speed
Finally, it is important to consider cost efficiency.

We evaluated the performance of CXL memory by running the Deep Learning Recommendation Model (DLRM), which serves as a rigorous benchmark due to its high memory requirements and sensitivity to latency, particularly with large embedding tables. If CXL performs well under these demanding conditions, it suggests suitability for a wide range of workloads.

The observed performance impact was minimal:
- Approximately 2% degradation when 50% of memory was allocated via CXL
- Around 9% reduction when 67% of memory utilized CXL

In practical terms, this suggests organizations can significantly reduce spending on large-capacity 128 GB RDIMMs, which may cost up to three times as much as 64 GB modules, while still retaining most of the system performance. That’s a small trade-off for a big cost saving, especially at scale.
| Configuration | DLRM Benchmark Score | Policy | Normalized Performance |
|---|---|---|---|
| 1.5TB DRAM Only | 17899 | - | 1.00 |
| 768GB DRAM + CXL | 17555 | SW Interleaving | 0.98 |
| 512GB DRAM + CXL | 16250 | SW Interleaving | 0.91 |
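The normalized column follows directly from the raw benchmark scores:

```python
# Derive the normalized-performance column from the raw DLRM scores in
# the table above (baseline: 1.5TB DRAM only).
baseline = 17899
for label, score in [("768GB DRAM + CXL", 17555), ("512GB DRAM + CXL", 16250)]:
    print(f"{label}: {score / baseline:.2f}")  # prints 0.98 and 0.91
```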

*Potential cost savings referenced above are subject to market changes at any time. For more information, please reach out to our sales contact.

Conclusion
CXL is still a relatively new technology, but this demo proves it’s ready for real-world use, especially when paired with smart memory-interleaving techniques. Whether you’re building AI models, analyzing massive datasets, or running simulations, CXL offers:
- More memory bandwidth for faster performance
- More capacity to handle larger workloads
- Lower costs without major trade-offs

And with companies like Micron and GIGABYTE leading the charge, the future of computing is looking a lot more scalable—and a lot more efficient.

For any questions regarding the technology or products featured in our demonstration, please contact us.