SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC


Examining what erasure coding throughput can be achieved with an AMD EPYC 7601 processor teamed up with the MemoScale Erasure Coding Library. The tests have been performed with a GIGABYTE MZ31-AR0 server motherboard.


INTRODUCTION

With the EPYC processor line, AMD is expected to take a strong position in the server market, including the data storage segment. In data storage servers, CPUs need to perform several types of compute intensive data processing workloads, such as erasure coding, deduplication and compression, to store data efficiently. This white paper examines what erasure coding throughput can be achieved with an AMD EPYC 7601 processor teamed up with the MemoScale Erasure Coding Library. The tests were performed with a GIGABYTE MZ31-AR0 server motherboard.

TRENDS - INCREASING DATA GROWTH AND FASTER STORAGE DEVICES

The amount of data being stored in the world doubles every two years. Methods which reduce the hardware footprint and cost of data storage, such as erasure coding, deduplication and compression, are therefore becoming imperative for dealing with the rapid growth of data. At the same time, much faster storage devices are entering the market, making it ever harder for processors to keep up with the required data processing throughput. In this white paper we take a closer look at the performance achieved when pushing an erasure coding workload to its maximum on an AMD EPYC CPU.

THE ROLE OF ERASURE CODING IN DATA STORAGE

As the size of data centers grows the probability of storage equipment failure or outage increases. This makes it essential to have protective measures in place to handle the failures which may occur on a daily or even hourly basis. Replication, i.e. making exact copies of data, has been the de facto method for protecting data against loss at a large scale, but the method has the obvious disadvantage of storing the data multiple times and thus often requires the same multiplier of costly storage equipment.

Erasure coding is an alternative to replication which follows the principles of RAID5 and RAID6: the data is separated into chunks, and additional redundancy chunks are computed which can be used to recover lost chunks of data. All the chunks are then distributed onto various storage media or failure domains. The main advantage of erasure coding is that it achieves the same or better protection against data loss as replication while reducing the total amount of data stored by 50 % to 80 %.
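The storage saving can be checked with simple arithmetic. The sketch below compares replication with a 10 + 4 erasure coding configuration (the one tested later in this paper); the replica counts are illustrative assumptions, not figures from the tests.

```python
def erasure_storage_factor(data_blocks, redundancy_blocks):
    """Bytes stored per byte of user data with erasure coding."""
    return (data_blocks + redundancy_blocks) / data_blocks


def reduction_vs_replication(copies, data_blocks, redundancy_blocks):
    """Fraction of stored data saved compared with keeping `copies` full replicas."""
    return 1.0 - erasure_storage_factor(data_blocks, redundancy_blocks) / copies


if __name__ == "__main__":
    # Replica counts of 2, 3 and 4 are assumed here purely for illustration.
    for copies in (2, 3, 4):
        saved = reduction_vs_replication(copies, 10, 4)
        print(f"{copies} copies vs 10+4 erasure coding: {saved:.1%} less data stored")
```

With 10 data blocks and 4 redundancy blocks the system stores 1.4 bytes per byte of user data, while still tolerating the loss of any 4 blocks when the code is constructed appropriately.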

Erasure coding comes with some costs, one being that the redundancy chunks need to be calculated and thus require some compute resources. Increasing the speed of encoding and decoding with erasure codes can increase the storage system throughput and reduce the retrieval latency.

AMD EPYC

With the launch of EPYC, AMD’s x86 server processor line based on the Zen microarchitecture, AMD delivers strongly on both performance and cost-performance ratio. Many IT organizations purchase dual socket servers and only populate a single socket. Others purchase dual socket servers, not because they need the compute capability, but because they need more I/O and/or memory capacity than what is available on current single socket servers. AMD EPYC enables no-compromise single socket servers with up to 32 cores, 8 memory channels and 128 PCIe® 3.0 lanes, providing capabilities and performance previously available only in dual socket architectures.

GIGABYTE SERVER MOTHERBOARDS

GIGABYTE puts three decades of know-how in motherboard design at the service of cutting edge server motherboards. GIGABYTE motherboards have been perfectly optimized for AMD EPYC, achieving one of the top scores on the SPEC CPU 2017 Benchmark for single and dual socket AMD EPYC systems*. Additional features of the MZ31-AR0 include dual 10GbE networking ports, an onboard M.2 port for dense high speed storage, and support for AMD’s Radeon Instinct MI25 GPU. GIGABYTE’s MZ31-AR0 motherboard also forms the base of their S451-Z30 4U 36 x 3.5” HDD Storage Server. Please see APPENDIX A for further information on the MZ31-AR0.

*As of May 2018

MEMOSCALE ERASURE CODING LIBRARY

The MemoScale Erasure Coding Library features optimized encoding and decoding with Reed-Solomon erasure codes for a wide range of processors. The library also supports various types of proprietary erasure coding algorithms which further improve performance as well as reduce network traffic and hardware costs. The MemoScale Erasure Coding Library can be integrated into proprietary storage systems. In addition, MemoScale provides erasure coding plugins for open source storage systems such as CEPH, SWIFT and HDFS.

SYSTEM CONFIGURATION

CPU AMD EPYC 7601
CPU-GHz 2.20 GHz
SOCKETS 1
CORES 32
THREADS 64
NUMA NODES 4
MEMORY CHANNELS 8
L1 CACHE, I/D 64KB / 32KB
L2 CACHE, PER/TOTAL 512KB / 16MB
L3 CACHE, PER/TOTAL 8MB / 64MB
MEMORY SIZE 64GB – 8 x 8GB
MEMORY SPEED 2666MHz
MEMORY TYPE DDR4 – 2666
LIBRARIES USED MemoScale Erasure Coding Library v.2.4.1
CPU PERFORMANCE GOVERNOR Performance
TURBO Disabled
OS Ubuntu 16.04.3 LTS
AMD LINUX KERNEL 4.13.0-43-generic
GCC/G++ 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04)
MOTHERBOARD GIGABYTE MZ31-AR0

HOW WE TESTED

The MemoScale Erasure Coding Benchmark Tool was used to assess the erasure coding performance on the test system. The benchmark tool provides a plugin system through which different erasure coding libraries can be loaded and benchmarked. Each thread used in the benchmark gets its own 1 GB of randomized data in pre-allocated buffers, where each individual buffer holds one block of 4 KB or 4096 KB, depending on the test. A tight loop then runs the encoding function of the specific erasure coding library a predefined number of times. Each iteration uses a random subset of 14 buffers, to force the buffers to be fetched from main memory and not from the cache.
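The loop described above can be sketched as follows. This is not the MemoScale Erasure Coding Benchmark Tool: `xor_encode` is a hypothetical stand-in encoder (plain XOR parity), the function names and parameters are assumptions, and the buffer pool is far smaller than the 1 GB per thread used in the real tests. Throughput counts only the data blocks, and 1 MB is assumed to mean 10^6 bytes.

```python
import os
import random
import time

K, M = 10, 4  # data and redundancy blocks per iteration (14 buffers in total)


def xor_encode(data, parity):
    """Stand-in encoder: writes the XOR of all data blocks into every parity block."""
    size = len(data[0])
    acc = 0
    for block in data:
        acc ^= int.from_bytes(block, "little")
    result = acc.to_bytes(size, "little")
    for p in parity:
        p[:] = result


def run_benchmark(encode, block_size, num_buffers, iterations, seed=0):
    """Return the encoding throughput in MB/s, counting data blocks only."""
    rng = random.Random(seed)
    # Pre-allocate randomized buffers before the timed loop starts.
    buffers = [bytearray(os.urandom(block_size)) for _ in range(num_buffers)]
    start = time.perf_counter()
    for _ in range(iterations):
        # A fresh random subset each iteration makes cache hits less likely.
        chosen = rng.sample(buffers, K + M)
        encode(chosen[:K], chosen[K:])
    elapsed = time.perf_counter() - start
    return iterations * K * block_size / elapsed / 1e6


if __name__ == "__main__":
    print(f"{run_benchmark(xor_encode, 4096, 256, 1000):.1f} MB/s")
```

A Python loop carries interpreter overhead that a native benchmark does not, so the numbers it prints are not comparable to the results in this paper; the sketch only illustrates the structure of the measurement.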

To ensure that no thread is rescheduled to another CPU core, each thread is pinned to a specific core during the whole benchmarking process. This preserves the CPU-locality of the allocated memory for that specific thread. Turbo boost was turned off.
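Thread pinning of this kind can be illustrated with the Linux affinity calls exposed by Python's `os` module. The helper names `pinned_worker` and `run_pinned` are hypothetical, and the benchmark tool itself presumably does this in native code; the sketch only shows the idea of one thread per core.

```python
import os
import threading


def pinned_worker(core, work, results, index):
    """Pin the calling thread to one core, then run the workload."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        # pid 0 refers to the calling thread, so only this thread is pinned.
        os.sched_setaffinity(0, {core})
    results[index] = work()


def run_pinned(work, num_threads):
    """Run `work` on `num_threads` threads, each pinned to its own core."""
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = list(range(os.cpu_count() or 1))
    results = [None] * num_threads
    threads = []
    for i in range(num_threads):
        # Wrap around if more threads than cores are requested.
        core = cores[i % len(cores)]
        t = threading.Thread(target=pinned_worker, args=(core, work, results, i))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results
```

Because memory is typically placed on the NUMA node of the core that first touches it, allocating and filling each thread's buffers after pinning keeps the data local to that core.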

The configuration used for testing the performance is 10 data blocks and 4 redundancy blocks with a Vandermonde-based Reed-Solomon erasure code. The tests were run with two different block sizes: 4 KB and 4096 KB. The decoding results included in this paper are for decoding of only one lost data block, which is the most common loss scenario in storage systems.
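To make the 10 + 4 configuration concrete, here is a minimal and unoptimized sketch of Vandermonde-style encoding and single-block recovery over GF(2^8). It is not the MemoScale implementation: the field polynomial (0x11d), the evaluation points and all function names are assumptions. The plain construction used here is only relied on for the single-loss case; production libraries transform the matrix so that any combination of four lost blocks can be recovered.

```python
import os

# Log/antilog tables for GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    # b must be non-zero.
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


K, M = 10, 4
# Vandermonde-form parity rows: row i, column j holds (j + 1) ** i in GF(2^8).
PARITY = [[gf_pow(j + 1, i) for j in range(K)] for i in range(M)]


def encode(data):
    """Compute the M redundancy blocks for K equally sized data blocks."""
    size = len(data[0])
    parity = []
    for row in PARITY:
        block = bytearray(size)
        for coeff, d in zip(row, data):
            for n in range(size):
                block[n] ^= gf_mul(coeff, d[n])
        parity.append(bytes(block))
    return parity


def recover_one(data, lost, parity_block, row_index):
    """Rebuild data[lost] from the other data blocks and one redundancy block."""
    row = PARITY[row_index]
    acc = bytearray(parity_block)
    for j, d in enumerate(data):
        if j == lost:
            continue
        for n in range(len(acc)):
            acc[n] ^= gf_mul(row[j], d[n])
    return bytes(gf_div(v, row[lost]) for v in acc)
```

Recovering one lost data block needs only a single redundancy block and one pass over the surviving data, whereas encoding computes four redundancy blocks, which is one plausible reason decoding throughput can exceed encoding throughput in this scenario.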

Performance has been measured in terms of the amount of data being encoded or decoded, excluding the redundancy blocks being calculated. To convert this to a throughput of both data and redundancy blocks, the measurements can be multiplied by a factor which reflects the overhead level used (1.4 for an erasure coding configuration with 10 data blocks and 4 redundancy blocks).
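As a small worked example of that conversion, the helper below scales a data-only measurement to the combined data and redundancy throughput. The function name is hypothetical; the sample value is the 32-thread, 4KB encoding result reported in this paper.

```python
def total_throughput(data_mb_per_s, data_blocks=10, redundancy_blocks=4):
    """Convert data-only throughput to data + redundancy throughput."""
    overhead = (data_blocks + redundancy_blocks) / data_blocks  # 1.4 for 10 + 4
    return data_mb_per_s * overhead


if __name__ == "__main__":
    measured = 88047.4  # MB/s of data encoded, 32 threads, 4KB blocks
    print(f"{total_throughput(measured):,.1f} MB/s including redundancy blocks")
```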

ENCODING AND DECODING DATA WITH ERASURE CODING

Encoding data with erasure codes is the process of generating the erasure coding redundancy blocks and it is done whenever data is written to storage systems which employ erasure coding. Higher encoding speeds can result in higher write throughput and reduced write latency.

Decoding data with erasure coding is the process of recovering original data blocks from other data and redundancy blocks. It is done in storage systems when the systems recover from failures of storage equipment or when degraded reads need to be done. A degraded read is the process of reading data from a storage system where one or more of the original data blocks are lost or temporarily unavailable and thus need to be decoded. Higher decoding speeds can result in higher throughput and reduced latency of degraded reads.

SINGLE CORE ENCODING & DECODING PERFORMANCE

Single Core Encoding Performance
Threads/Cores 1
4KB block size 5,110.45 MB/s
4096KB block size 5,398.41 MB/s
Single Core Decoding Performance
Threads/Cores 1
4KB block size 11,473.5 MB/s
4096KB block size 14,261.6 MB/s

MULTI CORE ENCODING & DECODING PERFORMANCE

Multi Core Encoding Performance (MB/s)
Threads/Cores       1           2           4           8           16          32
4KB block size      5,110.45    10,326.1    19,986.9    40,512.5    72,064.4    88,047.4
4096KB block size   5,398.41    10,908.7    21,586      43,115.9    71,899.5    83,669.6

Multi Core Decoding Performance (MB/s)
Threads/Cores       1           2           4           8           16          32
4KB block size      11,473.5    22,996.4    45,837.9    79,960.6    94,079.0    107,003.0
4096KB block size   14,261.6    27,575.5    57,175.1    85,776.8    95,671.6    110,967.0
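One way to read the multi core results is to compute the speedup over a single thread and the resulting parallel efficiency. The sketch below does this for two of the measured series; the function name is hypothetical, and the throughput values are copied from the results above.

```python
THREADS = [1, 2, 4, 8, 16, 32]
ENCODE_4KB = [5110.45, 10326.1, 19986.9, 40512.5, 72064.4, 88047.4]      # MB/s
DECODE_4096KB = [14261.6, 27575.5, 57175.1, 85776.8, 95671.6, 110967.0]  # MB/s


def scaling(threads, throughput):
    """Return (threads, speedup, efficiency) tuples relative to one thread."""
    base = throughput[0]
    return [(n, t / base, t / base / n) for n, t in zip(threads, throughput)]


if __name__ == "__main__":
    for name, series in (("encode 4KB", ENCODE_4KB), ("decode 4096KB", DECODE_4096KB)):
        for n, speedup, efficiency in scaling(THREADS, series):
            print(f"{name:14s} {n:2d} threads: {speedup:5.2f}x, efficiency {efficiency:.0%}")
```

Encoding scales close to linearly up to 8 threads, while decoding, which starts from a much higher single-thread rate, flattens out earlier. A plausible interpretation, not verified in this paper, is that the workload becomes limited by memory bandwidth rather than by compute at higher thread counts.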

CONCLUSION

In this paper we evaluated the erasure coding performance of the AMD EPYC 7601 teamed up with the MemoScale Erasure Coding Library. The tests were performed on a GIGABYTE MZ31-AR0 server motherboard.

The single core encoding performance of the EPYC is above 5 GB/s for both 4KB and 4096KB blocks. The single core decoding performance is above 11 GB/s for 4KB blocks and above 14 GB/s for 4096KB blocks. In multi core tests the 8 memory channels and 32 cores of the EPYC make it possible to reach an impressive encoding performance of 88 GB/s (4KB blocks) and a decoding performance of 111 GB/s (4096KB blocks).

The results demonstrate EPYC’s impressive ability to move and process large amounts of data fast. This could have implications for how AMD’s processors can be used as well as how storage systems can be constructed to make full use of this advantage. In the time to come we aim to benchmark other compute intensive workloads for storage systems on the EPYC processor using MemoScale software.

APPENDIX A - GIGABYTE MZ31-AR0 SERVER MOTHERBOARD

MZ31-AR0 Specifications:
  • AMD EPYC™ 7000 series processor family
  • 8-Channel RDIMM/LRDIMM DDR4, 16 x DIMMs
  • 2 x SFP+ 10Gb/s LAN ports (Broadcom® BCM 57810S)
  • 1 x dedicated management port
  • 4 x SlimSAS (for 16 x SATA 6Gb/s) ports
  • Ultra-Fast M.2 with PCIe Gen3 x4 interface
  • Up to 4 x PCIe Gen3 x16 slots and 3 x PCIe Gen3 x8 slots
  • Aspeed® AST2500 remote management controller