Leadership Performance, Cost Efficient, Fully Open-Source
AMD Instinct™ MI350 Series
The AMD Instinct™ MI350 Series GPUs, launched in June 2025, represent a significant leap forward in data center computing, designed to accelerate generative AI and high-performance computing (HPC) workloads. Built on the cutting-edge 4th Gen AMD CDNA™ architecture and fabricated on TSMC's 3nm process, these GPUs deliver exceptional performance and energy efficiency for training massive AI models, high-speed inference, and complex HPC tasks such as scientific simulations and data processing. Featuring 288GB of HBM3E memory and up to 8TB/s of bandwidth, the MI350X and MI355X GPUs offer up to a 4x generational improvement in AI compute and up to a 35x boost in inference performance, positioning them as formidable competitors in the AI and HPC markets.
Optimize Next Gen Innovation with AMD ROCm™ Software
The AMD ROCm™ 7.0 software stack is a key differentiator that enables high-performance AI and HPC development with minimal code changes. AMD Instinct™ MI350 Series GPUs are fully optimized for leading frameworks such as PyTorch, TensorFlow, JAX, ONNX Runtime, Triton, and vLLM, and offer Day 0 support for popular models through automatic kernel generation and continuous validation. The ROCm 7.0 platform's combination of DeepEP pipelining, SGL cross-PD scheduling, and prefill/decode (PD) KV-cache transfers delivers significant advantages. AMD is a founding member of the PyTorch Foundation and actively contributes to OpenXLA and UEC, reinforcing its long-term commitment to open-source AI. With AMD Infinity Hub, users gain access to deployment-ready containers that simplify onboarding and accelerate time to value. AMD Instinct™ MI350 Series GPUs are purpose-built for scalable inference and training, with elastic scaling and vendor-agnostic optimization enhanced by full Kubernetes integration through the AMD GPU Operator.
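To make the "minimal code changes" point concrete, below is a minimal sketch, assuming a ROCm build of PyTorch and an Instinct GPU, of standard CUDA-style PyTorch code running unmodified: on ROCm, the familiar `torch.cuda` namespace is backed by HIP. The tensor sizes are arbitrary and chosen only for illustration.

```python
import torch

# On a ROCm build of PyTorch, torch.cuda is backed by the HIP runtime,
# so unmodified CUDA-style code runs on an AMD Instinct GPU.
if torch.cuda.is_available():
    device = torch.device("cuda")           # maps to HIP on ROCm
    print(torch.cuda.get_device_name(0))    # e.g. an Instinct accelerator
    print(torch.version.hip)                # HIP version string on ROCm builds

    # Arbitrary bfloat16 matmul, just to exercise the device.
    a = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
    c = a @ b
    print(c.shape)
```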
AMD Instinct

| Model | MI355X GPU | MI350X GPU | MI325X GPU |
|---|---|---|---|
| Process Technology (XCD / IOD) | TSMC N3P / TSMC N6 | TSMC N3P / TSMC N6 | TSMC N5 / TSMC N6 |
| GPU Architecture | AMD CDNA4 | AMD CDNA4 | AMD CDNA3 |
| GPU Compute Units | 256 | 256 | 304 |
| Stream Processors | 16,384 | 16,384 | 19,456 |
| Transistor Count | 185 Billion | 185 Billion | 153 Billion |
| MXFP4 / MXFP6 | 10.1 PFLOPS | 9.2 PFLOPS | N/A |
| INT8 / INT8 (Sparsity) | 5.0 / 10.1 POPS | 4.6 / 9.2 POPS | 2.6 / 5.2 POPS |
| FP64 (Vector) | 78.6 TFLOPS | 72.1 TFLOPS | 81.7 TFLOPS |
| FP8 / OCP-FP8 (Sparsity) | 5.0 / 10.1 PFLOPS | 4.6 / 9.2 PFLOPS | 2.6 / 5.2 PFLOPS |
| BF16 / BF16 (Sparsity) | 2.5 / 5.0 PFLOPS | 2.3 / 4.6 PFLOPS | 1.3 / 2.6 PFLOPS |
| Dedicated Memory Size | 288 GB HBM3E | 288 GB HBM3E | 256 GB HBM3E |
| Memory Bandwidth | 8 TB/s | 8 TB/s | 6 TB/s |
| Bus Interface | PCIe Gen5 x16 | PCIe Gen5 x16 | PCIe Gen5 x16 |
| Cooling | Passive & Liquid | Passive | Passive & Liquid |
| Maximum TDP/TBP | 1400W | 1000W | 1000W |
| Virtualization Support | Up to 8 partitions | Up to 8 partitions | Up to 8 partitions |
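For readers who want to sanity-check a few of the figures above at runtime, the following is a minimal sketch, assuming a ROCm build of PyTorch, that queries device properties; reported values vary with driver, ROCm version, and the active partition mode, so treat it as illustrative rather than authoritative.

```python
import torch

# Cross-check a few table specifications at runtime (illustrative only;
# values depend on driver, ROCm version, and the active partition mode).
props = torch.cuda.get_device_properties(0)
print(f"Device:        {props.name}")
print(f"Memory:        {props.total_memory / 1024**3:.0f} GiB")  # ~288 GB on MI350X/MI355X
print(f"Compute units: {props.multi_processor_count}")           # 256 CUs on CDNA4 parts
```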
What's New in ROCm 7.0?
- Expanded Hardware & Platform Support: ROCm 7 is fully compatible with AMD Instinct™ MI350 Series GPUs (including the MXFP6/MXFP4 data types) and extends development to select AMD Radeon™ GPUs and Windows environments, ensuring seamless performance across diverse hardware from cloud to edge.
- Advanced AI Features & Optimizations: ROCm 7 targets large-scale AI and LLM deployments with pre-optimized transformer kernels (OCP-FP8/MXFP8/MXFP6/MXFP4), integrated distributed inference via vLLM v1, llm-d, and SGLang (see the sketch after this list), and enhanced flash-attention and communication libraries for peak multi-GPU utilization.
- Optimized Performance: The ROCm 7 preview delivered up to 3.5x faster AI inference and 3x faster training than ROCm 6 by leveraging lower-precision data types and advanced kernel fusion to maximize GPU efficiency and reduce memory and I/O load. [1]
- Enabling Developer Success: The new ROCm Enterprise AI suite makes it easier to fine-tune models on domain-specific data and deploy AI services in production, streamlines installation with a simple `pip install rocm` flow, and supports advanced optimization features such as model quantization libraries to boost productivity and performance.
- Expanded Ecosystem & Community Collaboration: ROCm 7 deepens integration with leading AI and HPC models and frameworks, offering Day 0 support for PyTorch, TensorFlow, JAX, ONNX, and more, while giving organizations flexibility in model selection with over 2 million pre-trained models. Its broad ecosystem and open-source collaboration ensure stability, compatibility, and readiness for future workloads.
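As a concrete illustration of the vLLM integration called out above, here is a minimal offline-inference sketch, assuming a ROCm-enabled vLLM build and an Instinct GPU; the model name and sampling settings are illustrative, not prescriptive.

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch on a ROCm-enabled vLLM install.
# The model and sampling parameters are placeholders, not recommendations.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Summarize mixed-precision inference in one paragraph."], params)
for output in outputs:
    print(output.outputs[0].text)
```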
[1] (MI300-080): Testing by AMD as of May 15, 2025, measuring the inference performance in tokens per second (TPS) of AMD ROCm 6.x software, vLLM 0.3.3 vs. AMD ROCm 7.0 preview version SW, vLLM 0.8.5 on a system with (8) AMD Instinct MI300X GPUs running Llama 3.1-70B (TP2), Qwen 72B (TP2), and Deepseek-R1 (FP16) models with batch sizes of 1-256 and sequence lengths of 128-204. Stated performance uplift is expressed as the average TPS over the (3) LLMs tested. Results may vary.
AMD Instinct™ MI300 Series
Accelerators for the Exascale Era
- El Capitan and Frontier are the two fastest supercomputers on the TOP500 list and maintain outstanding energy efficiency on the GREEN500 list, powered by AMD EPYC™ processors and AMD Instinct™ GPUs and APUs. These technologies are now available in GIGABYTE servers for HPC, AI training and inference, and data-intensive workloads.
- With AMD's data center APUs and discrete GPUs, GIGABYTE has created and tailored powerful servers, in both passive air-cooled and liquid-cooled configurations, to deliver accelerators for the Exascale era. The AMD Instinct™ MI325X and MI300X GPUs are designed for AI training, fine-tuning, and inference. They are Open Accelerator Modules (OAMs) on a universal baseboard (UBB) housed inside GIGABYTE G-series servers. The AMD Instinct™ MI300A accelerated processing unit (APU), integrating CPU and GPU, targets HPC and AI, and it comes in a four-socket LGA design in GIGABYTE G383 series servers.
Select GIGABYTE for the AMD Instinct™ Platform
Compute Density
The 8U air-cooled G893 series and 4U liquid-cooled G4L3 series offer industry-leading compute density, delivering greater performance per rack.
High Performance
The custom 8-GPU UBB-based servers ensure stable, peak performance from both CPUs and GPUs, with priority given to signal integrity and cooling.
Scale-out
Multiple expansion slots can be populated with Ethernet or InfiniBand NICs for high-speed communication between connected nodes, as sketched below.
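As a minimal sketch of what such scale-out looks like in practice, the snippet below runs a cross-node all-reduce with `torch.distributed`; it assumes ROCm builds of PyTorch, where the "nccl" backend is implemented by RCCL, and the script and endpoint names are hypothetical.

```python
import os
import torch
import torch.distributed as dist

# Multi-node all-reduce sketch; launch with torchrun on every node.
# On ROCm builds of PyTorch, the "nccl" backend is implemented by RCCL,
# which uses the high-speed NICs described above for inter-node traffic.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1024, device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)    # summed across all ranks/nodes
print(f"rank {dist.get_rank()}: {x[0].item()}")
dist.destroy_process_group()
```

On a two-node cluster this might be launched on each node with something like `torchrun --nnodes=2 --nproc-per-node=8 --rdzv-backend=c10d --rdzv-endpoint=<head-node>:29500 allreduce.py`, with the NICs above carrying the RCCL traffic.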
Advanced Cooling
With the availability of server models using direct liquid cooling (DLC), CPUs and GPUs can dissipate heat faster and more efficiently with liquid cooling than with air.
Energy Efficiency
Real-time power management, automatic fan speed control, redundant Titanium PSUs, and the DLC option ensure optimal cooling and power efficiency.