AI Data Center Design

The intended purpose of an AI data center must be understood up front, as it determines the type of facility to be designed. AI data centers are designed to support two distinct workloads: training, inference, or both. While both rely on high-performance computing (HPC) infrastructure, their design requirements, power consumption, cooling strategies, and networking configurations differ significantly.

Differences Between Inference and Training AI Data Centers

  • Training AI data centers focus on massive parallel computation, require high-density GPUs/TPUs, and prioritize throughput over latency.
  • Inference AI data centers are optimized for low-latency real-time processing, with distributed deployment models and moderate compute intensity.
  • Cooling, power, and network requirements differ significantly, impacting design strategies.

1. Workload Differences

| Aspect | Training AI Data Center | Inference AI Data Center |
| --- | --- | --- |
| Purpose | Develops and optimizes AI models | Executes pre-trained models in real time |
| Computational Demand | Extremely high (batch processing, iterative learning) | Moderate to high (low-latency, real-time processing) |
| Latency Sensitivity | Low (longer computation cycles acceptable) | Very high (must respond within milliseconds) |
| Processing Type | Large-scale matrix computations, parallelism | Lightweight, fast execution with smaller models |

2. Hardware & Infrastructure Differences

A. Compute Architecture

  • Training AI Data Centers
    • Require thousands of GPUs, TPUs, or other AI accelerators
    • Use massively parallel processing for deep learning models
    • High-density racks drawing 50–500kW+ per rack (see the rack-count sketch after this list)
    • Optimized for throughput, with latency budgets for both frontend and backend networks
  • Inference AI Data Centers
    • Use fewer GPUs or specialized inference accelerators (e.g., NVIDIA GPUs with TensorRT, Intel Gaudi/Habana accelerators, or Google Edge TPUs)
    • Lower power per rack (15kW–50kW typically)
    • Prioritize low latency over raw computing power
    • May integrate FPGA-based inference chips for edge processing
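
To make these densities concrete, here is a minimal sketch of how rack counts for a hypothetical 10MW hall fall out of the power budget; the per-rack figures are illustrative assumptions drawn from the ranges above, not vendor data:

```python
IT_LOAD_KW = 10_000  # assumed 10MW reference hall

# (label, kW per rack) -- illustrative densities from the ranges above
scenarios = [
    ("training, dense liquid-cooled", 120),
    ("training, extreme density", 500),
    ("inference, air-cooled", 30),
]

for label, kw_per_rack in scenarios:
    racks = IT_LOAD_KW / kw_per_rack
    print(f"{label}: ~{racks:.0f} racks at {kw_per_rack}kW each")
```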

B. Cooling Requirements

  • Training AI Data Centers
    • Extreme heat generation (due to sustained workloads)
    • Predominantly liquid cooling (typically greater than 80% of the load), with air cooling (~20%) handling supporting network and other equipment in high-density deployments
    • May include direct-to-chip cooling, rear-door, or immersion cooling
  • Inference AI Data Centers
    • Lower heat output allows a roughly 50/50 split between air and liquid cooling
    • Traditional air cooling with high-efficiency HVAC may be sufficient
    • More distributed edge inference nodes require specialized cooling at smaller sites

C. Power Consumption

  • Training AI Data Centers
    • Massive power consumption (exceeding 100MW for dense training AI centers)
    • Redundancy ratios of 7:8, or Tier II topologies, since training workloads tolerate interruptions and high IT availability is less critical
    • Short duration UPS systems (may be in rack) to allow for soft equipment shutdown, if needed
  • Inference AI Data Centers
    • Lower energy requirements (5MW–20MW typical)
    • Optimized for low-latency power delivery, often with edge-based UPS
    • Smaller form-factor compute units for distributed processing
    • High reliability UPS and generator systems (N+1) to increase availability

3. Networking & Storage Considerations

A. Network Infrastructure

  • Training AI Data Centers
    • Require ultra-high-speed interconnects (400Gbps InfiniBand, NVLink, RDMA)
    • Low network latency inside training clusters (essential for model parallelism)
      • InfiniBand cable runs typically kept under ~100m (see the transfer-time sketch after this list)
    • Dedicated storage clusters for massive dataset access
  • Inference AI Data Centers
    • Lower networking demand (typically 25–100Gbps Ethernet)
    • Optimized for low-latency data transfer rather than high-bandwidth aggregation
    • Often deployed in edge computing environments for localized AI processing
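
To illustrate why training fabrics demand so much more bandwidth, a rough back-of-envelope sketch follows; the model size, precision, and link rates are illustrative assumptions, and real all-reduce traffic depends on the collective algorithm and topology:

```python
# Ignores collective algorithms and compute/communication overlap; the
# point is the relative scale of training vs. inference fabrics.
model_params = 70e9      # assumed 70B-parameter model
bytes_per_param = 2      # fp16/bf16 gradients
payload_bits = model_params * bytes_per_param * 8

for label, gbps in [("400G InfiniBand (training fabric)", 400),
                    ("100G Ethernet (inference-class fabric)", 100)]:
    seconds = payload_bits / (gbps * 1e9)
    print(f"{label}: ~{seconds:.1f}s to move one full gradient copy per link")
```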

B. Storage Requirements

  • Training AI Data Centers
    • Need petabyte-scale storage for training datasets
    • Use NVMe SSDs and high-speed flash storage (aggregate throughput is sketched after this list)
    • Implement distributed file systems like Ceph, GPFS, Lustre
  • Inference AI Data Centers
    • Require fast-access storage, but data footprint is smaller
    • Can rely on tiered storage (SSD for active models, HDD for long-term storage)
    • Object storage (S3-compatible) commonly used for serving models
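
As a rough sizing illustration, the sketch below estimates the aggregate read bandwidth a training cluster might demand from its storage tier; the GPU count and per-GPU ingest rate are assumptions, not measured figures:

```python
GPUS = 4_096             # assumed training cluster size
MB_PER_S_PER_GPU = 200   # assumed data-loader ingest rate per GPU

aggregate_gb_s = GPUS * MB_PER_S_PER_GPU / 1_000
print(f"Aggregate read bandwidth: ~{aggregate_gb_s:.0f} GB/s")
# ~820 GB/s -- the scale at which distributed file systems such as
# Lustre, GPFS, or Ceph are typically deployed.
```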

4. Deployment & Scalability

| Factor | Training AI Data Center | Inference AI Data Center |
| --- | --- | --- |
| Scalability Model | Centralized (hyperscale or supercomputer clusters) | Distributed (cloud-edge hybrid, regional deployments) |
| Geographic Distribution | Fewer, but massive facilities | More widely distributed (closer to users) |
| Edge AI Deployment | Not common | Frequently deployed at edge locations |

5. Cost Considerations

  • Training AI Data Centers
    • Higher CapEx (Capital Expenditure) due to specialized hardware
    • More expensive cooling and power infrastructure to support a much higher rack/row/room capacity for dense GPU clusters
    • Higher power costs (longer, intensive training cycles)
  • Inference AI Data Centers
    • Lower upfront costs, but more deployments needed
    • Optimized for power efficiency and low operational costs
    • Flexible scaling based on demand (cloud-native inference solutions common)

Mechanical System Design Specific to AI Workloads

The increasing demand for AI-driven data centers requires highly efficient and resilient mechanical system design. With high-density server racks supporting large-scale machine learning models, traditional air cooling alone is often insufficient. A modern AI data center optimized for performance and sustainability benefits from a hybrid cooling approach: 80% water cooling and 20% air cooling. The design, equipment, and operational considerations for such a system are described below, using a 10MW data hall as the reference size.


1. Hybrid Cooling Strategy: Air and Water Cooling

In this design, 80% of the cooling load is managed through a water-based cooling system, while the remaining 20% is handled via air cooling. This hybrid approach is essential for high-density AI workloads, which generate significantly more heat than traditional IT loads.

  • Water Cooling (80%): Used for liquid-cooled servers, rear-door heat exchangers (RDHx), immersion cooling, and direct-to-chip cooling.
  • Air Cooling (20%): Supports traditional air-cooled IT racks, comfort cooling for personnel, and backup cooling strategies.

This combination ensures reliability, energy efficiency, and adaptability to varying IT loads.
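
The split itself is simple arithmetic over the hall's IT load; a minimal sketch, assuming the 10MW reference hall:

```python
IT_LOAD_MW = 10.0        # assumed reference hall size
WATER_FRACTION = 0.80    # hybrid split from the design above

water_mw = IT_LOAD_MW * WATER_FRACTION
air_mw = IT_LOAD_MW - water_mw
print(f"Water-cooled load: {water_mw}MW, air-cooled load: {air_mw}MW")  # 8MW / 2MW
```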


2. Key Mechanical Systems and Equipment

A 10MW data hall with an 80/20 water-air cooling split requires a robust infrastructure to maintain efficient heat dissipation, redundancy, and sustainability.  Below are the major mechanical components needed.

A. Water Cooling System (80%)

  1. Chilled Water Plant & Heat Rejection
    • Water-cooled chillers (high-efficiency centrifugal or screw chillers) -or- Air-cooled chillers (high-efficiency, economizer options)
    • Cooling towers (induced-draft, cross-flow, or counter-flow) -or- Closed loop fin-fan coolers
    • Chilled water pumps (primary and secondary loops)
  2. Technical Cooling Loop
    • Propylene Glycol Mixture: 30% propylene glycol (PG) / 70% water
      1. A typical mix that limits glycol degradation and keeps the system stable
      2. Prevents freezing and reduces corrosion
    • 100% water option for internal loops with stable temperature ranges and equipment rated for water, yielding improved efficiency
    • Cooling Distribution Units (CDUs) to transfer heat from the technical cooling loop to the chilled water heat rejection loop (loop flow rates are sketched after this list)
  3. Liquid Cooling Technologies
    • Direct-to-chip cooling (cold plates integrated into CPUs/GPUs)
    • Immersion cooling tanks (single-phase or two-phase immersion)
    • Rear-door heat exchangers (RDHx) (liquid-cooled doors attached to racks)
  4. Heat Rejection System
    • Dry coolers (for water economization in favorable climates)
    • Adiabatic cooling units (to enhance efficiency)
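
For the water-cooled portion, required loop flow follows from Q = ṁ·cp·ΔT. The sketch below assumes a 10K loop temperature difference and approximate fluid properties; actual design values vary with equipment and climate:

```python
Q_KW = 8_000        # 8MW water-cooled portion of the hall
DELTA_T_K = 10.0    # assumed supply/return temperature difference

# Approximate specific heats; real values depend on temperature and mix.
fluids = {
    "water (cp ~4.19 kJ/kg.K)": 4.19,
    "30% PG mix (cp ~3.8 kJ/kg.K)": 3.8,
}

for name, cp in fluids.items():
    m_dot = Q_KW / (cp * DELTA_T_K)  # kg/s, from Q = m_dot * cp * dT
    print(f"{name}: ~{m_dot:.0f} kg/s (~{m_dot * 3.6:.0f} m^3/h at ~1000 kg/m^3)")
```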

B. Air Cooling System (20%)

  1. Computer Room Air Handlers (CRAHs)
    • Chilled water-based air handling units, supported by the chilled water plant
    • Located in hot aisle containment zones
  2. Computer Room Air Conditioners (CRACs)
    • DX-based cooling
    • Deployed near traditional air-cooled racks
  3. Hot and Cold Aisle Containment
    • Enhances cooling efficiency through air management
    • Prevents hot air recirculation

C. Additional Cooling Distribution & Redundancy

  1. Pumped Refrigerant Systems (for additional heat removal)
  2. Redundancy of equipment and systems: N+1 or 2N configurations, depending on criticality

3. Design Considerations for a 10MW AI Data Hall

A. Cooling Load Breakdown

| Cooling Type | Percentage | Load (MW) |
| --- | --- | --- |
| Water Cooling | 80% | 8MW |
| Air Cooling | 20% | 2MW |

A 10MW IT load requires a total heat rejection capacity of ~12.5MW, accounting for a power usage effectiveness (PUE) of 1.25.
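
The arithmetic behind that figure, as a minimal sketch:

```python
IT_LOAD_MW = 10.0
PUE = 1.25  # stated design target

total_facility_mw = IT_LOAD_MW * PUE
print(f"Total facility power: ~{total_facility_mw}MW")
# Essentially all electrical input becomes heat, so the cooling plant
# must reject roughly the same ~12.5MW.
```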

B. Water Consumption & Efficiency

  • Water Usage Effectiveness (WUE): ≤0.2 L/kWh
  • Annual Water Consumption: ~20 million liters (site dependent; derived in the sketch after this list)
  • Closed-loop cooling can eliminate water consumption for cooling but requires additional power; the tradeoff between energy efficiency and water use should be studied for an informed decision
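
A sketch of how the ~20 million liter figure follows from the WUE target, assuming full utilization year-round:

```python
IT_LOAD_KW = 10_000
HOURS_PER_YEAR = 8_760   # assumes full utilization year-round
WUE_L_PER_KWH = 0.2      # stated WUE target

annual_liters = IT_LOAD_KW * HOURS_PER_YEAR * WUE_L_PER_KWH
print(f"~{annual_liters / 1e6:.1f} million liters/year")
# ~17.5 million liters, in line with the ~20 million liter
# (site dependent) figure above.
```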

C. Redundancy & Fault Tolerance – typical for AI

  • Chillers, Pumps, Heat Exchangers, Cooling towers/Fin fans: N+1 of either equipment or equipment line-ups for concurrent maintainability
  • CDUs and other water cooling equipment: N+1, or N+1 per array of equipment

4. Energy Efficiency & Sustainability

  • Use of Free Cooling (Water-Side Economization)
    • Reduces chiller energy consumption by up to 50% (estimated in the sketch after this list)
  • Heat Reuse Systems
    • Redirects waste heat for district heating or facility operations
  • Smart AI-Driven Cooling Management
    • Optimizes cooling loads dynamically
  • Use of Renewable Energy for Cooling Operations
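
As a rough illustration of the free-cooling claim, the sketch below estimates annual chiller energy saved; the chiller efficiency and economizer-hours fraction are assumptions that vary strongly with climate:

```python
COOLING_LOAD_KW = 12_500    # total heat rejection from the PUE example
CHILLER_KW_PER_KW = 0.15    # assumed chiller input per kW of cooling (COP ~6.7)
ECONOMIZER_FRACTION = 0.5   # assumed share of hours on free cooling

baseline_kwh = COOLING_LOAD_KW * CHILLER_KW_PER_KW * 8_760
saved_kwh = baseline_kwh * ECONOMIZER_FRACTION
print(f"Chiller energy saved: ~{saved_kwh / 1e6:.1f} GWh/year")
```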


Electrical System Design Specific to AI Workloads

Modern AI data centers require highly reliable and scalable electrical infrastructure to support intensive computing workloads. AI-driven operations, such as deep learning training and large-scale inference, demand a high-density power design with redundancy configurations ranging from none at all up to 3:4 or 7:8 ratios. Redundancy ratios between 3:4 and 7:8 ensure reliability, while UPS systems, generators, and intelligent PDUs safeguard operations. Advanced power management strategies, such as AI-driven load balancing and renewable energy integration, enhance sustainability and efficiency.


1. Electrical Load Breakdown & Redundancy Considerations

A 10MW data hall consists of multiple IT load clusters, each requiring a stable power source with backup and failover mechanisms. A redundancy ratio of 3:4 to 7:8 means that for every 3 to 7 units of active capacity, one additional unit is provisioned as backup. This ensures uninterrupted operation during component maintenance or failures.

Redundancy Configurations for AI Data Centers

| Redundancy Ratio | Effective Capacity | Usable IT Load | Backup Capacity |
| --- | --- | --- | --- |
| 3:4 | 13.3MW | 10MW | 3.3MW |
| 4:5 | 12.5MW | 10MW | 2.5MW |
| 5:6 | 12MW | 10MW | 2MW |
| 6:7 | 11.7MW | 10MW | 1.7MW |
| 7:8 | 11.4MW | 10MW | 1.4MW |

A 3:4 configuration provides more backup capacity and greater fault tolerance, whereas a 7:8 configuration optimizes efficiency while still maintaining reliability.
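
The table values follow directly from the ratio definition; a minimal sketch that reproduces them:

```python
USABLE_MW = 10.0  # IT load of the reference hall

for n in range(3, 8):  # ratios 3:4 through 7:8
    effective = USABLE_MW * (n + 1) / n  # N active units plus one backup
    backup = effective - USABLE_MW
    print(f"{n}:{n + 1} -> effective {effective:.1f}MW, backup {backup:.1f}MW")
```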


2. Electrical Infrastructure & Power Distribution

To ensure continuous and stable power, AI data centers (inference facilities in particular) rely on a multi-tiered power distribution system combining utility power, uninterruptible power supplies (UPS), and backup generators.

A. Primary Power Source: Utility Grid Connection

  • Dual-feed medium-voltage (MV) utility connections (typically 13.8kV to 33kV)
  • Switchgear and transformers step down utility voltage to 480V for distribution
  • Power Usage Effectiveness (PUE) Target: ~1.25 or lower

B. Power Distribution System

  1. Main Switchgear (13.8kV – 33kV)
    • Redundant feeders from separate substations
    • Load-sharing configurations (active-active or active-passive)
  2. Step-Down Transformers (13.8kV → 480V or 415V)
    • High-efficiency dry-type or oil-filled transformers
  3. Power Distribution Units (PDUs) – 480V to 208V/120V – optional
    • Deliver power to rack-level busbars or direct circuits (feeder currents are sketched after this list)
  4. Remote Power Panels (RPPs) – optional
    • Provide branch circuit protection and rack flexibility to serve network, storage and other racks beyond the AI-dedicated racks
    • More modular scalability
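
Feeder and busway ampacity at 480V falls out of the standard three-phase power formula, I = P / (√3 · V · PF); a sketch with an assumed 2MW lineup and power factor:

```python
import math

P_W = 2_000_000  # assumed 2MW distribution lineup
V_LL = 480       # line-to-line voltage after step-down
PF = 0.95        # assumed power factor

amps = P_W / (math.sqrt(3) * V_LL * PF)
print(f"~{amps:.0f}A per 2MW lineup at 480V")  # ~2500A -> large busway/feeder sizing
```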

C. Uninterruptible Power Supply (UPS) System

  • Online double-conversion UPS systems (N+1)
  • Typically 1.5–2.5MW per unit (see the sizing sketch after this list)
  • Lithium-ion battery banks for short-term power backup
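
A minimal N+1 sizing sketch, assuming 2MW units from the stated range; this matches the 5:6 example later in this section:

```python
import math

IT_LOAD_MW = 10.0
UNIT_MW = 2.0  # assumed unit size within the stated 1.5-2.5MW range

active_units = math.ceil(IT_LOAD_MW / UNIT_MW)  # units needed to carry the load
total_units = active_units + 1                  # plus one redundant unit (N+1)
print(f"{total_units} x {UNIT_MW}MW units = {total_units * UNIT_MW}MW total "
      f"for a {IT_LOAD_MW}MW load ({active_units}:{total_units} ratio)")
```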

D. Backup Power: Diesel Generators (N+1)

  • 500kW to 3MW per unit, multiple generators in parallel or dedicated to specific line-ups (distributed or block redundancy)
  • Total on-site capacity: 12–14MW (for 10MW IT load)
  • 10–72 hours of on-site fuel storage capacity (estimated in the sketch after this list)
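
A hedged fuel-volume estimate; the specific fuel consumption figure is an assumption in the range commonly used for planning:

```python
GEN_CAPACITY_MW = 12.0  # low end of the stated 12-14MW on-site capacity
FUEL_L_PER_KWH = 0.27   # assumed full-load specific consumption (planning figure)

for hours in (10, 48, 72):
    liters = GEN_CAPACITY_MW * 1_000 * hours * FUEL_L_PER_KWH
    print(f"{hours}h runtime: ~{liters / 1_000:.0f} thousand liters of diesel")
```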

E. Rack-Level Power Distribution

  • Busway Systems (overhead)
  • Rack PDUs (Redundant A/B Feeds) – optional
    • Intelligent PDUs with real-time power monitoring
    • Per-outlet metering for AI servers

3. Electrical Redundancy & Failover Strategies

A. Multi-Tiered Redundancy Design

To mitigate single points of failure, inference (and often training) AI data centers use:

  1. Utility Redundancy (Dual Feeders from Grid)
    • Active-passive switching or fully active-active
  2. UPS Redundancy (N+1)
    • Example: six 2MW units providing 12MW total UPS capacity for a 10MW load (5:6 ratio)
  3. Generator Redundancy (N+1)
    • Ensures power during long-term outages

B. Power Transfer Systems

  • Automatic Transfer Switches (ATS) and Static Transfer Switches (STS)
    • Instant failover between utility, UPS, and generator power
  • Paralleling Switchgear for Generator Synchronization – optional
    • Load balancing and staged startup

4. Energy Efficiency & Sustainability

A. High-Efficiency Electrical Components

  • Eco Mode UPS (greater than 98% efficiency)
  • High-Efficiency Transformers (DOE 2016 Standard)
  • Smart PDUs with AI-driven Load Balancing

B. On-Site Renewable Energy Integration

  • Solar PV Systems (5–10% site power)
  • Battery Energy Storage Systems (BESS)
  • Grid Demand Response for Peak Shaving

C. Waste Heat Recovery

  • Heat capture from power systems for facility heating