To start, the intention of the AI data center needs to be understood up front, as this changes the type of AI data center that will be designed. AI data centers are designed to support two distinct workloads: training and/or inference. While both rely on high-performance computing (HPC) infrastructure, their design requirements, power consumption, cooling strategies, and networking configurations vary significantly.
Differences Between Training and Inference AI Data Centers
Training AI data centers focus on massive parallel computation, require high-density GPUs/TPUs, and prioritize throughput over latency.
Inference AI data centers are optimized for low-latency real-time processing, with distributed deployment models and moderate compute intensity.
Cooling, power, and network requirements differ significantly, impacting design strategies.
1. Workload Differences
| Aspect | Training AI Data Center | Inference AI Data Center |
| --- | --- | --- |
| Purpose | Develops and optimizes AI models | Executes pre-trained models in real time |
| Computational Demand | Extremely high (batch processing, iterative learning) | Moderate to high (low-latency, real-time processing) |
| Latency Sensitivity | Low (longer computation cycles acceptable) | Very high (must respond within milliseconds) |
| Processing Type | Large-scale matrix computations, parallelism | Lightweight, fast execution with smaller models |
2. Hardware & Infrastructure Differences
A. Compute Architecture
Training AI Data Centers
Require thousands of GPUs, TPUs, or other AI accelerators
Use massively parallel processing for deep learning models
High-density racks with power draws of 50kW to 500kW+ per rack (typical rack footprint)
Optimized for throughput, with latency considered separately for the frontend and backend networks
Inference AI Data Centers
Use fewer GPUs or specialized inference accelerators (such as NVIDIA GPUs running TensorRT, Intel Habana Gaudi, or Google Edge TPUs)
Lower power per rack (15kW–50kW typically)
Prioritize low latency over raw computing power
May integrate FPGA-based inference chips for edge processing
B. Cooling Requirements
Training AI Data Centers
Extreme heat generation (due to sustained workloads)
Mostly liquid cooling (typically greater than 80%) for high-density deployments, with air cooling (~20%) for supporting network and other equipment
May include direct-to-chip cooling, rear-door, or immersion cooling
Inference AI Data Centers
Lower heat output allows for 50% air cooling, 50% liquid cooling
Traditional air cooling with high-efficiency HVAC may be sufficient
More distributed edge inference nodes require specialized cooling at smaller sites
C. Power Consumption
Training AI Data Centers
Massive power consumption (exceeding 100MW for dense training AI centers)
Redundancy ratios of 7:8, or Tier II topologies, since continuous IT availability is less critical for training workloads
Short duration UPS systems (may be in rack) to allow for soft equipment shutdown, if needed
Inference AI Data Centers
Lower energy requirements (5MW–20MW typical)
Optimized for low-latency power delivery, often with edge-based UPS
Smaller form-factor compute units for distributed processing
High reliability UPS and generator systems (N+1) to increase availability
3. Cost Differences
Training AI Data Centers
Higher CapEx (capital expenditure) due to specialized hardware
More expensive cooling and power infrastructure to support a much higher rack/row/room capacity for dense GPU clusters
Higher power costs (longer, intensive training cycles)
Inference AI Data Centers
Lower upfront costs, but more deployments needed
Optimized for power efficiency and low operational costs
Flexible scaling based on demand (cloud-native inference solutions common)
Mechanical System Design specific to AI workloads
The increasing demand for AI-driven data centers requires a highly efficient and resilient mechanical system design. With high-density server racks supporting large-scale machine learning models, traditional air cooling solutions are often insufficient. A modern AI data center optimized for performance and sustainability benefits from a hybrid cooling approach: 80% water cooling and 20% air cooling. The design, equipment, and operational considerations below use a 10MW data hall as the reference size.
1. Hybrid Cooling Strategy: Air and Water Cooling
In this design, 80% of the cooling load is managed through a water-based cooling system, while the remaining 20% is handled via air cooling. This hybrid approach is essential for high-density AI workloads, which generate significantly more heat than traditional IT loads.
Water Cooling (80%): Used for liquid-cooled servers, rear-door heat exchangers (RDHx), immersion cooling, and direct-to-chip cooling.
Air Cooling (20%): Supports traditional air-cooled IT racks, comfort cooling for personnel, and backup cooling strategies.
This combination ensures reliability, energy efficiency, and adaptability to varying IT loads.
2. Key Mechanical Systems and Equipment
A 10MW data hall with an 80/20 water-air cooling split requires a robust infrastructure to maintain efficient heat dissipation, redundancy, and sustainability. Below are the major mechanical components needed.
A. Water Cooling System (80%)
Propylene Glycol Mixture: 30% propylene glycol (PG) / 70% water
This is a typical mix to reduce PG decay and keep the system stabilized
Prevents freezing, reduces corrosion
100% water option – for internal loops with stable temperature ranges and equipment that will accept water, for improved efficiency
Coolant Distribution Units (CDUs) for interchange of heat from the technical cooling loop to the chilled water heat rejection loop
Liquid Cooling Technologies
Direct-to-chip cooling (cold plates integrated into CPUs/GPUs)
Immersion cooling tanks (single-phase or two-phase immersion)
Rear-door heat exchangers (RDHx) (liquid-cooled doors attached to racks)
Heat Rejection System
Dry coolers (for water economization in favorable climates)
Adiabatic cooling units (to enhance efficiency)
B. Air Cooling System (20%)
Computer Room Air Handlers (CRAHs)
Chilled water-based air handling units, supported by the chilled water plant
Located in hot aisle containment zones
Computer Room Air Conditioners (CRACs)
DX-based cooling
Deployed near traditional air-cooled racks
Hot and Cold Aisle Containment
Enhances cooling efficiency through air management
Prevents hot air recirculation
C. Additional Cooling Distribution & Redundancy
Pumped Refrigerant Systems (for additional heat removal)
Redundancy of equipment and systems as needed: N+1 or 2N configurations, depending on criticality
3. Design Considerations for a 10MW AI Data Hall
A. Cooling Load Breakdown
| Cooling Type | Percentage | Load (MW) |
| --- | --- | --- |
| Water Cooling | 80% | 8MW |
| Air Cooling | 20% | 2MW |
A 10MW IT load requires a total heat rejection capacity of ~12.5MW, accounting for a power usage effectiveness (PUE) of 1.25.
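The cooling split and heat-rejection figure above follow from a few lines of arithmetic. The sketch below reproduces them for the 10MW reference hall (the constants mirror the reference design; this is an illustration, not a sizing tool):

```python
# Cooling-load split and heat-rejection sizing for the 10MW reference hall.
IT_LOAD_MW = 10.0      # design IT load of the data hall
WATER_FRACTION = 0.80  # share of heat removed by the water loop
AIR_FRACTION = 0.20    # share of heat removed by air cooling
PUE = 1.25             # target power usage effectiveness

water_cooling_mw = IT_LOAD_MW * WATER_FRACTION  # 8.0 MW
air_cooling_mw = IT_LOAD_MW * AIR_FRACTION      # 2.0 MW

# Total facility power, and hence total heat to reject, scales with PUE:
total_heat_rejection_mw = IT_LOAD_MW * PUE      # 12.5 MW

print(f"Water loop:           {water_cooling_mw:.1f} MW")
print(f"Air side:             {air_cooling_mw:.1f} MW")
print(f"Total heat rejection: {total_heat_rejection_mw:.1f} MW")
```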
B. Water Consumption & Efficiency
Water Usage Effectiveness (WUE): ≤0.2 L/kWh
Annual Water Consumption: ~20 million liters (site dependent)
Closed-loop cooling can eliminate water consumption for cooling, but requires additional power; a balance of efficiency to water use should be studied for an informed decision
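The WUE target translates into an annual water figure directly, since WUE is defined as litres of water consumed per kWh of IT energy. The sketch below assumes the full 10MW IT load runs year-round; real consumption is site and climate dependent, which is why it lands slightly below the ~20 million litre planning figure above:

```python
# Annual water consumption implied by a WUE target.
IT_LOAD_MW = 10.0
WUE_L_PER_KWH = 0.2    # litres of water per kWh of IT energy
HOURS_PER_YEAR = 8760

it_energy_kwh = IT_LOAD_MW * 1000 * HOURS_PER_YEAR  # 87,600,000 kWh
annual_water_l = it_energy_kwh * WUE_L_PER_KWH      # ~17.5 million litres

print(f"Annual IT energy: {it_energy_kwh / 1e6:.1f} GWh")
print(f"Annual water use: ~{annual_water_l / 1e6:.1f} million litres")
```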
C. Redundancy & Fault Tolerance – typical for AI
Chillers, Pumps, Heat Exchangers, Cooling towers/Fin fans: N+1 of either equipment or equipment line-ups for concurrent maintainability
CDUs and other water cooling equipment: N+1, or N+1 per array of equipment
4. Energy Efficiency & Sustainability
Use of Free Cooling (Water-Side Economization)
Reduces chiller energy consumption by up to 50%
Heat Reuse Systems
Redirects waste heat for district heating or facility operations
Smart AI-Driven Cooling Management
Optimizes cooling loads dynamically
Use of Renewable Energy for Cooling Operations
Electrical System Design specific to AI workloads
Modern AI data centers require highly reliable and scalable electrical infrastructure to support intensive computing workloads. AI-driven operations, such as deep learning and large-scale inference tasks, demand a high-density power design with redundancy configurations ranging from none at all to ratios between 3:4 and 7:8. These redundancy ratios ensure reliability, while UPS, generators, and intelligent PDUs safeguard operations. Advanced power management strategies, such as AI-driven load balancing and renewable energy integration, enhance sustainability and efficiency.
A 10MW data hall consists of multiple IT load clusters, each requiring a stable power source with backup and failover mechanisms. The redundancy ratios—3:4 to 7:8—indicate that for every 3 to 7 units of active power, one additional unit is provisioned as backup. This ensures uninterrupted operation even during component maintenance or failures.
1. Redundancy Configurations for AI Data Centers
| Redundancy Ratio | Effective Capacity | Usable IT Load | Backup Capacity |
| --- | --- | --- | --- |
| 3:4 | 13.3MW | 10MW | 3.3MW |
| 4:5 | 12.5MW | 10MW | 2.5MW |
| 5:6 | 12MW | 10MW | 2MW |
| 6:7 | 11.7MW | 10MW | 1.7MW |
| 7:8 | 11.4MW | 10MW | 1.4MW |
A richer redundancy ratio (3:4) provides greater fault tolerance, whereas leaner ratios (7:8) optimize efficiency while still maintaining reliability.
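The capacities in the table follow directly from the N:(N+1) relationship: installed capacity is the usable IT load scaled by (N+1)/N. A short sketch reproducing the table values:

```python
# Capacities implied by N:(N+1) redundancy ratios for a 10MW usable IT load.
USABLE_IT_LOAD_MW = 10.0

for n in range(3, 8):  # ratios 3:4 through 7:8
    total_mw = USABLE_IT_LOAD_MW * (n + 1) / n  # installed (effective) capacity
    backup_mw = total_mw - USABLE_IT_LOAD_MW    # spare capacity held in reserve
    print(f"{n}:{n + 1}  total={total_mw:.1f}MW  backup={backup_mw:.1f}MW")
```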
2. Electrical Infrastructure & Power Distribution
To ensure continuous and stable power, inference AI data centers rely on a multi-tiered power distribution system with a combination of utility power, uninterruptible power supplies (UPS), and backup generators.
A. Primary Power Source: Utility Grid Connection
Dual-feed medium-voltage (MV) utility connections (typically 13.8kV to 33kV)
Switchgear and transformers step utility voltage down to 480V for distribution
Power Usage Effectiveness (PUE) Target: ~1.25 or lower
B. Power Distribution System
Main Switchgear (13.8kV – 33kV)
Redundant feeders from separate substations
Load-sharing configurations (active-active or active-passive)
Step-Down Transformers (13.8kV → 480V or 415V)
High-efficiency dry-type or oil-filled transformers
Power Distribution Units (PDUs) – 480V to 208V/120V – optional
Deliver power to rack-level busbars or direct circuits
Remote Power Panels (RPPs) – optional
Provide branch circuit protection and rack flexibility to serve network, storage and other racks beyond the AI-dedicated racks
More modular scalability
C. Uninterruptible Power Supply (UPS) System
Online double-conversion UPS systems (N+1)
Supports 1.5–2.5MW per unit, typically
Lithium-ion battery banks for short-term power backup
D. Backup Power: Diesel Generators (N+1)
500kW to 3MW per unit, multiple generators in parallel or dedicated to specific line-ups (distributed or block redundancy)
Total on-site capacity: 12–14MW (for 10MW IT load)
10–72 hours of fuel storage capacity
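A rough fleet and fuel-tank sizing can be sketched from the ranges above. The per-unit rating (2.5MW), runtime target (48 hours), and specific fuel consumption (~0.28 L/kWh at full load, a commonly cited figure for large diesel gensets) are illustrative assumptions, not values from this design:

```python
import math

# Rough N+1 generator fleet and fuel-tank sizing for the 10MW hall.
IT_LOAD_MW = 10.0
GEN_UNIT_MW = 2.5      # assumed per-unit rating (within the 0.5-3MW range)
FUEL_L_PER_KWH = 0.28  # assumed diesel consumption at full load
RUNTIME_HOURS = 48     # chosen within the 10-72 hour storage range

active_units = math.ceil(IT_LOAD_MW / GEN_UNIT_MW)  # units needed to carry load
total_units = active_units + 1                      # N+1 redundancy
installed_mw = total_units * GEN_UNIT_MW            # 12.5MW, within 12-14MW

fuel_litres = IT_LOAD_MW * 1000 * RUNTIME_HOURS * FUEL_L_PER_KWH

print(f"Generators: {total_units} x {GEN_UNIT_MW}MW = {installed_mw:.1f}MW")
print(f"Fuel storage for {RUNTIME_HOURS}h: ~{fuel_litres:,.0f} litres")
```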
E. Rack-Level Power Distribution
Busway Systems (overhead)
Rack PDUs (Redundant A/B Feeds) – optional
Intelligent PDUs with real-time power monitoring
Per-outlet metering for AI servers
3. Electrical Redundancy & Failover Strategies
A. Multi-Tiered Redundancy Design
To mitigate single points of failure, inferencing (and often training) AI data centers use:
Utility Redundancy (Dual Feeders from Grid)
Active-passive switching or fully active-active
UPS Redundancy (N+1)
Example: 12MW total UPS capacity for a 10MW load (5:6 ratio), each at 2MW
Generator Redundancy (N+1)
Ensures power during long-term outages
B. Power Transfer Systems
Automatic Transfer Switches (ATS) and Static Transfer Switches (STS)
Instant failover between utility, UPS, and generator power
Paralleling Switchgear for Generator Synchronization – optional
Load balancing and staged startup
4. Energy Efficiency & Sustainability
A. High-Efficiency Electrical Components
Eco Mode UPS (greater than 98% efficiency)
High-Efficiency Transformers (DOE 2016 Standard)
Smart PDUs with AI-driven Load Balancing
B. On-Site Renewable Energy Integration
Solar PV Systems (5–10% site power)
Battery Energy Storage Systems (BESS)
Grid Demand Response for Peak Shaving
C. Waste Heat Recovery
Heat capture from power systems for facility heating