AI Data Center Design

To start, the intention of the AI data center needs to be understood up front, as this changes the type of AI data center that will be designed. AI data centers are designed to support two distinct workloads: training and/or inference. While both rely on high-performance computing (HPC) infrastructure, their design requirements, power consumption, cooling strategies, and networking configurations vary significantly.

Differences Between Inference and Training AI Data Centers


1. Workload Differences

Aspect | Training AI Data Center | Inference AI Data Center
Purpose | Develops and optimizes AI models | Executes pre-trained models in real-time
Computational Demand | Extremely high (batch processing, iterative learning) | Moderate to high (low-latency, real-time processing)
Latency Sensitivity | Low (longer computation cycles acceptable) | Very high (must respond within milliseconds)
Processing Type | Large-scale matrix computations, parallelism | Lightweight, fast execution with smaller models

2. Hardware & Infrastructure Differences

A. Compute Architecture

B. Cooling Requirements

C. Power Consumption


Networking & Storage Considerations

A. Network Infrastructure

B. Storage Requirements


Deployment & Scalability

Factor | Training AI Data Center | Inference AI Data Center
Scalability Model | Centralized (hyperscale or supercomputer clusters) | Distributed (cloud-edge hybrid, regional deployments)
Geographic Distribution | Fewer, but massive facilities | More widely distributed (closer to users)
Edge AI Deployment | Not common | Frequently deployed at edge locations

Cost Considerations


Mechanical System Design specific to AI workloads

The increasing demand for AI-driven data centers requires a highly efficient and resilient mechanical system design. With high-density server racks supporting large-scale machine learning models, traditional air cooling solutions are often insufficient. A modern AI data center optimized for performance and sustainability benefits from a hybrid cooling approach: 80% water cooling and 20% air cooling. The design, equipment, and operational considerations that follow use a 10MW data hall as the reference size.


1. Hybrid Cooling Strategy: Air and Water Cooling

In this design, 80% of the cooling load is managed through a water-based cooling system, while the remaining 20% is handled via air cooling. This hybrid approach is essential for high-density AI workloads, which generate significantly more heat than traditional IT loads.

This combination ensures reliability, energy efficiency, and adaptability to varying IT loads.


2. Key Mechanical Systems and Equipment

A 10MW data hall with an 80/20 water-air cooling split requires a robust infrastructure to maintain efficient heat dissipation, redundancy, and sustainability.  Below are the major mechanical components needed.

A. Water Cooling System (80%)

  1. Chilled Water Plant & Heat Rejection
    • Water-cooled chillers (high-efficiency centrifugal or screw chillers) -or- Air-cooled chillers (high-efficiency, economizer options)
    • Cooling towers (induced-draft, cross-flow, or counter-flow) -or- Closed loop fin-fan coolers
    • Chilled water pumps (primary and secondary loops)
  2. Technical Cooling Loop
    • Propylene Glycol Mixture: 30% propylene glycol (PG) / 70% water
      1. This is a typical mix to reduce PG decay and keep the system stabilized
      2. Prevents freezing, reduces corrosion
    • 100% water mixture option – for internal systems with stable temperature ranges and equipment that will accept water for improved efficiency
    • Cooling Distribution Units (CDU) for interchange of heat from the technical cooling loop to the chilled water heat rejection loop
  3. Liquid Cooling Technologies
    • Direct-to-chip cooling (cold plates integrated into CPUs/GPUs)
    • Immersion cooling tanks (single-phase or two-phase immersion)
    • Rear-door heat exchangers (RDHx) (liquid-cooled doors attached to racks)
  4. Heat Rejection System
    • Dry coolers (for water economization in favorable climates)
    • Adiabatic cooling units (to enhance efficiency)

B. Air Cooling System (20%)

  1. Computer Room Air Handlers (CRAHs)
    • Chilled water-based air handling units, supported by the chilled water plant
    • Located in hot aisle containment zones
  2. Computer Room Air Conditioners (CRACs)
    • DX-based cooling
    • Deployed near traditional air-cooled racks
  3. Hot and Cold Aisle Containment
    • Enhances cooling efficiency through air management
    • Prevents hot air recirculation

C. Additional Cooling Distribution & Redundancy

  1. Pumped Refrigerant Systems (for additional heat removal)
  2. Redundancy of equipment and systems as needed: N+1 or 2N configurations, depending on criticality

3. Design Considerations for a 10MW AI Data Hall

A. Cooling Load Breakdown

Cooling Type | Percentage | Load (MW)
Water Cooling | 80% | 8MW
Air Cooling | 20% | 2MW

A 10MW IT load requires a total heat rejection capacity of ~12.5MW, accounting for a power usage effectiveness (PUE) of 1.25.
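The load breakdown and heat-rejection figures above follow from simple arithmetic; a minimal sketch, using the 80/20 split and the assumed PUE of 1.25 from this design:

```python
# Cooling-load arithmetic for the reference 10MW data hall.
# The 80/20 water/air split and PUE of 1.25 are the design
# assumptions stated above; the rest is arithmetic.

IT_LOAD_MW = 10.0    # IT load of the reference data hall
WATER_SHARE = 0.80   # fraction of cooling on the water loop
AIR_SHARE = 0.20     # fraction handled by CRAH/CRAC air cooling
PUE = 1.25           # assumed power usage effectiveness

water_cooling_mw = IT_LOAD_MW * WATER_SHARE  # 8 MW
air_cooling_mw = IT_LOAD_MW * AIR_SHARE      # 2 MW

# Total facility power, and hence total heat to reject, scales with PUE.
total_heat_rejection_mw = IT_LOAD_MW * PUE   # 12.5 MW

print(f"Water cooling: {water_cooling_mw} MW")
print(f"Air cooling: {air_cooling_mw} MW")
print(f"Total heat rejection: {total_heat_rejection_mw} MW")
```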

B. Water Consumption & Efficiency

C. Redundancy & Fault Tolerance – typical for AI


4. Energy Efficiency & Sustainability



Electrical System Design specific to AI workloads

Modern AI data centers require highly reliable and scalable electrical infrastructure to support intensive computing workloads. AI-driven operations, such as deep learning and large-scale inference tasks, demand a high-density power design with redundancy configurations ranging from no redundancy up to 3:4 through 7:8 ratios. These redundancy ratios ensure reliability, while UPS systems, generators, and intelligent PDUs safeguard operations. Advanced power management strategies, such as AI-driven load balancing and renewable energy integration, enhance sustainability and efficiency.


1. Electrical Load Breakdown & Redundancy Considerations

A 10MW data hall consists of multiple IT load clusters, each requiring a stable power source with backup and failover mechanisms. The redundancy ratios (3:4 to 7:8) indicate that for every 3 to 7 units of active power, one additional unit is provisioned as backup. This ensures uninterrupted operation even during component maintenance or failures.

Redundancy Configurations for AI Data Centers

Redundancy Ratio | Effective Capacity | Usable IT Load | Backup Capacity
3:4 | 13.3MW | 10MW | 3.3MW
4:5 | 12.5MW | 10MW | 2.5MW
5:6 | 12MW | 10MW | 2MW
6:7 | 11.7MW | 10MW | 1.7MW
7:8 | 11.4MW | 10MW | 1.4MW

A higher redundancy ratio (3:4) provides greater fault tolerance, whereas lower ratios (7:8) optimize efficiency while still maintaining reliability.
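The table values can be reproduced directly from the ratio definition; a short sketch, assuming an N:N+1 configuration where N active units carry the 10MW IT load and one extra unit is installed:

```python
# For an N:N+1 redundancy ratio, installed (effective) capacity is
# usable_load * (N+1) / N, and backup capacity is the difference.

USABLE_IT_LOAD_MW = 10.0

for n in range(3, 8):  # ratios 3:4 through 7:8
    effective = USABLE_IT_LOAD_MW * (n + 1) / n
    backup = effective - USABLE_IT_LOAD_MW
    print(f"{n}:{n + 1}  effective={effective:.1f}MW  backup={backup:.1f}MW")
```

Running this reproduces the rows above, e.g. 3:4 gives 13.3MW effective with 3.3MW backup, and 7:8 gives 11.4MW effective with 1.4MW backup.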


2. Electrical Infrastructure & Power Distribution

To ensure continuous and stable power, inference AI data centers rely on a multi-tiered power distribution system with a combination of utility power, uninterruptible power supplies (UPS), and backup generators.

A. Primary Power Source: Utility Grid Connection

B. Power Distribution System

  1. Main Switchgear (13.8kV – 33kV)
    • Redundant feeders from separate substations
    • Load-sharing configurations (active-active or active-passive)
  2. Step-Down Transformers (13.8kV → 480V or 415V)
    • High-efficiency dry-type or oil-filled transformers
  3. Power Distribution Units (PDUs) – 480V to 208V/120V – optional
    • Deliver power to rack-level busbars or direct circuits
  4. Remote Power Panels (RPPs) – optional
    • Provide branch circuit protection and rack flexibility to serve network, storage and other racks beyond the AI-dedicated racks
    • More modular scalability

C. Uninterruptible Power Supply (UPS) System

D. Backup Power: Diesel Generators (N+1)

E. Rack-Level Power Distribution


3. Electrical Redundancy & Failover Strategies

A. Multi-Tiered Redundancy Design

To mitigate single points of failure, inferencing (and often training) AI data centers use:

  1. Utility Redundancy (Dual Feeders from Grid)
    • Active-passive switching or fully active-active
  2. UPS Redundancy (N+1)
    • Example: 12MW total UPS capacity for a 10MW load (5:6 ratio), using six 2MW modules
  3. Generator Redundancy (N+1)
    • Ensures power during long-term outages
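The N+1 UPS sizing in the example above can be sketched as a quick check; the 2MW module size and 10MW load are the figures from this document, and the ceiling-division idiom is just one way to compute the module count:

```python
# N+1 UPS sizing check: enough modules to carry the IT load
# even with one module out of service.

MODULE_MW = 2.0
IT_LOAD_MW = 10.0

modules_needed = int(-(-IT_LOAD_MW // MODULE_MW))  # ceil(10/2) = 5 active modules
modules_installed = modules_needed + 1             # N+1 -> 6 modules

total_capacity = modules_installed * MODULE_MW                 # 12 MW installed
capacity_after_failure = (modules_installed - 1) * MODULE_MW   # 10 MW with one down

# The N+1 criterion: the load is still covered after a single module failure.
assert capacity_after_failure >= IT_LOAD_MW
```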

B. Power Transfer Systems


4. Energy Efficiency & Sustainability

A. High-Efficiency Electrical Components

B. On-Site Renewable Energy Integration

C. Waste Heat Recovery


© 2022, Green Data Center Guide. All Rights Reserved.