C2G-Bench: Hierarchical AI Orchestration for Grid-Interactive Hyperscale Data Centers

1. Executive Summary

This project addresses the "AI-Energy Paradox" by transforming 250 MW+ hyperscale data centers from passive power consumers into active, grid-balancing assets. By establishing a formal Energy System Handshake, we enable data centers to provide wholesale Frequency Regulation, stabilizing the regional transmission grid in exchange for significant revenue and faster deployment permits.

We solve this using a Hierarchical AI Orchestration framework that bridges long-term energy market bidding (minutes/hours) and sub-second hardware physics. The framework evaluates the synergy between three critical control levers: Throttling Batch Workloads (DVFS), Modulating Cooling Thermal Inertia (CDU pump), and Dispatching Battery Energy Storage (BESS). This project delivers a high-fidelity cyber-physical benchmark for NeurIPS 2026, at the frontier of autonomous, grid-interactive infrastructure.

2. Background: Grid Frequency Regulation and RegD

What is the RegD signal?

Every power grid must hold its frequency very close to its nominal value, 60 Hz (US) or 50 Hz (EU), at all times. When a generator trips offline or a large load turns on suddenly, frequency deviates. Grid operators use Automatic Generation Control (AGC) to recruit fast-response providers: assets that can inject or absorb power within seconds to correct the imbalance.

FERC Order 755 (2011) created a pay-for-performance market for exactly this. Instead of paying only for available capacity (MW committed), it mandates that grid operators also pay for accuracy — how precisely an asset tracks the real-time regulation signal. PJM (the largest US grid operator) implemented this as the RegD signal: the "D" stands for dynamic, meaning it is designed for fast-response resources such as batteries and flexible loads.

How the RegD signal works

Every 2–5 seconds, the grid operator broadcasts a normalized signal value:

$$\text{RegD}(t) \in [-1,\, +1]$$

The sign convention is:

Signal value	Grid instruction	Data center must...
+1	Grid has excess load — reduce grid draw	Shed batch load, discharge BESS, or slow cooling
−1	Grid has excess generation — absorb more	Increase batch load, charge BESS, or raise cooling
0	Balanced	Hold current power level

The actual MW response required is:

$$\Delta P_{\text{demanded}} = C_{\text{MW}} \times \text{RegD}(t)$$

where $C_{\text{MW}}$ (committed_mw) is the regulation capacity the data center has pre-contracted to the market for the current 15-minute settlement interval.

Statistical properties (AR(1) model)

The RegD signal is statistically modelled as a first-order autoregressive (AR(1)) process: persistent but zero-mean. At the 5-minute scale it has autocorrelation ρ ≈ 0.80, which rescales to ρ = 0.80^(1/60) ≈ 0.996 at the 5-second simulation step used in C2G-Bench (60 steps per 5 minutes). Because the signal is zero-mean, it averages to approximately zero over a settlement period, so in expectation the data center neither gains nor loses net energy from providing regulation.

In c2g_env/physics/macro_grid.py:

self._regd_state = rho * self._regd_state + sigma * noise  # AR(1) update
regd = np.clip(self._regd_state, -1.0, 1.0)               # clip to [-1, 1]
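
The per-step coefficient follows from matching the 5-minute autocorrelation: 60 steps of 5 s make up 5 minutes, so the 5-second coefficient is 0.80 raised to the power 1/60. The sketch below is a stand-alone illustration, not the code in macro_grid.py. The function names, the noise scale `sigma`, and the seed are assumptions chosen for the example.

```python
import random

def rescale_rho(rho_coarse: float, coarse_s: float, fine_s: float) -> float:
    """Per-step AR(1) coefficient that reproduces rho_coarse over coarse_s seconds."""
    steps = coarse_s / fine_s
    return rho_coarse ** (1.0 / steps)

def simulate_regd(n_steps: int, rho: float, sigma: float, seed: int = 0) -> list[float]:
    """Roll x_t = rho * x_{t-1} + sigma * noise and clip each sample to [-1, 1]."""
    rng = random.Random(seed)
    state, out = 0.0, []
    for _ in range(n_steps):
        state = rho * state + sigma * rng.gauss(0.0, 1.0)
        out.append(max(-1.0, min(1.0, state)))
    return out

# 5-minute autocorrelation of 0.80 mapped onto 5-second steps
rho_5s = rescale_rho(0.80, coarse_s=300.0, fine_s=5.0)

# One 15-minute interval (180 ticks); sigma = 0.05 is an illustrative value
signal = simulate_regd(n_steps=180, rho=rho_5s, sigma=0.05)
print(round(rho_5s, 4))
```

Raising `rho_5s` to the 60th power recovers 0.80, which is a quick way to check any rescaled coefficient.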

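The performance score

Under FERC Order 755, payment depends on a performance score that measures how closely the actual response follows the demanded signal; here it is treated as the correlation between the two. PJM's own score additionally includes delay and precision components, and mileage (the total absolute movement of the signal) is a separate settlement quantity. A score of 1.0 means perfect tracking and 0.0 means the response is uncorrelated with the signal; falling below a threshold (typically 0.75) can result in loss of payment and suspension from the market. This maps directly to the β tracking term in the C2G reward function.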
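
As a rough illustration of a correlation-based score, the sketch below computes the Pearson correlation between a demanded and a delivered power series. It covers only the correlation idea described above, not any operator's settlement formula. The function name and the choice to floor negative correlations at 0 are assumptions.

```python
import math

def correlation_score(demanded: list[float], delivered: list[float]) -> float:
    """Pearson correlation between demanded and delivered power, floored at 0."""
    n = len(demanded)
    mean_d = sum(demanded) / n
    mean_a = sum(delivered) / n
    cov = sum((d - mean_d) * (a - mean_a) for d, a in zip(demanded, delivered))
    var_d = sum((d - mean_d) ** 2 for d in demanded)
    var_a = sum((a - mean_a) ** 2 for a in delivered)
    if var_d == 0.0 or var_a == 0.0:
        return 0.0  # a flat signal or flat response carries no tracking information
    # Flooring at 0 is an assumption: an anti-correlated response scores as no tracking
    return max(0.0, cov / math.sqrt(var_d * var_a))

demanded = [0.0, 3.0, 6.0, 3.0, -3.0, -6.0]           # MW, illustrative values
perfect = correlation_score(demanded, demanded)
scaled = correlation_score(demanded, [0.5 * d for d in demanded])
inverted = correlation_score(demanded, [-d for d in demanded])
```

Correlation is invariant to scaling, so `scaled` evaluates to the same score as `perfect` even though only half the requested power is delivered. An absolute-error term such as the β term in the reward does penalise that shortfall.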

Why a data center is uniquely suited

A 250 MW hyperscale facility has three fast-response levers unavailable to most grid assets:

Batch compute DVFS — schedulable HPC/AI training jobs can be throttled in milliseconds via CPU/GPU frequency scaling. Service capacity is capped at p_flex_max × throttle (~90 MW); unserved work is deferred into a FIFO queue (not dropped) and served when capacity recovers. Average queue delay is tracked via Little's Law and exposed in obs[16] (backlog_norm).
BESS — the on-site 15 MWh / 5 MW battery can charge or discharge at full rate in under 100 ms, providing the fastest regulation response.
Thermal inertia (CDU pump) — the liquid cooling loop acts as a thermal capacitor (τ ≈ 12.7 min). Slowing the pump briefly stores heat in the water loop without immediately raising server temperatures, providing ~5–10 MW of additional regulation headroom for short intervals.

These three levers in combination can follow a RegD signal far more accurately than a single-asset provider, while the hierarchical RL agent learns the optimal trade-off between grid revenue, compute throughput, and thermal safety.

3. Problem Statement: The "Handshake" Gap

Current data center management systems are "grid-blind": they optimize internal efficiency (PUE) while ignoring the real-time needs of the regional energy system.

The Grid Need: Modern grids require large loads to respond to Frequency Regulation signals (e.g., PJM RegD) every 2–4 seconds to balance renewable energy volatility.
The Datacenter Barrier: Standard AI controllers cannot track these high-speed signals because they do not account for the non-linear physics of liquid cooling, battery degradation, and the bursty nature of GenAI workloads.
The Objective: Create a synergy where the data center tracks the grid's power signal as closely as possible without violating hardware safety limits or AI training SLAs.

4. State-of-the-Art and Our Contribution

SOTA	Gap	Our Step Further
Wang et al., 2019 — Proved DCs can follow grid signals using DVFS.	Used "dummy loads" to intentionally waste power to meet the signal.	We use BESS + thermal storage synergy — no wasted power.
Fu et al., 2021 — Demonstrated cooling systems have "thermal inertia" for grid services.	Relies on classical MPC, which fails under unpredictable GenAI serving spikes.	We replace MPC with Hierarchical RL to handle extreme, non-linear volatility of Alibaba GenAI traces.
Li et al., 2026 — Identifies the need for intelligent VPP aggregation.	Lacks a standardized, high-fidelity physical testbed for datacenters.	We provide the first 250 MW-scale evaluation testbed with real data across 6 global energy markets.

5. Technical Solution: Hierarchical AI Orchestration

5.0. Formal MDP Specification

C2G-Bench defines a two-level hierarchical Markov Decision Process. The two agents share no parameters and communicate only through the inner_action_fn interface.

Lower-Level MDP — C2GFastEnv (5-second ticks)

$$M_{\text{low}} = (\mathcal{S},\, \mathcal{A},\, P,\, R,\, \gamma,\, T)$$

Symbol	Definition
$\mathcal{S} \subset \mathbb{R}^{18}$	Normalised observation vector (see §5.2 for index definitions)
$\mathcal{A} = [0,1]^3 \times [-1,1]$	Continuous 4-D action: throttle, pump speed, HVAC effort, BESS dispatch
$P(s_{t+1} \mid s_t, a_t)$	Deterministic physics step + stochastic AR(1) RegD signal (see §2)
$R(s_t, a_t)$	7-term scalar reward (see §5.3)
$\gamma = 0.99$	Training discount; undiscounted episodic sum used for benchmark ranking
$T = 17{,}280$	Steps per episode (24 h at 5 s per step)

The only stochasticity in $P$ arises from the AR(1) process driving RegD$(t)$. All physics engines (thermal, BESS, electrical) are deterministic given $(s_t, a_t)$. A fixed seed fully determines the trajectory.

Terminal states: the episode ends early when any of three hard constraints is violated, namely thermal fault ($T > 35\,^\circ\text{C}$), frequency fault ($|\Delta f| > 0.5\,\text{Hz}$), or voltage fault ($v_\text{pcc} < 0.90\,\text{pu}$).

Upper-Level Semi-MDP — C2GMacroEnv (15-minute ticks)

The macro agent is framed as a Semi-MDP (Sutton et al., 1999) with fixed option duration $K = 180$ sub-steps:

$$M_{\text{macro}} = (\mathcal{S}_M,\, \mathcal{A}_M,\, P_M,\, R_M,\, \gamma_M,\, T_M)$$

Symbol	Definition
$\mathcal{S}_M \subset \mathbb{R}^{19}$	Aggregated sub-step states: component-wise means + SOC endpoint + extrema + market context
$\mathcal{A}_M = [0,1]^2$	2-D: `bid_mw_norm` (MW capacity to offer), `bid_price_norm` (asking price)
$P_M$	$K$ applications of the lower-level transition $P$
$R_M$	$\lambda_{\text{rev}} \times \text{regulation revenue} + \bar{r}_K - \lambda_{\text{elec}} \times \text{electricity cost} - \lambda_{\text{churn}} \times \lvert \text{bid\_mw\_now} - \text{bid\_mw\_prev} \rvert$ (scaling factors given in §5.1)
$\gamma_M = \gamma^K$	$0.99^{180} \approx 0.164$ effective discount per macro step
$T_M = 96$	Macro steps per episode (24 h $\div$ 15 min)

where

$$\bar{r}_K = \frac{1}{K}\sum_{i=0}^{K-1} r_i$$

and $r_i$ is the 5-second reward at sub-step $i$. Thus, $\bar{r}_K$ is the mean of the 180 fast-step rewards within one macro step.

The macro agent never directly observes the 5-second physics — it sees only the aggregated $\mathcal{S}_M$. This induces partial observability at the macro level that the agent must compensate for through robust bidding policies.

5.1. Upper-Level Agent: The Market Orchestrator (15-min ticks)

Manages the "Business Handshake." Observes regional market prices, weather forecasts, and the Alibaba batch job queue.

Decision: "How much MW capacity should I bid to the grid operator, and at what price, for the next 15 minutes?"

Grid operators (PJM, ERCOT, etc.) clear ancillary-service markets in 15-minute settlement intervals. The MacroEnv implements a 3-phase market handshake each macro step: (1) the grid posts its regulation market clearing price (RMCP) and residual regulation need, (2) the DC agent bids MW capacity at an asking price, and (3) the grid probabilistically accepts the bid via a sigmoid function. If rejected, the DC falls back to a standing Demand Response (DR) baseline contract.

The macro agent's challenge is to bid optimally under uncertainty about the next 180 RegD ticks, the next GenAI spike, and how much thermal headroom will remain at the end of the interval. The correct strategy is context-dependent: bid aggressively when the BESS is full and LMP is high; bid conservatively when ambient temperature is near the thermal limit or SOC is low. The macro agent never sees individual 5-second ticks; it only receives an aggregated summary after the interval completes, making this a partially observable planning problem.

Action Space (2-D): [bid_mw_norm ∈ [0,1], bid_price_norm ∈ [0,1]] — MW capacity to offer (mapped to [0, committed_max_mw]) and asking price (mapped to [0, 2 × rmcp_max]).
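
The bid decoding and the probabilistic acceptance step can be sketched as follows. The mapping ranges come from the action-space definition above; the logistic form, its `steepness` parameter, the function names, and the example values for `rmcp_max` and `rmcp` are assumptions, since the text only states that acceptance follows a sigmoid.

```python
import math
import random

def decode_bid(bid_mw_norm: float, bid_price_norm: float,
               committed_max_mw: float, rmcp_max: float) -> tuple[float, float]:
    """Map the normalised 2-D action to (MW offered, asking price)."""
    mw = min(max(bid_mw_norm, 0.0), 1.0) * committed_max_mw
    price = min(max(bid_price_norm, 0.0), 1.0) * 2.0 * rmcp_max
    return mw, price

def acceptance_probability(ask_price: float, rmcp: float, steepness: float = 0.5) -> float:
    """Hypothetical logistic rule: asks below the clearing price are likely accepted."""
    return 1.0 / (1.0 + math.exp(-steepness * (rmcp - ask_price)))

def clear_bid(ask_price: float, rmcp: float, rng: random.Random) -> bool:
    """Draw accept/reject; on rejection the DC would fall back to its DR baseline."""
    return rng.random() < acceptance_probability(ask_price, rmcp)

# Illustrative numbers: 30 MW maximum commitment, assumed price ceiling of 40
mw, price = decode_bid(0.5, 0.25, committed_max_mw=30.0, rmcp_max=40.0)
p_accept = acceptance_probability(price, rmcp=30.0)
accepted = clear_bid(price, rmcp=30.0, rng=random.Random(0))
```

Under this assumed rule, an ask equal to the posted RMCP is accepted half the time, and undercutting the RMCP raises the acceptance probability at the cost of a lower price.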

Observation Space (19-D): Aggregated over 180 sub-steps:

Index	Name	Range	Description
0	`temp_A_mean`	[0, 1]	Mean Zone A temperature / T_safe
1	`temp_B_mean`	[0, 1]	Mean Zone B temperature / T_safe
2	`bess_soc_end`	[0, 1]	SOC at end of the macro-step
3	`p_base_mean`	[0, 1]	Mean p_base_norm
4	`p_facility_mean`	[0, 2]	Mean p_facility_norm
5	`regd_mean`	[0, 1]	Mean absolute RegD signal, |regd_signal|
6	`lmp_mean`	[0, 1]	Mean lmp_norm
7	`grid_load_mean`	[0, 1]	Mean load_norm
8	`tracking_err_mean`	[0, 2]	Mean normalised absolute tracking error
9	`is_spike_any`	{0, 1}	1.0 if any sub-step had a GenAI spike
10	`thermal_headroom_A`	[0, 1]	(T_safe − T_A_max) / T_safe
11	`thermal_headroom_B`	[0, 1]	(T_safe − T_B_max) / T_safe
12	`bid_mw_prev_norm`	[0, 1]	Previous macro-action bid MW
13	`bid_price_prev_norm`	[0, 1]	Previous macro-action bid price
14	`freq_dev_mean`	[-1, 1]	Mean normalised frequency deviation
15	`v_pcc_mean`	[0, 1.1]	Mean PCC voltage (per-unit)
16	`backlog_norm_mean`	[0, 2]	Mean batch queue depth / p_flex_max
17	`rmcp_norm`	[0, 5]	Grid's posted RMCP / rmcp_max
18	`reg_need_norm`	[0, 5]	Grid's residual regulation need / committed_max

Reward: $R_{\text{macro}} = \lambda_{\text{rev}} \times \text{regulation\_revenue} / 1000 + \bar{r}_K - \lambda_{\text{elec}} \times \text{electricity\_cost} / 1000 - \lambda_{\text{churn}} \times |\text{bid\_mw\_now} - \text{bid\_mw\_prev}|$
- $\lambda_{\text{rev}} = 1.0$, $\lambda_{\text{elec}} = 0.5$, $\lambda_{\text{churn}} = 0.05$ (all in c2g_env/config.yaml)
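
A minimal sketch of the macro reward, written directly from the formula and the default weights listed above. The function name is hypothetical, and the units of the revenue, cost, and bid arguments are assumptions (currency units for the first two, MW for the bids).

```python
def macro_reward(regulation_revenue: float, electricity_cost: float,
                 fast_rewards: list[float], bid_mw_now: float, bid_mw_prev: float,
                 lam_rev: float = 1.0, lam_elec: float = 0.5,
                 lam_churn: float = 0.05) -> float:
    """Macro-step reward: scaled revenue + mean fast reward - scaled cost - bid churn."""
    r_bar = sum(fast_rewards) / len(fast_rewards)  # mean of the K fast-step rewards
    return (lam_rev * regulation_revenue / 1000.0
            + r_bar
            - lam_elec * electricity_cost / 1000.0
            - lam_churn * abs(bid_mw_now - bid_mw_prev))

# Illustrative macro step: 180 fast rewards of 0.5 each and a 10 MW bid change
example = macro_reward(regulation_revenue=2000.0, electricity_cost=1000.0,
                       fast_rewards=[0.5] * 180, bid_mw_now=20.0, bid_mw_prev=10.0)
```

In this example the four terms are +2.0, +0.5, -0.5, and -0.5, so the churn penalty for a 10 MW bid swing is as large as the whole electricity-cost term.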

5.2. Lower-Level Agent: The Hardware Controller (5 s ticks)

Executes the physical "Handshake." Receives the real-time frequency regulation signal and uses four physical levers. The central difficulty is that these levers have fundamentally different dynamics, costs, and side-effects — the agent must learn to combine them in the right order:

Lever	Response time	Capacity	Side-effect	Role
BESS `action[3]`	<100 ms	5 MW / 15 MWh	Depletes; capacity fade over time	First resort — fastest, zero-penalty, finite
CDU pump `action[1]`	Minutes (τ ≈ 12.7 min)	~30 MW equivalent	Thermal inertia; slow and partially irreversible on short intervals	Second resort — exploits physics cheaply
HVAC `action[2]`	Seconds	~50 MW draw	Affects Zone B only; draws additional facility power	Defensive — prevents thermal fault, not a primary regulation lever
IT throttle `action[0]`	Milliseconds	Up to full flex load	Accrues FIFO backlog; cuts throughput revenue	Last resort — immediate but highest SLA cost

The optimal policy learns this hierarchy: use BESS first (fast, free), borrow thermal inertia second (slow, cheap), fall back to DVFS only when both are exhausted. This mirrors how FERC-paid fast-response providers operate in real ancillary-service markets.

Lever	Action dim	Range	Effect
IT (DVFS)	`action[0]`	[0, 1]	Throttles schedulable Alibaba batch jobs; GenAI/DLRM rigid loads unaffected
Cooling (CDU pump)	`action[1]`	[0, 1]	Modulates liquid cooling pump speed, exploiting thermal inertia
HVAC	`action[2]`	[0, 1]	Zone B air-side fan speed
BESS	`action[3]`	[-1, 1]	Charge (−) / discharge (+) the 15 MWh battery

Action Space (4-D, continuous): [throttle_batch, pump_speed_A, hvac_effort, bess_dispatch]

Observation Space (18-D, normalised):

Index	Name	Range	Description
0	`temp_A_norm`	[0, 2]	Zone A (liquid-cooled GPU) temperature / T_safe
1	`temp_B_norm`	[0, 2]	Zone B (air-cooled CPU) temperature / T_safe
2	`bess_soc`	[0, 1]	Battery state of charge
3	`p_base_norm`	[0, 1]	Rigid IT load (GenAI + DLRM)
4	`p_flex_nom_norm`	[0, 1]	New batch arrivals this tick (trace demand)
5	`p_facility_norm`	[0, 2]	Total facility power
6	`regd_signal`	[-1, 1]	Grid regulation signal (signed)
7	`lmp_norm`	[0, 1]	Locational marginal price
8	`grid_load_norm`	[0, 1]	Regional grid load stress indicator
9	`is_spike`	{0, 1}	GenAI serving spike flag
10	`prev_throttle`	[0, 1]	Previous DVFS throttle
11	`prev_pump_speed`	[0, 1]	Previous pump speed
12	`pue_norm`	[0, 2]	Current Power Usage Effectiveness
13	`T_amb_norm`	[0, 1]	Ambient temperature
14	`freq_dev_norm`	[-1, 1]	Normalised grid frequency deviation (swing equation)
15	`v_pcc_pu`	[0, 1.1]	PCC voltage in per-unit (Thévenin model)
16	`backlog_norm`	[0, 2]	Deferred batch queue depth / p_flex_max (Little's Law queue)
17	`committed_mw_norm`	[0, 1]	Current DR commitment / committed_mw_max

5.3. The NeurIPS Evaluation Metric: The Tracking Reward

The scalar reward received at every 5-second tick has seven additive terms:

$$\begin{aligned} \mathcal{R} =&\; \alpha \cdot u_{\text{thr}} \\ &- \beta \cdot \frac{|\Delta P_{\text{demand}} - \Delta P_{\text{actual}}|}{P_{\text{norm}}} \\ &- \gamma \cdot (T - T_{\text{warn}})^{+} \\ &- \delta_{\text{soc}} \cdot \mathbf{1}_{\text{soc}} \\ &- \delta_f \cdot (|\Delta f| - 0.2)^{+} \\ &- \delta_v \cdot \varepsilon_v \\ &- \delta_q \cdot \frac{Q_{\text{backlog}}}{P_{\text{flex,max}}} \end{aligned}$$

where:

$(x)^{+} = \max(0,\, x)$: ReLU / hinge; only positive exceedances are penalised
$u_{\text{thr}} \in [0,1]$: DVFS throttle fraction; fraction of flexible batch capacity currently committed
$\Delta P_{\text{demand}} = C_{\text{MW}} \times \text{RegD}(t)$: MW change requested by the grid operator this tick
$\Delta P_{\text{actual}} = P_{\text{flex,served}} + P_{\text{BESS,actual}}$: MW change the DC actually delivered
$P_{\text{norm}} = C_{\text{MW}} \times 1000$: normalisation constant (converts tracking error to a [0, ~2] range)
$T$: temperature of the hotter of the two cooling zones (°C)
$T_{\text{warn}} = 33\,^\circ\text{C}$: soft warning threshold; thermal penalty begins here, 2 °C before the hard trip
$\mathbf{1}_{\text{soc}}$: binary flag, 1 if BESS state of charge is below $\text{SOC}_{\min} + 2\%$ (i.e. below 12%), else 0
$|\Delta f|$: absolute grid frequency deviation (Hz) from the 60 Hz nominal
$\varepsilon_v = (0.95 - v_{\text{pcc}})^{+} + (v_{\text{pcc}} - 1.05)^{+}$: PCC voltage exceedance (pu) outside the ANSI C84.1 Range A band $[0.95, 1.05]$
$Q_{\text{backlog}}$: deferred batch work currently sitting in the FIFO queue (kW-equivalent)
$P_{\text{flex,max}} \approx 90{,}000\,\text{kW}$: peak flexible IT capacity at full throttle (1,200 racks × 75 kW)
Coefficients (all in config.yaml): $\alpha{=}1.0$, $\beta{=}2.0$, $\gamma{=}5.0$, $\delta_{\text{soc}}{=}0.5$, $\delta_f{=}2.0$, $\delta_v{=}5.0$, $\delta_q{=}2.0$
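
The seven terms can be written out as one function. This is a sketch assembled from the definitions above, not the benchmark's implementation. The function and argument names are hypothetical, power quantities are assumed to be in kW (consistent with the normalisation by C_MW × 1000), SOC_min is taken as 10%, and the guard against a zero commitment is an added assumption.

```python
def relu(x: float) -> float:
    """Hinge: keep only positive exceedances."""
    return max(0.0, x)

def fast_reward(u_thr: float, dp_demand_kw: float, dp_actual_kw: float,
                committed_mw: float, temp_c: float, soc: float,
                freq_dev_hz: float, v_pcc: float, backlog_kw: float,
                p_flex_max_kw: float = 90_000.0, t_warn: float = 33.0,
                soc_min: float = 0.10) -> float:
    """Seven-term scalar reward for one 5-second tick."""
    alpha, beta, gamma = 1.0, 2.0, 5.0
    d_soc, d_f, d_v, d_q = 0.5, 2.0, 5.0, 2.0
    p_norm = committed_mw * 1000.0
    # Assumption: with no commitment there is nothing to track
    tracking = abs(dp_demand_kw - dp_actual_kw) / p_norm if p_norm > 0 else 0.0
    soc_flag = 1.0 if soc < soc_min + 0.02 else 0.0
    eps_v = relu(0.95 - v_pcc) + relu(v_pcc - 1.05)
    return (alpha * u_thr
            - beta * tracking
            - gamma * relu(temp_c - t_warn)
            - d_soc * soc_flag
            - d_f * relu(abs(freq_dev_hz) - 0.2)
            - d_v * eps_v
            - d_q * backlog_kw / p_flex_max_kw)

# 80% throttle with a 40% normalised tracking error and everything else nominal
nominal = fast_reward(u_thr=0.8, dp_demand_kw=12_000.0, dp_actual_kw=0.0,
                      committed_mw=30.0, temp_c=30.0, soc=0.5,
                      freq_dev_hz=0.0, v_pcc=1.0, backlog_kw=0.0)
# Same tick, but 1 °C above the warning threshold
hot = fast_reward(u_thr=0.8, dp_demand_kw=12_000.0, dp_actual_kw=0.0,
                  committed_mw=30.0, temp_c=34.0, soc=0.5,
                  freq_dev_hz=0.0, v_pcc=1.0, backlog_kw=0.0)
```

In the `nominal` case the throughput bonus (+0.8) and the tracking penalty (-0.8) cancel; the `hot` case differs only by the thermal term, which subtracts a further 5.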

Term-by-term breakdown

#	Term	Coefficient	What it measures	Why it matters
1	Throughput	$\alpha = 1.0$	Fraction of max IT capacity actually committed ($u_{\text{thr}} \in [0,1]$)	Maximising revenue: the agent earns more for serving more DVFS-managed batch workload
2	RegD tracking	$\beta = 2.0$	Normalised absolute error between the power change requested by the grid operator and what the DC actually delivered	The primary ancillary-service obligation; a unit of tracking error is weighted twice as heavily as a unit of throughput
3	Thermal overrun	$\gamma = 5.0$	Degrees above the warning threshold $T_{\text{warn}} = 33$ °C for the hotter of the two cooling zones	Linear ramp that starts 2 °C before the hard 35 °C trip; $\gamma$ is large enough to dominate at +1 °C overshoot
4	BESS SoC	$\delta_{\text{soc}} = 0.5$	Binary flag: 1 if the battery state of charge falls below $\text{SOC}_{\min} + 2\%$ (12%)	Flat per-tick penalty prevents the BESS from being stranded near empty when a RegD ramp arrives
5	Frequency deviation	$\delta_f = 2.0$	Frequency excursion beyond the ±0.2 Hz NERC dead-band	Linear penalty that grows as the deviation approaches the ±0.5 Hz trip threshold
6	Voltage deviation	$\delta_v = 5.0$	Two-sided hinge penalty for PCC voltage outside [0.95, 1.05] pu	Voltage violations are fast and dangerous; the large coefficient forces early corrective action
7	SLA backlog	$\delta_q = 2.0$	FIFO queue depth normalised by peak flexible capacity $P_{\text{flex,max}}$	Deferred batch jobs accumulate in queue; this term penalises latency and incentivises draining the queue

Core tension: why the agent must balance throughput vs. tracking

Terms 1 and 2 are structurally opposed:

Higher throttle ($u_{\text{thr}} \uparrow$) → more revenue from IT (term 1 ↑) but increases the power baseline, making it harder to deliver a downward RegD ramp accurately (term 2 ↓).
Lower throttle ($u_{\text{thr}} \downarrow$) → improves tracking flexibility but sacrifices revenue and grows the backlog (term 7 ↓).

The optimal agent learns a lever hierarchy: use BESS charge/discharge first (zero-penalty, fast), then exploit thermal inertia of the cooling system (slow, cheap), and only fall back to DVFS throttling as a last resort. This mirrors real-world FERC-paid frequency regulation.

Coefficient scaling rationale

All coefficients are chosen so that terms land in the same numerical range under typical operation:

$\alpha = 1$ → throughput at $u_{\text{thr}} = 0.8$ contributes $+0.8$ per tick
$\beta = 2$ → a 40% normalised tracking error contributes $-0.8$ per tick
$\gamma = 5$ → 1 °C overshoot contributes $-5$ per tick, dominating immediately
$\delta_v = 5$ → a 0.05 pu excursion outside the voltage band (e.g. $v_{\text{pcc}} = 0.90$) contributes $-0.25$ per tick, comparable in scale to the throughput and tracking terms

Tracking loop

The RegD tracking error is computed as:

$$\Delta P_{\text{actual}} = P_{\text{flex,served}} + P_{\text{BESS,actual}}$$

where $P_{\text{flex,served}} = \min\!\left(Q_{\text{backlog}},\ P_{\text{flex,max}} \times u_{\text{thr}}\right)$ is the batch work actually served from the FIFO queue this tick, and $P_{\text{BESS,actual}}$ is the net BESS power after battery dynamics.
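
The served-work expression corresponds to a simple queue update per tick. The sketch below is an illustration under the assumption that new arrivals join the queue before service in the same tick; the function name and the example numbers are hypothetical.

```python
def queue_step(backlog_kw: float, arrivals_kw: float, u_thr: float,
               p_flex_max_kw: float) -> tuple[float, float]:
    """One tick: admit arrivals, serve up to the throttled capacity, carry the rest."""
    queued = backlog_kw + arrivals_kw
    served = min(queued, p_flex_max_kw * u_thr)
    return served, queued - served

# Tick 1: 60 MW of work arrives while the throttle caps service at 45 MW
served_1, backlog_1 = queue_step(0.0, 60_000.0, u_thr=0.5, p_flex_max_kw=90_000.0)
# Tick 2: no new arrivals, so the deferred 15 MW is served and the queue empties
served_2, backlog_2 = queue_step(backlog_1, 0.0, u_thr=0.5, p_flex_max_kw=90_000.0)
```

Work is conserved across ticks: everything that arrives is eventually served, only later, which is why the backlog term in the reward penalises delay rather than dropped jobs.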

Cumulative reward scale (per 24-hour episode)

Agent	Typical range	Notes
Random policy	−15,000 to −5,000	Frequent thermal & voltage trips
Rule-based (threshold control)	−2,000 to +500	No backlog awareness
PPO (trained, 5 M steps)	+2,000 to +5,000	Learns lever hierarchy
Adversarial scenario C	−5,000 to −1,000	High ambient temp + price spike

Termination (episode ends immediately):

Thermal fault: $T_A > 35°$C or $T_B > 35°$C
Frequency fault: $|f - f_{\text{nom}}| > 0.5$ Hz (UFLS / over-frequency trip)
Voltage fault: $v_{\text{pcc}} < 0.90$ pu (under-voltage relay)

Episode truncates at 17,280 ticks (24 hours at 5 s).

6. Physics Engines

C2G-Bench exposes exactly two Gymnasium environments, C2GFastEnv and C2GMacroEnv, both registered so that they can be created with gym.make(). The components described below are not environments: the six physics engines are internal simulation components with no reset()/step() methods and no observation_space/action_space attributes. They are called exclusively by the two environments and are never exposed to an RL agent directly. To interact with a physics engine in isolation (e.g. for unit testing or analysis), instantiate it directly from c2g_env.physics.*.

Six independent physics/data modules, all with exact-exponential or analytical solutions (unconditionally stable):

Simulator	File	Description
Workload Orchestrator	`workload.py`	Fuses Alibaba batch (2023), DLRM (2025), and GenAI (2026) traces into P_base + P_flex at 5-min resolution. FIFO queue model: unserved batch work defers rather than drops; exposes `backlog_kw` and `avg_delay_steps` (Little's Law) per step
Thermal Twin	`thermal.py`	Exact exponential ODE integration for dual-zone cooling (Zone A: HPE Cray EX liquid, Zone B: HPE ProLiant air)
Electrical Chain	`electrical.py`	Non-linear UPS/PDU/XFMR loss curves + PUE calculation
BESS	`bess.py`	15 MWh / 5 MW Li-ion NMC (pure-Python backend + optional PySAM) with C-rate η, SOC derating, capacity fade
Macro-Grid	`macro_grid.py`	AR(1) RegD signal + LMP proxy; calibrated for 6 global markets
Weather	`weather.py`	NOAA ISD-Lite real data or calibrated synthetic (6 climate profiles)
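
To illustrate what exact-exponential integration means in practice, the sketch below applies the closed-form solution of a first-order thermal ODE over one tick instead of an Euler step. The single lumped zone, the equilibrium temperature `temp_eq_c`, and the function name are assumptions for the example; only the time constant (about 12.7 min) and the 5-second tick come from the text.

```python
import math

def thermal_step(temp_c: float, temp_eq_c: float, tau_s: float, dt_s: float) -> float:
    """Closed-form solution of dT/dt = (T_eq - T) / tau over one step of length dt."""
    decay = math.exp(-dt_s / tau_s)
    return temp_eq_c + (temp_c - temp_eq_c) * decay

tau_s = 12.7 * 60.0   # liquid-loop time constant in seconds
temp = 28.0           # starting temperature in °C (illustrative)

# Advance roughly one time constant in 5-second ticks toward a 34 °C equilibrium
for _ in range(int(tau_s / 5.0)):
    temp = thermal_step(temp, temp_eq_c=34.0, tau_s=tau_s, dt_s=5.0)
```

Because the decay factor always lies between 0 and 1, the update never overshoots the equilibrium however large the step, which is the sense in which the scheme is unconditionally stable. Two 5-second steps also give the same result as one 10-second step.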

7. Data

Real Datasets

Dataset	Source	Markets/Zones	Resolution	Files
Workload traces	Alibaba cluster traces	batch, DLRM, GenAI, spot	5-min	4 CSVs
Energy load	EIA, SMARD.de, AEMO	NYISO (11 zones), PJM, CAISO, ERCOT, ENTSO-E DE, AEMO NSW	5-min (resampled)	16 CSVs
Weather	NOAA ISD-Lite	NYC, DCA, SJC, DFW, FRA, BKT	Hourly	7 CSVs

7.1. Workload Traces — Deep Dive

The benchmark fuses three Alibaba production traces to model the IT load of the 250 MW facility. Each trace has a distinct statistical character, hardware zone assignment, and role in the control problem.

Trace Summary

File	Source	Duration	Zone	Role	Controllable?
`batch_v2023.csv`	Alibaba GPU v2023 (`openb_pod_list_default.csv`)	33 days	A (GPU liquid-cooled)	`P_flex` — schedulable batch jobs	✅ DVFS throttle `action[0]` defers work into FIFO queue
`dlrm_v2025.csv`	Alibaba GPU v2025 (`disaggregated_DLRM_trace.csv`)	30 days	B (CPU air-cooled)	`P_base` — rigid DLRM inference serving	❌ Must be served regardless of grid state
`genai_v2026.csv`	Alibaba v2026 GenAI (`qps.csv` and `pod_gpu_duty_cycle_anon.csv`)	1 day (tiled)	A (GPU liquid-cooled)	`P_base` — rigid GenAI inference, spike-prone	❌ Must be served; spikes set `obs[9]=1`

spot_v2026.csv is bundled but excluded from the current release — it requires an arrival-based preemptible scheduler not yet implemented.

To reproduce these processed CSVs from the raw Alibaba data, see preprocessing/workload_traces/. All three files are loaded at startup by c2g_env.physics.workload.WorkloadSimulator.

Power Model

All three utilisation signals are translated to rack-level electrical power via the non-linear server power model (Fan et al., ISCA 2007):

$$P_{server}(u) = N_{racks} \times \bigl[ P_{idle} + (P_{max} - P_{idle}) \cdot u^{\alpha} \bigr]$$

Stream	Racks	$P_{idle}$	$P_{max}$	$\alpha$	Utilisation normaliser
Batch (Zone A flex)	1 200	8 kW/rack	25 kW/rack	1.4 (GPU superlinear)	`gpu_milli_request / 12 620`
GenAI (Zone A base)	800	8 kW/rack	25 kW/rack	1.4	`avg_gpu_duty_cycle / 100`
DLRM (Zone B base)	2 500	4 kW/rack	16 kW/rack	1.2 (CPU inference)	`active_gpu_count / 227`

Resulting power envelope (30-day mean at default scenario):

Stream	Mean power	Max power	Share of total IT
DLRM P_base (Zone B)	~21.7 MW	40.0 MW	56%
Batch P_flex (Zone A)	~10.1 MW	30.0 MW	26%
GenAI P_base (Zone A)	~6.8 MW	8.3 MW	18%
Total IT	~38.5 MW	—	100%

74% of IT power is rigid (P_base) — the agent's primary controllable lever is batch throttling which covers only the remaining 26%.
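
The power model and the per-stream parameters above can be checked with a few lines. The function name is hypothetical; the rack counts, idle and peak power, and exponents are taken from the parameter table.

```python
def stream_power_mw(u: float, n_racks: int, p_idle_kw: float,
                    p_max_kw: float, alpha: float) -> float:
    """Total stream power in MW at utilisation u, using the non-linear server model."""
    u = min(max(u, 0.0), 1.0)  # utilisation is a fraction in [0, 1]
    per_rack_kw = p_idle_kw + (p_max_kw - p_idle_kw) * u ** alpha
    return n_racks * per_rack_kw / 1000.0

batch_max = stream_power_mw(1.0, 1200, 8.0, 25.0, 1.4)     # batch at full utilisation
dlrm_max = stream_power_mw(1.0, 2500, 4.0, 16.0, 1.2)      # DLRM at full utilisation
genai_peak = stream_power_mw(0.244, 800, 8.0, 25.0, 1.4)   # GenAI at its 24.4% peak duty
```

With these parameters the three values come out at 30 MW, 40 MW, and about 8.3 MW, matching the Max power column of the envelope table.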

Trace Characteristics

batch_v2023.csv — Schedulable Batch (P_flex)

Column: gpu_milli_request (sum of GPU milli-cores requested per 5-min tick)
Statistics: 78% of ticks have zero arrivals; mean utilisation ≈ 0.043; max = 12 620 gpu-milli
Nature: Highly bursty. Jobs arrive sporadically with durations from 1–2 825 ticks (5 min to 9.8 days). Unserved work accumulates in a FIFO queue (tracked as backlog_kw and avg_delay_steps via Little's Law).
Agent implication: DVFS throttle (action[0]) directly gates the batch service rate. Throttling below 1.0 reduces thermal load and peak grid draw at the cost of growing backlog. Reward term 1 (throughput) penalises low action[0].

Utilisation distributions: batch is 78% zero (bursty); DLRM is near-Gaussian (always-on); GenAI is multimodal low-duty.

dlrm_v2025.csv — DLRM Inference (P_base, Zone B)

Columns: active_gpu_count, active_cpu_cores, active_mem_gib
Statistics: Always non-zero (min=1 GPU); mean ≈ 101 GPUs; near-Gaussian distribution with a clear two-shift diurnal pattern.
Nature: Continuous, predictable. DLRM (Deep Learning Recommendation Model) serving is the backbone of Zone B — it never drops below idle power. The 30-day trace captures weekday/weekend cycling clearly.
Agent implication: Contributes the largest fixed baseload (~21.7 MW). The only thermal handle for Zone B is HVAC effort (action[2]); DLRM itself cannot be throttled.

genai_v2026.csv — GenAI Serving (P_base, Zone A)

Columns: total_qps, avg_gpu_duty_cycle, active_genai_pods
Statistics: 288 ticks (1 day) tiled cyclically; duty cycle mean ≈ 6.7%, max 24.4%; spike rate ≈ 25%
Nature: Multimodal — most time near-idle, with sharp afternoon QPS bursts. Ticks where avg_gpu_duty_cycle > P75 = 12.19% are flagged as spikes (obs[9] = 1). GenAI runs on the same Zone A GPU racks as batch but with strict SLA priority.
Agent implication: Spikes increase Zone A temperature rapidly (liquid cooling response time τ ≈ 13 min). The safety shield terminates episodes if T_A > 35°C. During spikes the agent must reduce batch load (action[0]) and possibly increase pump speed (action[1]) to prevent thermal fault.
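
The spike flag described above is a threshold at the 75th percentile of the duty cycle. A minimal sketch, using a synthetic ramp in place of the real trace and a hypothetical function name:

```python
import statistics

def flag_spikes(duty_cycle: list[float]) -> tuple[float, list[int]]:
    """Return the P75 threshold and a 0/1 spike flag for every tick."""
    threshold = statistics.quantiles(duty_cycle, n=4)[2]  # third quartile (P75)
    return threshold, [1 if d > threshold else 0 for d in duty_cycle]

# Synthetic stand-in for one day of 5-minute ticks (288 values), not real trace data
synthetic_day = [float(i) for i in range(288)]
threshold, flags = flag_spikes(synthetic_day)
spike_rate = sum(flags) / len(flags)
```

By construction a P75 threshold flags about a quarter of the ticks, which is consistent with the ≈ 25% spike rate quoted for the trace.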

Left: GenAI duty cycle with spike threshold (red dashes) and spike ticks (red dots). Right: spike probability peaks in afternoon hours.

IT Power Breakdown

Stacked IT power (MW) over 30 days and 1-week zoom. The DLRM base (orange) dominates; batch flex (blue) provides the agent's only demand-side handle.

Temporal Correlation

ACF up to 24 hours. DLRM is highly persistent (slow decay with 24-hour periodicity). Batch decorrelates fastest — it is the hardest to predict. GenAI reveals its 1-day tile boundary.

The DLRM trace has the highest autocorrelation (predictable → MPC/rule-based works well for Zone B). Batch is the most volatile (decorrelates within ~2 hours), making it the prime target for RL.

Batch Queue Dynamics

Simulated backlog over 7 days at three throttle levels. At 50% throttle the queue stabilises near zero because the mean arrival rate is well within half-capacity. An over-throttled agent (throttle = 0.3) can serve at most 9 MW against a 10.1 MW mean arrival rate, so its queue keeps growing and accumulates ~10e3 MW equivalent backlog in 7 days.

This reveals a key benchmark insight: the batch queue is stable under mild throttle (≥ 40%) because the mean arrival rate (10.1 MW) is only 34% of full capacity (30 MW). The agent does not need to fully commit compute to clear the queue; it has real headroom to throttle for grid regulation.
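
As a worked check of this threshold, treat the queue as stable whenever the throttled service capacity exceeds the mean arrival rate. This is a simplification that ignores burstiness, so it gives a necessary condition on the mean rather than a bound on the backlog:

$$u_{\text{thr}} \times 30\ \text{MW} > 10.1\ \text{MW} \iff u_{\text{thr}} > \frac{10.1}{30} \approx 0.337$$

At $u_{\text{thr}} = 0.4$ the service capacity is 12 MW, leaving a 1.9 MW margin over mean arrivals. At $u_{\text{thr}} = 0.3$ it is 9 MW, a 1.1 MW shortfall, so the backlog grows on average.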

See notebooks/11_workload_deep_dive.ipynb for full interactive analysis.

6 Global Energy Markets

Market Key	Region	Grid Operator	Energy Source	Weather Station
`nyiso_nyc`	New York City	NYISO	NYISO OASIS	NYC (Central Park)
`pjm_dom`	Northern Virginia	PJM	EIA API	DCA (Reagan Natl)
`caiso_pgae`	Bay Area / San Jose	CAISO	EIA API	SJC (Mineta Intl)
`ercot_north`	Dallas–Fort Worth	ERCOT	EIA API	DFW (DFW Intl)
`entso_de`	Frankfurt, Germany	ENTSO-E / EPEX	SMARD.de	FRA (Frankfurt)
`aemo_nsw`	Sydney, Australia	AEMO / NEM	AEMO CSVs	BKT (Bankstown)

8. Evaluation Scenarios

C2G-Bench ships four progressively harder 24-hour scenarios (17,280 ticks at 5 s each). Every scenario is fully deterministic when a fixed seed is set and can be combined with any of the six energy markets via a single Hydra override.

# Run any scenario × any market
uv run python baselines/train_ppo.py scenario=scenario_b market=ercot_north

8.1. Scene-setting: shared physics

All scenarios share the same underlying simulator stack and reward weights:

Parameter	Value	Meaning
Episode length	17,280 ticks	24 h × 3,600 s h⁻¹ ÷ 5 s tick⁻¹
IT capacity	250 MW	Rigid (GenAI/DLRM) + flexible (Alibaba batch)
BESS	15 MWh / 5 MW	NMC Li-ion, C-rate derating + capacity fade
Cooling zones	Zone A (liquid, HPE Cray EX) · Zone B (air, HPE ProLiant)
$T_{\text{safe}}$	35 °C	Silicon hard limit → immediate termination
$T_{\text{warn}}$	33 °C	Soft threshold → thermal penalty begins
Frequency UFLS	±0.5 Hz	Under/over-frequency relay → termination
Voltage UV relay	0.90 pu	Under-voltage → termination

8.2. `default` — Baseline Operations

"Can the agent learn to coordinate four physical levers under normal grid conditions?"

The entry-level scenario. Ambient temperature is comfortable (25 °C, NYISO NYC summer), BESS starts at 50 % SOC, and the regulation signal has standard amplitude. No faults are injected. This is the recommended starting point for algorithm development and ablation studies.

Parameter	Value
Market	NYISO NYC
Ambient $T_{\text{amb}}$	25 °C (weather-driven)
Committed MW (max)	30 MW
BESS SOC₀	50 %
GenAI spike scale	1.0× (nominal)
Grid stress scale	1.0× (nominal)
Cooling fault	None

Primary challenge: Learning the basic DVFS ↔ cooling ↔ BESS synergy to track the regulation signal while keeping temperatures below $T_{\text{warn}}$.

Termination risk: Low. An untrained random agent survives ≈ 40 % of the episode on average.

8.3. `scenario_a` — GenAI Crisis

"A viral model launch + a grid under-frequency event hit simultaneously. The agent must shed flexible load without starving the BESS."

This scenario models a Northern Virginia (PJM DOM) summer day when a new GPT-class model goes viral. GenAI serving load spikes to 1.8× nominal, consuming headroom that the agent would otherwise use for regulation. At the same time, the grid issues a sustained under-frequency signal, demanding active discharge. The agent must resolve the conflict between IT throughput and grid support.

Parameter	Value
Market	PJM DOM
Ambient $T_{\text{amb}}$	30 °C (static)
Committed MW (max)	40 MW
BESS SOC₀	55 %
GenAI spike scale	1.8×
Grid stress scale	1.5×
Cooling fault	None

Primary challenge: IT vs. grid conflict. The GenAI rigid load is non-throttleable, so the agent must use BESS discharge and batch-job throttling simultaneously — but throttling reduces throughput reward $\alpha \cdot u_{\text{thr}}$, and over-discharging depletes the BESS.

Termination risk: Medium–High. Frequency faults are likely if the agent ignores the regulation signal. Thermal faults are possible if cooling is under-prioritised during spikes.

8.4. `scenario_b` — Thermal Squeeze

"Dallas in August: 40 °C ambient, a 60 MW commitment, and a cooling system pushed to its physical limits."

This scenario targets ERCOT North (DFW) during a peak-summer heat wave. The 40 °C ambient temperature drives the cooling COP down by ≈ 30 %, meaning the pump must work harder to achieve the same heat rejection. The committed MW is raised to 60 MW, increasing the power swings the agent must track. GenAI load is nominal, but the thermal margin to $T_{\text{safe}}$ is extremely thin.

Parameter	Value
Market	ERCOT North
Ambient $T_{\text{amb}}$	40 °C (static)
Committed MW (max)	60 MW
BESS SOC₀	60 %
GenAI spike scale	1.0× (nominal)
Grid stress scale	1.3×
Cooling fault	None

Primary challenge: Thermal constraint binding. The thermal penalty $\gamma \cdot (T - T_{\text{warn}})^{+}$ dominates the reward signal. The agent must learn aggressive pump-speed scheduling and accept reduced throughput to keep temperatures in the safe band.

Termination risk: Very High. A naive agent that ignores the pump lever will hit $T_{\text{safe}} = 35$ °C within the first hour. This scenario is the primary driver of thermal-safety research.
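The hinge shape of the thermal penalty can be illustrated with a short sketch. The thresholds (T_warn = 33 °C from the output-metrics table, T_safe = 35 °C) come from this document; the function names and the default weight `gamma = 1.0` are assumptions made for illustration and are not taken from the environment code.

```python
T_WARN_C = 33.0   # warning threshold used by thermal_viol_rate
T_SAFE_C = 35.0   # hard thermal limit (constraints C1/C2)


def thermal_penalty(temp_c: float, gamma: float = 1.0) -> float:
    """Hinge penalty gamma * (T - T_warn)^+; zero at or below T_warn."""
    return gamma * max(0.0, temp_c - T_WARN_C)


def is_thermal_fault(temp_c: float) -> bool:
    """True once the zone temperature reaches the hard limit."""
    return temp_c >= T_SAFE_C


for temp in (30.0, 33.5, 35.0):
    print(temp, thermal_penalty(temp), is_thermal_fault(temp))
```

The penalty grows linearly only inside the 2 °C band between the warning and safe thresholds, which is why an agent in this scenario has little room to react once the penalty becomes non-zero.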

8.5. `scenario_c` — Battery Drain

"Western Sydney summer: the BESS starts nearly empty, the pump is failing, and the grid is stressed."

This scenario represents a compounding failure in AEMO NSW. The BESS begins at only 15 % SOC (near the 10 % hard floor), leaving almost no discharge capacity for regulation. A simulated CDU pump degradation reduces cooling efficiency to 60 % of nominal, tightening the thermal margin. GenAI and grid stress are both elevated. The agent must simultaneously ration the BESS, compensate for degraded cooling, and track the regulation signal — with essentially no buffer.

Parameter	Value
Market	AEMO NSW
Ambient $T_{\text{amb}}$	32 °C (static)
Committed MW (max)	40 MW
BESS SOC₀	15 %
GenAI spike scale	1.2×
Grid stress scale	1.2×
Cooling fault	Pump degradation (60 % efficiency)

Primary challenge: Resource scarcity under compound failure. The BESS SOC penalty $\delta_{\text{soc}}$ activates immediately. The agent must switch to DVFS-only regulation while the pump fault is active, and carefully trickle-charge the BESS when the regulation signal allows.

Termination risk: Extreme. This is the hardest scenario in the benchmark. A random agent terminates within ≈ 5 % of the episode on average.
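A rough back-of-the-envelope sketch shows how little buffer the 15 % starting SOC leaves. It uses the 15 MWh pack size listed for `bess.py` in the repository structure and the 10 % hard floor. The constant 40 MW BESS-only discharge is a purely illustrative assumption (the BESS power rating is not stated here), and conversion losses are ignored.

```python
BESS_CAPACITY_MWH = 15.0   # pack size listed for physics/bess.py
SOC_FLOOR = 0.10           # hard floor (constraint C3)


def usable_energy_mwh(soc: float, soc_floor: float = SOC_FLOOR,
                      capacity_mwh: float = BESS_CAPACITY_MWH) -> float:
    """Energy available above the SOC floor, ignoring conversion losses."""
    return max(0.0, soc - soc_floor) * capacity_mwh


def seconds_to_floor(soc: float, discharge_mw: float) -> float:
    """Time until the floor is reached at a constant discharge power."""
    if discharge_mw <= 0:
        raise ValueError("discharge_mw must be positive")
    return usable_energy_mwh(soc) / discharge_mw * 3600.0


# Starting SOC values from the three scenario tables.
for name, soc0 in [("scenario_a", 0.55), ("scenario_b", 0.60), ("scenario_c", 0.15)]:
    print(f"{name}: {usable_energy_mwh(soc0):.2f} MWh usable, "
          f"{seconds_to_floor(soc0, 40.0):.0f} s at a hypothetical 40 MW")
```

Under these assumptions Battery Drain starts with about 0.75 MWh above the floor, which a 40 MW discharge would exhaust in roughly 68 seconds.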

8.6. Scenario × Market grid

All four scenarios can be combined with all six markets, yielding 24 distinct evaluation configurations. Market selection changes the LMP profile, weather driver, and grid-stress statistics, while scenario selection changes the hardware stress and initial conditions:

	`nyiso_nyc`	`pjm_dom`	`ercot_north`	`aemo_nsw`
`default`	★ default
`scenario_a`		★ default
`scenario_b`			★ default
`scenario_c`				★ default

★ = default market for that scenario. Any other cell is a valid cross-market stress test.

# Example: Thermal Squeeze under European low-carbon prices
uv run python baselines/train_ppo.py scenario=scenario_b market=entso_de experiment.seed=1
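The 24 configurations can be enumerated with a few lines of Python. The scenario and market names are the ones under `conf/scenario/` and `conf/market/`, and the override string follows the command above; the helper name `hydra_overrides` is hypothetical.

```python
from itertools import product

SCENARIOS = ["default", "scenario_a", "scenario_b", "scenario_c"]
MARKETS = ["nyiso_nyc", "pjm_dom", "caiso_pgae",
           "ercot_north", "entso_de", "aemo_nsw"]
DEFAULT_MARKET = {
    "default": "nyiso_nyc",
    "scenario_a": "pjm_dom",
    "scenario_b": "ercot_north",
    "scenario_c": "aemo_nsw",
}


def hydra_overrides(scenario: str, market: str, seed: int) -> str:
    """Build the override string passed to a training script."""
    return f"scenario={scenario} market={market} experiment.seed={seed}"


configs = list(product(SCENARIOS, MARKETS))
cross_market = [(s, m) for s, m in configs if DEFAULT_MARKET[s] != m]

print(len(configs), "configurations,", len(cross_market), "cross-market stress tests")
print(hydra_overrides("scenario_b", "entso_de", 1))
```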

9. Repository Structure

C2G-Macro/
├── pyproject.toml                       # uv/hatchling build + all dependencies
├── uv.lock                              # Reproducible dependency lock
├── README.md
│
├── c2g_env/                             # The Core RL Environment
│   ├── __init__.py                      # Exports C2GFastEnv, C2GMacroEnv
│   ├── env_low_level.py                 # 5 s physics step — C2GFastEnv (18-D obs, 4-D act)
│   ├── env_high_level.py                # 15-min market step — C2GMacroEnv (19-D obs, 2-D act)
│   ├── ENVIRONMENTS.md                  # 📖 Full environment & simulator reference (equations, params)
│   ├── config.yaml                      # Centralised env configuration
│   ├── experiments/
│   │   ├── __init__.py                  # Exports ActionAblationFastEnv
│   │   └── action_ablation_env.py       # C2GFastEnv subclass for action-level ablation studies
│   └── physics/
│       ├── workload.py                  # Alibaba trace fusion (batch/DLRM/GenAI)
│       ├── thermal.py                   # Exact-exponential ODEs, dual-zone cooling
│       ├── electrical.py                # Non-linear UPS/PDU/XFMR loss + PUE
│       ├── bess.py                      # 15 MWh NMC BESS (pure-Python + PySAM)
│       ├── macro_grid.py                # AR(1) RegD + LMP proxy, 6 market presets
│       └── weather.py                   # NOAA ISD real data + synthetic climate, 6 presets
│
├── data/
│   └── processed/
│       ├── workload_traces/             # batch_v2023, dlrm_v2025, genai_v2026, spot_v2026
│       ├── energy/                      # 16 CSVs: 11 NYISO zones + PJM/CAISO/ERCOT/ENTSO-E/AEMO
│       └── weather/                     # 7 station CSVs: NYC, DCA, SJC, DFW, FRA, BKT, LONGIL + merged
│
├── conf/                                # Hydra configuration tree
│   ├── config.yaml                      # Top-level defaults (scenario, algo, market, logging)
│   ├── algo/                            # 19 algo configs: ppo, sac, ppo_macro, sac_macro, cpo,
│   │                                    #   ppo_lagrangian, cbf_ppo, hj_ppo, mpcsf_ppo, ha_c2g,
│   │                                    #   cbm_only, cbm_gate, cbm_shield, rule_macro_ppo, pid,
│   │                                    #   mpc_fast, mpc_macro, milp, shield_reward_shaping
│   ├── scenario/                        # default, scenario_a, scenario_b, scenario_c
│   ├── market/                          # nyiso_nyc, pjm_dom, caiso_pgae, ercot_north, entso_de, aemo_nsw
│   └── logging/                         # tensorboard.yaml
│
├── baselines/                           # NeurIPS Evaluation Agents
│   ├── _hydra_compat.py                 # Hydra 1.3.x compatibility patch for Python ≥ 3.14
│   ├── metrics_callback.py              # C2GMetricsCallback — per-episode CSV + TensorBoard
│   │
│   │  # ── Classical Controllers ───────────────────────────────────────────
│   ├── rule_based_mpc.py                # Threshold controller for C2GFastEnv (SB3-compatible)
│   ├── rule_based_macro.py              # Macro-level rule-based controller for C2GMacroEnv
│   ├── bang_bang.py                     # Bang-bang / hysteresis controller (floor baseline)
│   ├── pid_controller.py                # Multi-loop PID controller with anti-windup
│   │
│   │  # ── RL Training Scripts ─────────────────────────────────────────────
│   ├── train_sac.py                     # SB3 SAC (off-policy, auto entropy)
│   ├── train_hierarchical.py            # Two-phase sequential HRL pipeline (PPO inner)
│   ├── train_hierarchical_sac.py        # Two-phase HRL with SAC inner policy
│   ├── train_rule_macro_sac.py          # Rule-based macro + SAC inner policy
│   ├── train_lowsac_highrandom.py       # SAC lower + random macro (ablation)
│   ├── train_llm_agents.py              # LLM-guided agent training
│   │
│   └── safety/                          # HA safety methods + shielded training scripts (see §11)
│
├── evaluation/                          # Benchmark auditing & analysis
│   ├── run_benchmark.py                 # Standard benchmark: runs agents on all 4 scenarios
│   │                                    # Outputs: CSV with cumulative power metrics at
│   │                                    #   evaluation/results/{algo}_{scenario}_{agent_type}_{ablation}.csv
│   ├── run_ha_benchmark.py              # HA safety benchmark: 11-metric evaluation set
│   │                                    # Same cumulative power metrics as run_benchmark.py
│   ├── generate_plots.py                # Publication-ready PDF/PNG figures
│   ├── generate_ha_plots.py             # HA-specific: Pareto frontier, radar, violin plots
│   ├── plot_episode_traces.py           # Per-episode trace analysis with ablation filtering
│   ├── failure_analysis.py              # Failure-case categorisation for HA benchmark
│   └── statistical_analysis.py          # Bootstrap CIs, Welch's t-test, Cohen's d, LaTeX tables
│
├── scripts/                             # Data download & training utilities
│   ├── download_weather.py              # Open-Meteo ERA5 → 6 weather CSVs
│   ├── download_energy.py               # EIA + SMARD + AEMO → 5 energy CSVs
│   └── run_sweep.sh                     # Full training sweep (25 phases, ~270 jobs)
│
├── preprocessing/                       # Raw → processed data pipelines
│   ├── workload_traces/                 # process_v2023.py, process_v2025.py, process_v2026_genai.py
│   ├── energy/                          # process_energy.py (NYISO zone load)
│   └── weather/                         # download_noaa_isd.py
│
├── notebooks/                           # 11 Jupyter notebooks for exploration & visualisation
│   ├── 01_workload.ipynb                # Alibaba trace analysis
│   ├── 02_thermal.ipynb                 # Thermal model step response & steady-state
│   ├── 03_electrical_bess.ipynb         # Electrical chain + BESS cycling
│   ├── 04_macro_grid.ipynb              # RegD signal + LMP proxy
│   ├── 05_environments.ipynb            # Gym API demo, scenario comparison
│   ├── 06_weather.ipynb                 # Weather data: 6 markets, real vs. synthetic
│   ├── 07_energy_markets.ipynb          # Energy load: 6 markets, LDC, diurnal patterns
│   ├── 08_frequency_voltage.ipynb       # Grid frequency & PCC voltage safety signals
│   ├── 09_evaluation_scenarios.ipynb    # Scenario deep dive: params, rollouts, risk, reward
│   ├── 10_baselines_visualization.ipynb # Baseline agent comparison & visualisation
│   └── 11_workload_deep_dive.ipynb      # Workload queue dynamics & trace statistics
│
├── tests/                               # 531 tests (pytest)
│   ├── test_workload.py                 # 24 tests
│   ├── test_thermal.py                  # 32 tests
│   ├── test_electrical.py               # 27 tests
│   ├── test_macro_grid.py               # 30 tests
│   ├── test_weather.py                  # 23 tests
│   ├── test_gym_api.py                  # 72 tests (API compliance both envs)
│   ├── test_baselines.py                # 18 tests
│   ├── test_new_baselines.py            # 50 tests (classical + gradient-free baselines)
│   ├── test_frequency_voltage.py        # 31 tests (freq/voltage safety signals)
│   ├── test_hierarchical.py             # 22 tests (HRL, macro agents)
│   ├── test_safety_shield.py            # 24 tests (Simplex shield, wrappers)
│   ├── test_ha_safety.py                # 70 tests (3-tier HA safety methods)
│   ├── test_critical_bug_fixes.py       # 50 tests (regression tests)
│   ├── test_ablation.py                 # 18 tests (action ablation env)
│   ├── test_readme_smoke.py             # 13 tests (README code snippet validation)
│   ├── test_datalogging.py              # 7 tests (transition logging schema + 5 smoke tests)
│                                        #   Hardware vs macro column validation
│                                        #   5 CLI smoke tests: rule_macro, rule_based,
│                                        #   rule_based+BESS_ablation, ha_rule_based (variants)
│
└── figures/                             # Root-level figures (TensorBoard screenshot, etc.)

10. Quick Start

Prerequisites

Python 3.11 (exact; ==3.11.* in pyproject.toml)
uv — fast Python package manager

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone & install

git clone <repo-url>
cd C2G-Macro
uv sync
uv sync --extra dev   # pytest, ruff, mypy

Run the tests

uv run pytest tests/ -q
# 531 passed

Train a single agent

# PPO — default scenario, 300k steps
uv run python baselines/train_ppo.py

# PPO — GenAI Crisis + PJM market
uv run python baselines/train_ppo.py scenario=scenario_a market=pjm_dom

# SAC — Thermal Squeeze
uv run python baselines/train_sac.py algo=sac scenario=scenario_b

# Hydra multirun — all scenarios × 3 seeds
uv run python baselines/train_ppo.py --multirun \
    scenario=default,scenario_a,scenario_b,scenario_c \
    experiment.seed=1,2,3

# Hierarchical RL — sequential two-phase pipeline
uv run python baselines/train_hierarchical.py

# Safety-shielded PPO (provable constraint satisfaction)
uv run python baselines/safety/train_shielded_ppo.py scenario=default

# Constrained RL — PPO-Lagrangian
uv run python baselines/safety/train_ppo_lagrangian.py scenario=default

# CPO — Constrained Policy Optimization
uv run python baselines/safety/train_cpo.py scenario=default

# CBF-shielded PPO (QP-based action projection)
uv run python baselines/safety/train_cbf_ppo.py scenario=default

# Full HA-C2G neuro-symbolic 3-layer architecture
uv run python baselines/safety/train_ha_c2g.py scenario=default

Run the full benchmark sweep

# Dry-run first (prints all ~270 jobs without executing anything):
bash scripts/run_sweep.sh --dry-run

# Full sweep (default: 4 parallel jobs):
bash scripts/run_sweep.sh

# Use more parallelism (208 cores available — 16 is safe):
MAX_PARALLEL=16 bash scripts/run_sweep.sh

The sweep runs in 25 phases:

Phase	Jobs	What runs
1	24	Rule-Based + Random evaluation only (no training, ~5 min)
2	12	PPO training (300k steps) + evaluation
3	12	SAC training (200k steps) + evaluation
4	12	Macro Rule-Based evaluation
5	12	PPO-Macro training (100k steps) + evaluation
6	12	HRL sequential training (300k + 100k) + evaluation
7	36	Bang-Bang, PID, MPC evaluation (no training)
8	24	MPC-Macro & MILP evaluation (no training)
9	12	PPO-Lagrangian training (300k) + evaluation
10	12	CBF-PPO training (300k) + evaluation
11	12	HJ-PPO training (300k) + evaluation
12	12	MPC-SF-PPO training (300k) + evaluation
13	12	CPO training (300k) + evaluation
14	12	Shield-Reward-Shaping training (300k) + evaluation
15	12	HA-C2G neuro-symbolic training (300k) + evaluation
16	12	CBM-Only ablation training (300k)
19	12	CBM+Gate ablation training (300k)
20	12	CBM+Shield ablation training (300k)
21	1	HA Benchmark evaluation (11 metrics, 5 episodes)
22	1	Summary table + LaTeX rows
23	1	Multi-seed HA benchmark (10 seeds × 5 episodes)
24	1	Statistical analysis (CIs + significance tests)
25	1	Failure-case analysis

Results are written to results/sweep_results.csv (one row per run, upserted on re-runs) and results/sweep_summary.csv (mean ± std across seeds).

Run benchmark evaluation directly

Use the evaluation runners when you want targeted experiments instead of the full sweep. The --fixed-action setting allows granular control experiments by pinning selected actuators to analyst-chosen setpoints.

Unless an --output path is provided, results are saved by default to:

evaluation/results/{algo}_{scenario}_{agent_type}_{ablation}.csv

For example, ppo_scenario_b_hardware_BESS_0.5.csv stores the evaluation results of the hardware PPO agent on scenario_b under a fixed-BESS ablation. Here agent_type denotes the transition-logging and output suffix for the evaluated controller, and can be hardware, macro, or hardware_ha.
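A minimal sketch of how the default path could be assembled from its four parts. The helper name `default_output_path` is hypothetical, and how the runner formats the ablation suffix when no ablation is set is not shown here.

```python
from pathlib import Path

AGENT_TYPES = {"hardware", "macro", "hardware_ha"}


def default_output_path(algo: str, scenario: str, agent_type: str,
                        ablation: str, root: str = "evaluation/results") -> Path:
    """Compose {algo}_{scenario}_{agent_type}_{ablation}.csv under root."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unknown agent_type: {agent_type!r}")
    return Path(root) / f"{algo}_{scenario}_{agent_type}_{ablation}.csv"


print(default_output_path("ppo", "scenario_b", "hardware", "BESS_0.5").as_posix())
```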

Standard benchmark runner

Runs any combination of agents across all four evaluation scenarios and writes per-episode metrics to CSV.

# Classical hardware controllers (no trained models needed)
uv run evaluation/run_benchmark.py --agents rule_based bang_bang pid random

# SAC low-level agent (requires a trained model)
uv run evaluation/run_benchmark.py --agents sac --scenarios default scenario_b \
    --hw-model-dir trained_models/sac_default_s100

# Hierarchical combos: rule-based macro + hardware controller
uv run evaluation/run_benchmark.py --agents rule_macro+sac rule_macro+rule_based \
    rule_macro+pid rule_macro+bang_bang rule_macro+random \
    --hw-model-dir trained_models/sac_default_s100

# RL macro (Phase 2) + frozen SAC low-level
uv run evaluation/run_benchmark.py --agents sac_macro+sac \
    --macro-model-dir trained_models/sac_macro_default_s100 \
    --hw-model-dir trained_models/sac_default_s100

# LLM macro + hardware controller (requires a running vLLM server)
uv run evaluation/run_benchmark.py --agents llm_policy_macro+sac \
    --hw-model-dir trained_models/sac_default_s100 \
    --llm-api-base http://localhost:8000/v1

# With transition logging (per-step CSV traces)
uv run evaluation/run_benchmark.py --agents rule_macro+sac --record_transitions \
    --hw-model-dir trained_models/sac_default_s100

SAC agents automatically load the model from trained_models/<algo>_<scenario>_s<seed>/final_model.zip. Use --hw-model-dir or --macro-model-dir to override.

Agents used in the paper:

Agent	Type	Description
`random`	hardware	Uniform random baseline (lower bound)
`bang_bang`	hardware	Hysteresis on/off controller
`pid`	hardware	Multi-loop PID with anti-windup
`rule_based`	hardware	Threshold heuristic controller (`baselines/rule_based_mpc.py`)
`sac`	hardware	Trained SAC low-level controller (Phase 1)
`rule_macro`	macro	Rule-based macro bidding controller
`sac_macro`	macro	Trained SAC macro controller (Phase 2)
`llm_policy_macro`	macro	LLM macro controller (Qwen3-32B, ICRL)
`<macro>+<hardware>`	combo	Macro agent paired with hardware agent, e.g. `rule_macro+sac`, `llm_policy_macro+pid`

Fixed-action ablations (Appendix M):

Pin actuators to fixed setpoints to isolate each lever's contribution:

# Disable BESS (throttle + cooling only)
uv run evaluation/run_benchmark.py --agents rule_macro+sac \
    --fixed-action bess_dispatch=0.0 \
    --hw-model-dir trained_models/sac_default_s100

# Disable BESS and fix cooling (throttle only)
uv run evaluation/run_benchmark.py --agents rule_macro+sac \
    --fixed-action bess_dispatch=0.0 \
    --fixed-action pump_speed_A=0.7 \
    --fixed-action hvac_effort=0.7 \
    --hw-model-dir trained_models/sac_default_s100

Action bounds: throttle_batch ∈ [0, 1], pump_speed_A ∈ [0, 1], hvac_effort ∈ [0, 1], bess_dispatch ∈ [-1, 1]. Values are validated and clipped to these ranges.
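The validate-and-clip behaviour can be sketched as follows. The action names and bounds are the ones listed above; the function `parse_fixed_actions` is a hypothetical stand-in, not the runner's actual parser.

```python
ACTION_BOUNDS = {
    "throttle_batch": (0.0, 1.0),
    "pump_speed_A": (0.0, 1.0),
    "hvac_effort": (0.0, 1.0),
    "bess_dispatch": (-1.0, 1.0),
}


def parse_fixed_actions(specs: list[str]) -> dict[str, float]:
    """Turn repeated 'name=value' strings into a clipped setpoint dict."""
    fixed = {}
    for spec in specs:
        name, sep, raw = spec.partition("=")
        if not sep or name not in ACTION_BOUNDS:
            raise ValueError(f"bad --fixed-action spec: {spec!r}")
        low, high = ACTION_BOUNDS[name]
        # Clip the requested setpoint into the actuator's valid range.
        fixed[name] = min(high, max(low, float(raw)))
    return fixed


print(parse_fixed_actions(["bess_dispatch=0.0", "pump_speed_A=1.7"]))
```

An out-of-range request such as `pump_speed_A=1.7` is clipped to the upper bound, while an unknown actuator name raises an error in this sketch.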

Key CLI arguments:

Argument	Default	Description
`--agents`	`rule_based bang_bang pid random`	One or more agent names (see table above)
`--scenarios`	all 4	Subset of `default scenario_a scenario_b scenario_c`
`--n_episodes`	`5`	Episodes per agent × scenario combination
`--seed`	`100`	Starting RNG seed; episode `i` uses `seed + i`
`--hw-model-dir`	`None`	Model directory for the hardware/inner SAC agent
`--macro-model-dir`	`None`	Model directory for the macro-level SAC agent
`--output`	auto-generated	Output CSV path; defaults to `evaluation/results/<algo>_<scenario>_<agent_type>_<ablation>.csv`
`--record_transitions` / `--no-record_transitions`	disabled	Write per-step state/action/reward traces to `runs/<agent>_<scenario>_<type>/episode*.csv`
`--append`	`False`	Append rows to an existing CSV instead of overwriting
`--fixed-action <name>=<value>`	none	Pin an actuator to a fixed setpoint (repeatable)
`--llm-api-base`	`http://localhost:8000/v1`	vLLM / OpenAI-compatible server URL
`--llm-template-path`	`conf/chat_templates/run_benchmark_rbc+ICRL.yaml`	YAML prompt templates for LLM agents
`--llm-max-new-tokens`	`8192`	Maximum tokens per LLM generation step
`--llm-temperature`	`0.0`	Sampling temperature (0 = greedy)
`--llm-no-thinking`	off	Disable `<think>` reasoning blocks
`--llm-context-num-steps`	`10`	ICRL rolling buffer size in past steps (paper uses 5; 0 = disabled)
`--llm-context-stride`	`1`	Store every K-th step in the ICRL buffer
`--llm-icrl-mode`	`autonomous`	ICRL instruction mode: `autonomous` (paper default), `preset`, or `exploit`

Output metrics:

For hardware agents, each row in the output CSV contains:

Column	Description
`mean_reward`	Mean step reward over the episode
`total_reward`	Sum of step rewards
`tracking_rmse`	RMSE of `ΔP_demanded − ΔP_actual` (kW)
`thermal_viol_rate`	Fraction of ticks with temperature > T_warn (33 °C)
`throughput_ratio`	Mean `p_flex_served / p_flex_nom`
`bess_degradation`	Cumulative capacity fade × 10⁴
`episode_length`	Ticks completed (< 17 280 indicates early termination)
`survival_rate`	Fraction of episodes surviving to 24 h
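The following sketch shows how three of these columns could be computed from per-step traces. It assumes a 5 s tick (so 24 h is 17 280 ticks) and a single temperature series per episode; the function names are hypothetical and the benchmark's own implementation may differ, for example in how the two zones are combined.

```python
import math

TICK_S = 5
TICKS_PER_DAY = 24 * 3600 // TICK_S   # 17 280 ticks in a full episode


def tracking_rmse(demanded_kw: list[float], actual_kw: list[float]) -> float:
    """RMSE of the per-tick tracking error, in kW."""
    sq_errors = [(d - a) ** 2 for d, a in zip(demanded_kw, actual_kw)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def thermal_viol_rate(temps_c: list[float], t_warn_c: float = 33.0) -> float:
    """Fraction of ticks strictly above the warning threshold."""
    return sum(t > t_warn_c for t in temps_c) / len(temps_c)


def survived(episode_length: int) -> bool:
    """An episode survives if it reaches the full 24 h horizon."""
    return episode_length >= TICKS_PER_DAY


print(tracking_rmse([3.0, 4.0], [0.0, 0.0]))
print(thermal_viol_rate([32.0, 34.0, 33.0, 36.0]))
print(survived(12_000))
```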

For macro agents, additional columns include:

Column	Description
`bid_acceptance_rate`	Fraction of 15-min bids accepted by the grid
`total_reg_revenue`	Cumulative regulation revenue (USD)
`mean_perf_score`	Mean FERC performance score
`mean_committed_mw`	Mean accepted MW commitment per interval

For hierarchical combo agents (macro+hardware), results are split into *_macro.csv and *_hardware.csv automatically, with a separate hardware-schema row for the inner controller enabling direct comparison with standalone hardware results.

When --record_transitions is enabled, per-step logs are written under runs/<agent>_<scenario>_<agent_type>/episode*.csv.

High-assurance benchmark runner

uv run evaluation/run_ha_benchmark.py --agents simplex_ppo cbf_ppo hj_ppo
uv run evaluation/run_ha_benchmark.py --agents ha_c2g --scenarios default scenario_c --n_episodes 5
uv run evaluation/run_ha_benchmark.py --agents cbf_ppo --record_transitions
uv run evaluation/run_ha_benchmark.py --agents cbf_ppo --no-record_transitions
uv run evaluation/run_ha_benchmark.py --fixed-action bess_dispatch=0.0
uv run evaluation/run_ha_benchmark.py \
  --fixed-action hvac_effort=0.9 \
  --fixed-action bess_dispatch=0.0

Key options:

--agents: HA agents to evaluate
--scenarios: scenarios to run
--n_episodes: number of episodes per agent/scenario
--seed: starting seed
--model_dir: optional override for trained model directory
--output: output CSV path
--record_transitions / --no-record_transitions: enable or disable per-step transition logging
--fixed-action action=value: assign a fixed value to an action

Notes:

These settings allow granular experimentation and control for high-assurance studies as well: you can evaluate whether a safety method still works when specific actuators are pinned to fixed operating points.
The same continuous low-level action ranges apply here: throttle_batch ∈ [0, 1], pump_speed_A ∈ [0, 1], hvac_effort ∈ [0, 1], and bess_dispatch ∈ [-1, 1].
Fixed-action overrides are applied inside the low-level environment before dynamics are applied.
When enabled, transition logs are written under runs/<agent>_<scenario>_ha/episode*.csv.

Plotting episode traces and statistics

After generating transition logs (via --record_transitions in the benchmark runners), you can visualize per-step state, action, observation, and reward traces as aggregated statistics (mean ± 99% CI across episodes). The runners write the per-step episode CSV files under runs/<algo>_<scenario>_<agent_type>/ (e.g., episode0__HVAC_disabled_BESS_0.csv).

Plot episode statistics:

# Basic usage (no ablation)
uv run evaluation/plot_episode_traces.py --algoname bang_bang --scenario default --agent-type hardware

# With ablation filters (plots only episodes matching specific disabled/fixed actions)
uv run evaluation/plot_episode_traces.py \
  --algoname bang_bang \
  --fixed-action pump_speed_A=0.25 \
  --scenario default \
  --agent-type hardware

Outputs:

JPEG: figures/<algo>_<scenario>_<agent_type>[__ABLATION_SUFFIX].jpeg
PDF: figures/<algo>_<scenario>_<agent_type>[__ABLATION_SUFFIX].pdf

Each figure contains one subplot per state/reward column and shows the mean line (solid) with a 99% confidence band (shaded area) computed across all matching episodes.

State variables (blue), with 0–1 reference bounds shown as dashed lines
Cumulative reward components (red)
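As a sketch of how a per-step mean and 99% band can be computed across episodes, the snippet below uses a normal approximation. This is an assumption for illustration; the plotting script may use a different estimator (for example a t-distribution or a bootstrap), and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci(values: list[float], level: float = 0.99) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(values)
    if len(values) < 2:
        return m, m, m   # no spread can be estimated from one episode
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(values) / len(values) ** 0.5
    return m, m - half_width, m + half_width


def band(episodes: list[list[float]], level: float = 0.99):
    """Per-step (mean, lower, upper); truncates to the shortest episode."""
    return [mean_ci(list(step), level) for step in zip(*episodes)]


traces = [[0.2, 0.4, 0.6], [0.3, 0.5, 0.5], [0.1, 0.6, 0.7]]
for step, (m, lo, hi) in enumerate(band(traces)):
    print(step, round(m, 3), round(lo, 3), round(hi, 3))
```

Because `zip` stops at the shortest series, early-terminated episodes shorten the band in this sketch rather than being padded.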

Download real-world data (optional — CSVs are bundled)

uv run python scripts/download_weather.py --year 2024
uv run python scripts/download_energy.py  --year 2024

Monitor training with TensorBoard

All training scripts log scalar metrics (episode reward, episode length, thermal/tracking/SOC penalties, shield interventions) to TensorBoard. Logs are written to the Hydra output directory under tensorboard/.

# Point TensorBoard at the outputs directory to compare all runs:
uv run tensorboard --logdir outputs/

# Or at a specific run:
uv run tensorboard --logdir outputs/ppo_default/seed_42/2026-04-08_21-00-00/tensorboard/

Then open http://localhost:6006 in your browser.

Explore interactively

uv run jupyter lab notebooks/

Note: The optional nrel-pysam BESS backend requires uv pip install nrel-pysam. The environment automatically falls back to the pure-Python _SimpleBESSModel if absent.

11. High-Assurance Safety Controllers

C2G-Bench provides a comprehensive 3-tier high-assurance (HA) safety framework for grid-interactive data center control. All tiers enforce the same 5 hard constraints (C1–C5) and are evaluated with an 11-metric set (6 standard + 5 HA-specific).

Hard Constraints

ID	Constraint	Threshold	Physical Meaning
C1	$T_A < T_{\text{safe}}$	35 °C (margin 1 °C)	Server room A thermal limit
C2	$T_B < T_{\text{safe}}$	35 °C (margin 1 °C)	Server room B thermal limit
C3	SOC ∈ [SOC_min, SOC_max]	[0.10, 0.95] (guard 0.03)	BESS operational envelope
C4	$|\Delta f| < 0.5$ Hz
C5	$V_{\text{pcc}} > 0.90$ pu	0.92 pu trigger	Under-voltage relay threshold
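The five constraints can be expressed as a simple predicate over a plant state. This is a minimal sketch: the `PlantState` fields and the function name are hypothetical, C4 is read as |Δf| < 0.5 Hz, and the guard margins and trigger levels in the table are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantState:
    temp_a_c: float     # server room A temperature, deg C
    temp_b_c: float     # server room B temperature, deg C
    soc: float          # BESS state of charge, 0..1
    delta_f_hz: float   # grid frequency deviation, Hz
    v_pcc_pu: float     # voltage at the point of common coupling, pu


def hard_constraint_status(s: PlantState) -> dict[str, bool]:
    """Map each constraint ID to True when it is satisfied."""
    return {
        "C1": s.temp_a_c < 35.0,
        "C2": s.temp_b_c < 35.0,
        "C3": 0.10 <= s.soc <= 0.95,
        "C4": abs(s.delta_f_hz) < 0.5,
        "C5": s.v_pcc_pu > 0.90,
    }


state = PlantState(temp_a_c=34.0, temp_b_c=36.0, soc=0.15,
                   delta_f_hz=-0.2, v_pcc_pu=0.95)
violated = [cid for cid, ok in hard_constraint_status(state).items() if not ok]
print(violated)
```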

Tier 1 — Hard-Guarantee Methods (provable safety via optimisation)

Method	Shield	Permissiveness	Cost	File
Simplex [Sha 2001]	O(1) analytic worst-case bounds	Conservative	Negligible	`baselines/safety/safety_shield.py`
CBF [Ames 2019]	QP projection into barrier-safe set	Moderate	Low	`baselines/safety/cbf_shield.py`
HJ Reachability	Offline BRS + runtime override	Moderate	Offline high, runtime low	`baselines/safety/hj_shield.py`
MPC Safety Filter	Receding-horizon constrained NLP	Most permissive	Highest online	`baselines/safety/mpc_safety_filter.py`

Simplex Shield — Three Usage Modes

# 1. Standalone filter — works with ANY agent
from baselines.safety.safety_shield import SafetyShield
shield = SafetyShield()
safe_action, was_modified, info = shield.filter(raw_action, obs)

# 2. Gymnasium wrapper — agent trains inside safe manifold
from c2g_env import C2GFastEnv
from baselines.safety.safety_shield import ShieldedEnv
env = ShieldedEnv(C2GFastEnv(scenario="default"))

# 3. SB3-compatible agent wrapper — for evaluation
from baselines.safety.safety_shield import ShieldedAgent
safe_agent = ShieldedAgent(trained_agent, env)

Training with Tier 1 Shields

# Simplex-shielded PPO
uv run python baselines/safety/train_shielded_ppo.py scenario=default experiment.seed=42

# CBF-shielded PPO (QP-based, more permissive than Simplex)
uv run python baselines/safety/train_cbf_ppo.py scenario=default

# HJ reachability-shielded PPO (offline BRS computation)
uv run python baselines/safety/train_hj_ppo.py scenario=default

# MPC safety filter PPO (receding-horizon, most permissive)
uv run python baselines/safety/train_mpcsf_ppo.py scenario=default

Tier 2 — Constrained RL (soft constraint satisfaction during training)

Method	Mechanism	File
PPO-Lagrangian	Adaptive Lagrange multipliers for 3 cost types	`baselines/safety/train_ppo_lagrangian.py`
CPO [Achiam 2017]	Trust-region with conjugate gradient + line search	`baselines/safety/train_cpo.py`
Shield Reward Shaping	Fixed quadratic distance-to-boundary penalties	`baselines/safety/train_shield_reward_shaping.py`
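The adaptive-multiplier idea behind PPO-Lagrangian can be sketched as a projected dual-ascent step: each multiplier rises while its cost exceeds a limit and decays toward zero otherwise. The cost names, limits, and learning rate below are illustrative assumptions, not values from `train_ppo_lagrangian.py`.

```python
def update_multiplier(lam: float, episode_cost: float,
                      cost_limit: float, lr: float = 0.05) -> float:
    """Dual ascent step, projected so the multiplier stays non-negative."""
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalised_reward(reward: float, costs: dict[str, float],
                     lambdas: dict[str, float]) -> float:
    """Reward minus the multiplier-weighted sum of constraint costs."""
    return reward - sum(lambdas[name] * cost for name, cost in costs.items())


# Hypothetical cost types and limits, for illustration only.
lambdas = {"thermal": 0.0, "soc": 0.0, "tracking": 0.0}
limits = {"thermal": 1.0, "soc": 1.0, "tracking": 1.0}
observed = {"thermal": 2.0, "soc": 0.5, "tracking": 1.0}

for name in lambdas:
    lambdas[name] = update_multiplier(lambdas[name], observed[name], limits[name])
print(lambdas)
```

Only the cost that exceeded its limit (here the hypothetical thermal cost) ends up with a positive multiplier after the update.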

Tier 3 — Neuro-Symbolic HA-C2G Architecture

The full HA-C2G pipeline is a 3-layer neuro-symbolic architecture:

Layer 1 — Concept Bottleneck Model (baselines/safety/concept_bottleneck.py): Maps raw 18-D obs → ~10 interpretable safety concepts (thermal margins, SOC health, etc.)
Layer 2 — Safe Projection Gate (baselines/safety/safe_projection.py): Concept-guided differentiable projection that blends policy actions toward safe priors based on learned pass-through gates; applied consistently during training and evaluation for ha_c2g and cbm_gate
Layer 3 — Physics Rule Shield: In-the-loop Simplex shield with shield-penalty reward
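The gate in Layer 2 can be pictured as an element-wise convex blend between the policy action and a safe prior. The sketch below is a plain-Python illustration under that assumption; the real module is described as differentiable and concept-guided, and the function name and example values here are hypothetical.

```python
def gated_blend(policy_action: list[float], safe_prior: list[float],
                gates: list[float]) -> list[float]:
    """Blend g * a_policy + (1 - g) * a_safe per action dimension."""
    if not (len(policy_action) == len(safe_prior) == len(gates)):
        raise ValueError("all inputs must have the same length")
    blended = []
    for a, s, g in zip(policy_action, safe_prior, gates):
        g = min(1.0, max(0.0, g))   # keep the pass-through gate in [0, 1]
        blended.append(g * a + (1.0 - g) * s)
    return blended


# Gate of 1 passes the policy action through; gate of 0 returns the prior.
print(gated_blend([1.0, -1.0, 0.8], [0.0, 0.0, 0.4], [1.0, 0.0, 0.5]))
```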

Ablation studies isolate each layer's contribution:

Variant	CBM	Gate	Shield	File
HA-C2G (full)	✅	✅	✅	`baselines/safety/train_ha_c2g.py`
CBM-Only	✅	❌	❌	`baselines/safety/train_cbm_only.py`
CBM+Gate	✅	✅	❌	`baselines/safety/train_cbm_gate.py`
CBM+Shield	✅	❌	✅	`baselines/safety/train_cbm_shield.py`

Proof trees (baselines/safety/proof_tree.py) generate per-timestep hierarchical audit logs documenting which safety rules passed/failed and the sensor readings grounding each decision.

HA Evaluation Metrics (11-D)

Category	Metric	Description
Standard	`mean_reward`	Mean episode reward
Standard	`tracking_rmse`	RegD tracking RMSE
Standard	`thermal_viol_rate`	Fraction of ticks with thermal violation
Standard	`throughput_ratio`	Fraction of max IT capacity served
Standard	`bess_degradation`	Battery capacity fade over episode
Standard	`survival_rate`	Fraction of episodes surviving to 24 h
HA	`hard_violation_rate`	Rate of C1–C5 constraint violations
HA	`shield_intervention_rate`	How often the shield overrides the agent
HA	`constraint_margin`	Mean distance from nearest constraint boundary
HA	`worst_case_margin`	Minimum margin across all constraints
HA	`computational_overhead_ms`	Per-step shield compute time

Evaluation & Analysis Tools

# HA benchmark evaluation (11 metrics across all HA agents)
uv run evaluation/run_ha_benchmark.py --agents simplex_ppo cbf_ppo hj_ppo mpcsf_ppo ha_c2g

# HA-specific plots (Pareto frontier, radar, violin, LaTeX table)
uv run evaluation/generate_ha_plots.py

# Failure-case analysis (where/why/how often agents fail)
uv run evaluation/failure_analysis.py

# Statistical analysis (bootstrap CIs, Welch's t-test, Cohen's d)
uv run evaluation/statistical_analysis.py
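For readers unfamiliar with the statistics named above, the following sketch computes Welch's t statistic (with Welch–Satterthwaite degrees of freedom) and Cohen's d from two samples using only the standard library. The numbers are made up for illustration, the p-value lookup is omitted, and `statistical_analysis.py` may be implemented differently.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                       / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Made-up per-seed scores for two hypothetical agents.
agent_x = [1.0, 2.0, 3.0, 4.0, 5.0]
agent_y = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(agent_x, agent_y)
print(round(t_stat, 3), round(dof, 1), round(cohens_d(agent_x, agent_y), 3))
```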

12. Strategic Value

For the Energy System

Renewable Integration: Data centers absorb excess wind/solar, preventing curtailment.
Grid Stability: The DC acts as a "shock absorber" for the transmission grid, reducing reliance on fossil-fuel peaker plants.

For AI Research (NeurIPS 2026)

Cyber-Physical Benchmark: The first high-fidelity, multi-market testbed for hierarchical RL on real infrastructure physics.
Six Global Markets: NYISO, PJM, CAISO, ERCOT, ENTSO-E, AEMO — largest DC hubs on Earth.
DOE Genesis Alignment: 250 MW–1 GW scale matches the US national AI infrastructure program.

13. Citation

@inproceedings{c2gbench2026,
  title     = {{C2G-Bench}: A Cyber-Physical Benchmark for Grid-Interactive
               Hyperscale Data Centres},
  author    = {Anonymous},
  booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
  year      = {2026},
}

14. Figure Gallery

All figures are generated by the notebooks in notebooks/ and can be reproduced by running uv run jupyter lab notebooks/.

Workload Traces (`01_workload.ipynb`)

Left-to-right, top-to-bottom: Fused workload time-series (rigid GenAI/DLRM + flexible batch + spot); power histogram per trace type; GenAI spike characterisation showing burst magnitude and inter-arrival distribution; DVFS throttle curve mapping throttle level to actual batch power reduction.

Thermal Twin (`02_thermal.ipynb`)

Step response (cold-start to thermal equilibrium); steady-state map (temperature vs. pump speed); COP degradation with ambient temperature (ERCOT 40 °C peak visible); HVAC parameter sweep; fault injection showing temperature excursion under 60% pump efficiency (Scenario C).

Electrical Chain & BESS (`03_electrical_bess.ipynb`)

Power breakdown across the facility electrical chain (IT → UPS → PDU → transformer); PUE surface showing how ambient temperature and load interact; UPS non-linear efficiency curve; BESS charge/discharge cycle with SOC tracking and capacity fade; round-trip efficiency vs. C-rate.

Macro-Grid Signal (`04_macro_grid.ipynb`)

24-hour regulation signal (AR(1) calibrated per market); power spectral density and statistics; LMP proxy across 6 global markets showing diurnal and seasonal patterns; ACF plot confirming AR(1) calibration quality.

Environment API & Rollouts (`05_environments.ipynb`)

C2GFastEnv 24-hour rollout (temperature, SOC, power, reward traces); reward component breakdown (tracking error, thermal penalty, SOC penalty, freq/voltage penalties); step-reward distribution; observation space coverage showing all 18 dimensions are exercised; C2GMacroEnv rollout at 15-min resolution; cross-scenario reward comparison.

Weather Data (`06_weather.ipynb`)

Annual temperature profiles for all 6 markets (NYC, DCA, SJC, DFW, FRA, BKT); annual distribution; diurnal patterns by season; synthetic vs. real NOAA ISD validation; implied COP showing how weather drives cooling cost; normalised histograms; southern hemisphere seasonal inversion (AEMO NSW summer = January).

Energy Markets (`07_energy_markets.ipynb`)

Annual grid load (NYISO 11-zone, PJM DOM, CAISO PG&E, ERCOT North, ENTSO-E DE, AEMO NSW); diurnal patterns; load duration curves + LMP distribution; grid stress indicator calibration used by macro_grid.py; joint weather–energy distribution (ambient temperature vs. LMP — key for thermal-economic co-optimisation).

Evaluation Scenarios (`09_evaluation_scenarios.ipynb`)

Parameter overview (6 bar charts: T_amb, committed MW, BESS SOC₀, GenAI scale, grid stress, cooling efficiency); radar chart showing the overall stress fingerprint of each scenario; 2-hour rollout traces across all 4 scenarios for 5 physical signals; termination risk under 30 random-policy episodes; cumulative reward gap between scenarios; 24-configuration grid of all Scenario × Market pairings.

Zone temperature comparison across all 4 scenarios showing the thermal headroom difference driven by T_amb and committed MW settings.
