This project addresses the "AI-Energy Paradox" by transforming 250 MW+ hyperscale data centers from passive power consumers into active, grid-balancing assets. By establishing a formal Energy System Handshake, we enable data centers to provide wholesale Frequency Regulation, stabilizing the regional transmission grid in exchange for significant revenue and faster deployment permits.
We solve this using a Hierarchical AI Orchestration framework that bridges long-term energy market bidding (minutes/hours) and sub-second hardware physics. The framework evaluates the synergy between three critical control levers: Throttling Batch Workloads (DVFS), Modulating Cooling Thermal Inertia (CDU pump), and Dispatching Battery Energy Storage (BESS). This project delivers a high-fidelity cyber-physical benchmark for NeurIPS 2026, at the frontier of autonomous, grid-interactive infrastructure.
Every power grid must hold its frequency within a narrow band around the nominal 60 Hz (US) or 50 Hz (EU) at all times. When a generator trips offline or a large load turns on suddenly, frequency deviates. Grid operators use Automatic Generation Control (AGC) to recruit fast-response providers: assets that can inject or absorb power within seconds to correct the imbalance.
FERC Order 755 (2011) created a pay-for-performance market for exactly this. Instead of paying only for available capacity (MW committed), it mandates that grid operators also pay for accuracy — how precisely an asset tracks the real-time regulation signal. PJM (the largest US grid operator) implemented this as the RegD signal: the "D" stands for dynamic, meaning it is designed for fast-response resources such as batteries and flexible loads.
Every 2–5 seconds, the grid operator broadcasts a normalized score:

$$\text{RegD}(t) \in [-1, +1]$$

The sign convention is:
| Signal value | Grid instruction | Data center must... |
|---|---|---|
| +1 | Grid has excess load — reduce grid draw | Shed batch load, discharge BESS, or slow cooling |
| −1 | Grid has excess generation — absorb more | Increase batch load, charge BESS, or raise cooling |
| 0 | Balanced | Hold current power level |
The actual MW response required is:

$$\Delta P_{\text{demand}}(t) = \text{committed\_mw} \times \text{RegD}(t)$$

where `committed_mw` is the regulation capacity the data center has pre-contracted to the market for the current 15-minute settlement interval.
The RegD signal is statistically modelled as a first-order autoregressive (AR(1)) process — persistent but zero-mean. At the 5-minute scale it has autocorrelation ρ ≈ 0.80, which time-scales to ρ ≈ 0.997 at the 5-second simulation step used in C2G-Bench. The signal averages to zero over a settlement period, meaning the data center neither gains nor loses net energy from providing regulation.
In `c2g_env/physics/macro_grid.py`:

```python
self._regd_state = rho * self._regd_state + sigma * noise  # AR(1)
regd = np.clip(self._regd_state, -1.0, 1.0)                # normalise to [-1, 1]
```

Under FERC Order 755, the performance score is the correlation between the demanded signal and the actual response. A score of 1.0 = perfect tracking; 0.0 = random; below a threshold (typically 0.75) results in zero payment and market suspension. This maps directly to the β tracking term in the C2G reward function.
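The AR(1) model and the correlation-based score described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code in `macro_grid.py`: the function names, the `stationary_std` value, and the 30 MW commitment are assumptions made for the example. Note that the exact 60th root of 0.80 works out to roughly 0.996, close to the ≈ 0.997 quoted above.

```python
import math
import random


def rho_per_step(rho_5min: float = 0.80, steps_per_5min: int = 60) -> float:
    """Rescale a 5-minute autocorrelation to a single 5-second step."""
    return rho_5min ** (1.0 / steps_per_5min)


def simulate_regd(n_steps: int, rho: float, stationary_std: float = 0.5,
                  seed: int = 0) -> list:
    """Zero-mean AR(1) signal clipped to [-1, 1].

    stationary_std is an illustrative assumption, not a calibrated value.
    """
    rng = random.Random(seed)
    sigma = stationary_std * math.sqrt(1.0 - rho * rho)  # innovation std
    state, out = 0.0, []
    for _ in range(n_steps):
        state = rho * state + sigma * rng.gauss(0.0, 1.0)
        out.append(max(-1.0, min(1.0, state)))
    return out


def performance_score(demanded, delivered) -> float:
    """Pearson correlation between demanded and delivered MW responses."""
    n = len(demanded)
    mean_d = sum(demanded) / n
    mean_a = sum(delivered) / n
    cov = sum((d - mean_d) * (a - mean_a) for d, a in zip(demanded, delivered))
    var_d = sum((d - mean_d) ** 2 for d in demanded)
    var_a = sum((a - mean_a) ** 2 for a in delivered)
    return cov / math.sqrt(var_d * var_a)


rho = rho_per_step()
signal = simulate_regd(17_280, rho)            # one 24 h episode at 5 s per step
committed_mw = 30.0                            # assumed commitment for the example
demand_mw = [committed_mw * s for s in signal]
perfect = performance_score(demand_mw, demand_mw)
inverted = performance_score(demand_mw, [-d for d in demand_mw])
```

A provider that tracks the signal exactly scores 1.0, while one that responds with the wrong sign scores -1.0, which is why the sign convention in the table above matters as much as the response speed.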
A 250 MW hyperscale facility has three fast-response levers unavailable to most grid assets:
- Batch compute DVFS: schedulable HPC/AI training jobs can be throttled in milliseconds via CPU/GPU frequency scaling. Service capacity is capped at `p_flex_max × throttle` (~90 MW); unserved work is deferred into a FIFO queue (not dropped) and served when capacity recovers. Average queue delay is tracked via Little's Law and exposed in `obs[16]` (`backlog_norm`).
- BESS: the on-site 15 MWh / 5 MW battery can charge or discharge at full rate in under 100 ms, providing the fastest regulation response.
- Thermal inertia (CDU pump): the liquid cooling loop acts as a thermal capacitor (τ ≈ 12.7 min). Slowing the pump briefly stores heat in the water loop without immediately raising server temperatures, providing ~5–10 MW of additional regulation headroom for short intervals.
These three levers in combination can follow a RegD signal far more accurately than a single-asset provider, while the hierarchical RL agent learns the optimal trade-off between grid revenue, compute throughput, and thermal safety.
Current data center management systems are "grid-blind": they optimize internal efficiency (PUE) while ignoring the real-time needs of the regional energy system.
- The Grid Need: Modern grids require large loads to respond to Frequency Regulation signals (e.g., PJM RegD) every 2–4 seconds to balance renewable energy volatility.
- The Datacenter Barrier: Standard AI controllers cannot track these high-speed signals because they do not account for the non-linear physics of liquid cooling, battery degradation, and the bursty nature of GenAI workloads.
- The Objective: Create a synergy where the data center matches the grid's power signal perfectly without violating hardware safety limits or AI training SLAs.
| SOTA | Gap | Our Step Further |
|---|---|---|
| Wang et al., 2019 — Proved DCs can follow grid signals using DVFS. | Used "dummy loads" to intentionally waste power to meet the signal. | We use BESS + thermal storage synergy — no wasted power. |
| Fu et al., 2021 — Demonstrated cooling systems have "thermal inertia" for grid services. | Relies on classical MPC, which fails under unpredictable GenAI serving spikes. | We replace MPC with Hierarchical RL to handle extreme, non-linear volatility of Alibaba GenAI traces. |
| Li et al., 2026 — Identifies the need for intelligent VPP aggregation. | Lacks a standardized, high-fidelity physical testbed for datacenters. | We provide the first 250 MW-scale evaluation testbed with real data across 6 global energy markets. |
C2G-Bench defines a two-level hierarchical Markov Decision Process. The two agents share no parameters and communicate only through the inner_action_fn interface.
| Element | Definition |
|---|---|
| State | Normalised observation vector (see §5.2 for index definitions) |
| Action | Continuous 4-D action: throttle, pump speed, HVAC effort, BESS dispatch |
| Transition | Deterministic physics step + stochastic AR(1) RegD signal (see §2) |
| Reward | 7-term scalar reward (see §5.3) |
| Discount | Training discount; undiscounted episodic sum used for benchmark ranking |
| Horizon | 17,280 steps per episode (24 h at 5 s per step) |
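The only stochasticity in the transition is the AR(1) RegD signal; the physics step itself is deterministic.
Terminal states: the episode ends early on three hard constraints: thermal fault ($T_A > 35\,°\text{C}$ or $T_B > 35\,°\text{C}$), frequency fault ($|f - f_{\text{nom}}| > 0.5$ Hz), and voltage fault ($v_{\text{pcc}} < 0.90$ pu).
The macro agent is framed as a Semi-MDP (Sutton et al., 1999) with a fixed option duration of 180 inner steps (one 15-minute settlement interval).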
| Element | Definition |
|---|---|
| State | Aggregated sub-step states: component-wise means + SOC endpoint + extrema + market context |
| Action | 2-D: `bid_mw_norm` (MW capacity to offer), `bid_price_norm` (asking price) |
| Reward | $\lambda_{\text{rev}} \times \text{regulation revenue} + \bar{r}_K - \lambda_{\text{elec}} \times \text{electricity cost} - \lambda_{\text{churn}} \times \lvert \text{bid\_mw\_now} - \text{bid\_mw\_prev} \rvert$ |
| Horizon | 96 macro steps per episode (24 h at 15 min per macro step) |
The macro agent never directly observes the 5-second physics; it sees only the aggregated summary of each 15-minute interval.
Manages the "Business Handshake." Observes regional market prices, weather forecasts, and the Alibaba batch job queue.
Decision: "How much MW capacity should I bid to the grid operator, and at what price, for the next 15 minutes?"
Grid operators (PJM, ERCOT, etc.) clear ancillary-service markets in 15-minute settlement intervals. The MacroEnv implements a 3-phase market handshake each macro step: (1) the grid posts its RMCP and residual regulation need, (2) the DC agent bids MW capacity at an asking price, and (3) the grid probabilistically accepts the bid via a sigmoid function. If rejected, the DC falls back to a standing Demand Response (DR) baseline contract. The macro agent's challenge is to bid optimally under uncertainty about the next 180 RegD ticks, the next GenAI spike, and how much thermal headroom will remain at the end of the interval. The correct strategy is context-dependent — bid aggressively when the BESS is full and LMP is high; bid conservatively when ambient temperature is near the thermal limit or SOC is low. The macro agent never sees individual 5-second ticks; it only receives an aggregated summary after the interval completes, making this a partially observable planning problem.
- Action Space (2-D): `[bid_mw_norm ∈ [0,1], bid_price_norm ∈ [0,1]]`. MW capacity to offer (mapped to `[0, committed_max_mw]`) and asking price (mapped to `[0, 2 × rmcp_max]`).
- Observation Space (19-D): Aggregated over 180 sub-steps:

  | Index | Name | Range | Description |
  |---|---|---|---|
  | 0 | `temp_A_mean` | [0, 1] | Mean Zone A temperature / T_safe |
  | 1 | `temp_B_mean` | [0, 1] | Mean Zone B temperature / T_safe |
  | 2 | `bess_soc_end` | [0, 1] | SOC at end of the macro-step |
  | 3 | `p_base_mean` | [0, 1] | Mean p_base_norm |
  | 4 | `p_facility_mean` | [0, 2] | Mean p_facility_norm |
  | 5 | `regd_mean` | [0, 1] | Mean |
  | 6 | `lmp_mean` | [0, 1] | Mean lmp_norm |
  | 7 | `grid_load_mean` | [0, 1] | Mean load_norm |
  | 8 | `tracking_err_mean` | [0, 2] | Mean |
  | 9 | `is_spike_any` | {0, 1} | 1.0 if any sub-step had a GenAI spike |
  | 10 | `thermal_headroom_A` | [0, 1] | (T_safe − T_A_max) / T_safe |
  | 11 | `thermal_headroom_B` | [0, 1] | (T_safe − T_B_max) / T_safe |
  | 12 | `bid_mw_prev_norm` | [0, 1] | Previous macro-action bid MW |
  | 13 | `bid_price_prev_norm` | [0, 1] | Previous macro-action bid price |
  | 14 | `freq_dev_mean` | [-1, 1] | Mean normalised frequency deviation |
  | 15 | `v_pcc_mean` | [0, 1.1] | Mean PCC voltage (per-unit) |
  | 16 | `backlog_norm_mean` | [0, 2] | Mean batch queue depth / p_flex_max |
  | 17 | `rmcp_norm` | [0, 5] | Grid's posted RMCP / rmcp_max |
  | 18 | `reg_need_norm` | [0, 5] | Grid's residual regulation need / committed_max |

- Reward: $R_{\text{macro}} = \lambda_{\text{rev}} \times \text{regulation\_revenue} / 1000 + \bar{r}_K - \lambda_{\text{elec}} \times \text{electricity\_cost} / 1000 - \lambda_{\text{churn}} \times |\text{bid\_mw\_now} - \text{bid\_mw\_prev}|$
  - $\lambda_{\text{rev}} = 1.0$, $\lambda_{\text{elec}} = 0.5$, $\lambda_{\text{churn}} = 0.05$ (all in `c2g_env/config.yaml`)
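The macro step described above (action mapping, sigmoid acceptance, reward) can be sketched as follows. The action mapping and the reward weights come from the text; the shape and `steepness` of the acceptance sigmoid are hypothetical, since the source only says that acceptance is sigmoid-based, and `r_bar_k` is passed in as an opaque value for the $\bar{r}_K$ term.

```python
import math


def map_bid(bid_mw_norm, bid_price_norm, committed_max_mw, rmcp_max):
    """Map the normalised 2-D action to physical units (ranges from the text)."""
    bid_mw = bid_mw_norm * committed_max_mw        # [0, committed_max_mw]
    ask_price = bid_price_norm * 2.0 * rmcp_max    # [0, 2 x rmcp_max]
    return bid_mw, ask_price


def acceptance_probability(ask_price, rmcp, steepness=5.0):
    """Hypothetical sigmoid: asks below the posted RMCP are more likely to clear."""
    return 1.0 / (1.0 + math.exp(steepness * (ask_price - rmcp) / rmcp))


def macro_reward(regulation_revenue, r_bar_k, electricity_cost,
                 bid_mw_now, bid_mw_prev,
                 lam_rev=1.0, lam_elec=0.5, lam_churn=0.05):
    """Macro reward with the default weights listed in the text."""
    return (lam_rev * regulation_revenue / 1000.0
            + r_bar_k
            - lam_elec * electricity_cost / 1000.0
            - lam_churn * abs(bid_mw_now - bid_mw_prev))
```

With these defaults, an ask exactly at the RMCP clears half the time, and a 2 MW change in bid between consecutive intervals costs 0.1 reward through the churn term.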
Executes the physical "Handshake." Receives the real-time frequency regulation signal and uses four physical levers. The central difficulty is that these levers have fundamentally different dynamics, costs, and side-effects — the agent must learn to combine them in the right order:
| Lever | Response time | Capacity | Side-effect | Role |
|---|---|---|---|---|
| BESS `action[3]` | <100 ms | 5 MW / 15 MWh | Depletes; capacity fade over time | First resort: fastest, zero-penalty, finite |
| CDU pump `action[1]` | Minutes (τ ≈ 12.7 min) | ~30 MW equivalent | Thermal inertia; slow and partially irreversible on short intervals | Second resort: exploits physics cheaply |
| HVAC `action[2]` | Seconds | ~50 MW draw | Affects Zone B only; draws additional facility power | Defensive: prevents thermal fault, not a primary regulation lever |
| IT throttle `action[0]` | Milliseconds | Up to full flex load | Accrues FIFO backlog; cuts throughput revenue | Last resort: immediate but highest SLA cost |
The optimal policy learns this hierarchy: use BESS first (fast, free), borrow thermal inertia second (slow, cheap), fall back to DVFS only when both are exhausted. This mirrors how FERC-paid fast-response providers operate in real ancillary-service markets.
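As a point of comparison for the learned policy, the same priority order can be written as a fixed rule. The sketch below is a hypothetical greedy allocator, not one of the shipped baselines; the per-lever capacities are illustrative placeholders chosen for the example.

```python
# Priority order from the text: BESS first, thermal inertia second, DVFS last.
# The MW caps are illustrative assumptions, not values read from config.yaml.
DEFAULT_CAPS_MW = (("bess", 5.0), ("cdu_pump", 10.0), ("dvfs", 30.0))


def dispatch_by_priority(delta_mw, caps_mw=DEFAULT_CAPS_MW):
    """Split a signed MW request across levers in fixed priority order.

    Returns a dict of signed MW per lever plus any 'unmet' remainder.
    """
    sign = 1.0 if delta_mw >= 0 else -1.0
    remaining = abs(delta_mw)
    allocation = {}
    for lever, cap in caps_mw:
        used = min(remaining, cap)      # take as much as this lever allows
        allocation[lever] = sign * used
        remaining -= used
    allocation["unmet"] = sign * remaining
    return allocation
```

A 12 MW request uses the full 5 MW of BESS and 7 MW of thermal headroom and leaves DVFS untouched. A rule like this ignores state of charge and zone temperature, which is the gap the RL agent is expected to close.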
| Lever | Action dim | Range | Effect |
|---|---|---|---|
| IT (DVFS) | `action[0]` | [0, 1] | Throttles schedulable Alibaba batch jobs; GenAI/DLRM rigid loads unaffected |
| Cooling (CDU pump) | `action[1]` | [0, 1] | Modulates liquid cooling pump speed, exploiting thermal inertia |
| HVAC | `action[2]` | [0, 1] | Zone B air-side fan speed |
| BESS | `action[3]` | [-1, 1] | Charge (−) / discharge (+) the 15 MWh battery |
- Action Space (4-D, continuous): `[throttle_batch, pump_speed_A, hvac_effort, bess_dispatch]`
- Observation Space (18-D, normalised):

  | Index | Name | Range | Description |
  |---|---|---|---|
  | 0 | `temp_A_norm` | [0, 2] | Zone A (liquid-cooled GPU) temperature / T_safe |
  | 1 | `temp_B_norm` | [0, 2] | Zone B (air-cooled CPU) temperature / T_safe |
  | 2 | `bess_soc` | [0, 1] | Battery state of charge |
  | 3 | `p_base_norm` | [0, 1] | Rigid IT load (GenAI + DLRM) |
  | 4 | `p_flex_nom_norm` | [0, 1] | New batch arrivals this tick (trace demand) |
  | 5 | `p_facility_norm` | [0, 2] | Total facility power |
  | 6 | `regd_signal` | [-1, 1] | Grid regulation signal (signed) |
  | 7 | `lmp_norm` | [0, 1] | Locational marginal price |
  | 8 | `grid_load_norm` | [0, 1] | Regional grid load stress indicator |
  | 9 | `is_spike` | {0, 1} | GenAI serving spike flag |
  | 10 | `prev_throttle` | [0, 1] | Previous DVFS throttle |
  | 11 | `prev_pump_speed` | [0, 1] | Previous pump speed |
  | 12 | `pue_norm` | [0, 2] | Current Power Usage Effectiveness |
  | 13 | `T_amb_norm` | [0, 1] | Ambient temperature |
  | 14 | `freq_dev_norm` | [-1, 1] | Normalised grid frequency deviation (swing equation) |
  | 15 | `v_pcc_pu` | [0, 1.1] | PCC voltage in per-unit (Thévenin model) |
  | 16 | `backlog_norm` | [0, 2] | Deferred batch queue depth / p_flex_max (Little's Law queue) |
  | 17 | `committed_mw_norm` | [0, 1] | Current DR commitment / committed_mw_max |
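To make the link between the two observation spaces concrete, the sketch below aggregates a list of 18-D inner observations into a handful of the 19-D macro features. It is a partial illustration under stated assumptions, not the `C2GMacroEnv` implementation: the function name is hypothetical, only five features are shown, and the headroom is clipped at zero to respect the documented [0, 1] range.

```python
def aggregate_macro_features(sub_obs):
    """Aggregate the inner observations of one macro step (nominally 180).

    Index meanings follow the 18-D table above:
    0 = temp_A_norm, 2 = bess_soc, 9 = is_spike, 16 = backlog_norm.
    """
    n = len(sub_obs)

    def mean(index):
        return sum(obs[index] for obs in sub_obs) / n

    hottest_a = max(obs[0] for obs in sub_obs)  # T_A_max / T_safe
    return {
        "temp_A_mean": mean(0),
        "bess_soc_end": sub_obs[-1][2],                       # endpoint, not mean
        "is_spike_any": 1.0 if any(obs[9] > 0.5 for obs in sub_obs) else 0.0,
        # (T_safe - T_A_max) / T_safe == 1 - max(temp_A_norm)
        "thermal_headroom_A": max(0.0, 1.0 - hottest_a),
        "backlog_norm_mean": mean(16),
    }
```

Mixing means, an endpoint, and an extremum is deliberate: the mean hides a brief thermal excursion that the headroom feature keeps visible to the macro agent.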
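The scalar reward received at every 5-second tick has seven additive terms:

$$r_t = \alpha\,u_{\text{thr}} - \beta\,\frac{|\Delta P_{\text{demand}} - \Delta P_{\text{actual}}|}{P_{\text{norm}}} - \gamma\,(T - T_{\text{warn}})^{+} - \delta_{\text{soc}}\,\mathbf{1}_{\text{soc}} - \delta_f\,(|\Delta f| - 0.2)^{+} - \delta_v\,\varepsilon_v - \delta_q\,\frac{Q_{\text{backlog}}}{P_{\text{flex,max}}}$$

where: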
- $(x)^{+} = \max(0,\, x)$: ReLU / hinge; only positive exceedances are penalised
- $u_{\text{thr}} \in [0,1]$: DVFS throttle fraction; fraction of flexible batch capacity currently committed
- $\Delta P_{\text{demand}} = C_{\text{MW}} \times \text{RegD}(t)$: MW change requested by the grid operator this tick
- $\Delta P_{\text{actual}} = P_{\text{flex,served}} + P_{\text{BESS,actual}}$: MW change the DC actually delivered
- $P_{\text{norm}} = C_{\text{MW}} \times 1000$: normalisation constant (converts tracking error to a [0, ~2] range)
- $T$: temperature of the hotter of the two cooling zones (°C)
- $T_{\text{warn}} = 33\,°\text{C}$: soft warning threshold; thermal penalty begins here, 2 °C before the hard trip
- $\mathbf{1}_{\text{soc}}$: binary flag: 1 if BESS state-of-charge is below $\text{SOC}_{\min} + 2\%$ (i.e. below 12%), else 0
- $|\Delta f|$: absolute grid frequency deviation (Hz) from the 60 Hz nominal
- $\varepsilon_v = (0.95 - v_{\text{pcc}})^{+} + (v_{\text{pcc}} - 1.05)^{+}$: PCC voltage exceedance (pu) outside the ANSI C84.1 Range A band $[0.95, 1.05]$
- $Q_{\text{backlog}}$: deferred batch work currently sitting in the FIFO queue (kW-equivalent)
- $P_{\text{flex,max}} \approx 90{,}000\,\text{kW}$: peak flexible IT capacity at full throttle (1,200 racks × 75 kW)
- Coefficients (all in `config.yaml`): $\alpha{=}1.0$, $\beta{=}2.0$, $\gamma{=}5.0$, $\delta_{\text{soc}}{=}0.5$, $\delta_f{=}2.0$, $\delta_v{=}5.0$, $\delta_q{=}2.0$
| # | Term | Coefficient | What it measures | Why it matters |
|---|---|---|---|---|
| 1 | Throughput | $\alpha = 1.0$ | Fraction of max IT capacity actually committed ($u_{\text{thr}}$) | Maximising revenue: the agent earns more for accepting more DVFS workload |
| 2 | RegD tracking | $\beta = 2.0$ | Normalised absolute error between the FERC-requested power change and what the DC actually delivered | The primary ancillary-service obligation: missing this is penalised twice as hard as raw throughput gains |
| 3 | Thermal overrun | $\gamma = 5.0$ | Degrees above the warning threshold $T_{\text{warn}} = 33\,°\text{C}$ for the hotter of the two cooling zones | Linear ramp that starts before the hard 35 °C trip |
| 4 | BESS SoC | $\delta_{\text{soc}} = 0.5$ | Binary flag: 1 if the battery state-of-charge falls below $\text{SOC}_{\min} + 2\%$ (12%) | Flat per-tick penalty prevents the BESS from being stranded near empty when a RegD ramp arrives |
| 5 | Frequency deviation | $\delta_f = 2.0$ | Frequency excursion beyond the ±0.2 Hz NERC dead-band | Proportional penalty that steepens as the grid approaches the ±0.5 Hz trip threshold |
| 6 | Voltage deviation | $\delta_v = 5.0$ | One-sided penalty for PCC voltage outside [0.95, 1.05] pu | Voltage violations are fast and dangerous; the large coefficient forces early corrective action |
| 7 | SLA backlog | $\delta_q = 2.0$ | FIFO queue depth normalised by peak flexible capacity | Deferred batch jobs accumulate in queue; this term penalises latency and incentivises draining the queue |
Terms 1 and 2 are structurally opposed:
- Higher throttle ($u_{\text{thr}} \uparrow$) → more revenue from IT (term 1 ↑) but increases the power baseline, making it harder to deliver a downward RegD ramp accurately (term 2 ↓).
- Lower throttle ($u_{\text{thr}} \downarrow$) → improves tracking flexibility but sacrifices revenue and grows the backlog (term 7 ↓).
The optimal agent learns a lever hierarchy: use BESS charge/discharge first (zero-penalty, fast), then exploit thermal inertia of the cooling system (slow, cheap), and only fall back to DVFS throttling as a last resort. This mirrors real-world FERC-paid frequency regulation.
All coefficients are chosen so that terms land in the same numerical range under typical operation:
- $\alpha = 1$ → throughput at $u_{\text{thr}} = 0.8$ contributes $+0.8$ per tick
- $\beta = 2$ → a 40% normalised tracking error contributes $-0.8$ per tick
- $\gamma = 5$ → 1 °C overshoot contributes $-5$ per tick, dominating immediately
- $\delta_v = 5$ → 5% voltage sag contributes $-0.25$ per tick, matching the thermal scale
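The per-tick magnitudes listed above can be checked with a small reference function. This is a sketch assembled from the term definitions in this section, not a copy of the environment code. Two points are assumptions: power changes are taken in kW so that $P_{\text{norm}} = C_{\text{MW}} \times 1000$ makes the tracking error dimensionless, and the frequency term is modelled as a hinge on the ±0.2 Hz dead-band.

```python
def hinge(x):
    """(x)^+ = max(0, x)."""
    return max(0.0, x)


def fast_reward(u_thr, dp_demand_kw, dp_actual_kw, committed_mw,
                temp_hot_c, soc, freq_dev_hz, v_pcc_pu, backlog_kw, *,
                alpha=1.0, beta=2.0, gamma=5.0, d_soc=0.5, d_f=2.0,
                d_v=5.0, d_q=2.0, t_warn_c=33.0, soc_min=0.10,
                freq_deadband_hz=0.2, p_flex_max_kw=90_000.0):
    """Seven-term per-tick reward using the default coefficients from the text."""
    p_norm = committed_mw * 1000.0                       # kW, assumed unit
    tracking = abs(dp_demand_kw - dp_actual_kw) / p_norm
    soc_flag = 1.0 if soc < soc_min + 0.02 else 0.0
    eps_v = hinge(0.95 - v_pcc_pu) + hinge(v_pcc_pu - 1.05)
    return (alpha * u_thr
            - beta * tracking
            - gamma * hinge(temp_hot_c - t_warn_c)
            - d_soc * soc_flag
            - d_f * hinge(abs(freq_dev_hz) - freq_deadband_hz)  # assumed form
            - d_v * eps_v
            - d_q * backlog_kw / p_flex_max_kw)


# A benign operating point: only the throughput term is active.
NOMINAL = dict(u_thr=0.8, dp_demand_kw=0.0, dp_actual_kw=0.0, committed_mw=30.0,
               temp_hot_c=30.0, soc=0.5, freq_dev_hz=0.0, v_pcc_pu=1.0,
               backlog_kw=0.0)
```

Starting from `NOMINAL`, a 40% tracking error cancels the throughput reward exactly, while a single degree above the warning threshold pushes the tick reward to about -4.2.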
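The RegD tracking error is computed as:

$$e_{\text{track}} = \frac{|\Delta P_{\text{demand}} - \Delta P_{\text{actual}}|}{P_{\text{norm}}}$$

where $\Delta P_{\text{demand}}$, $\Delta P_{\text{actual}}$, and $P_{\text{norm}}$ are as defined above.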
| Agent | Typical range | Notes |
|---|---|---|
| Random policy | −15,000 to −5,000 | Frequent thermal & voltage trips |
| Rule-based (threshold control) | −2,000 to +500 | No backlog awareness |
| PPO (trained, 5 M steps) | +2,000 to +5,000 | Learns lever hierarchy |
| Adversarial scenario C | −5,000 to −1,000 | High ambient temp + price spike |
Termination (episode ends immediately):
- Thermal fault: $T_A > 35\,°\text{C}$ or $T_B > 35\,°\text{C}$
- Frequency fault: $|f - f_{\text{nom}}| > 0.5$ Hz (UFLS / over-frequency trip)
- Voltage fault: $v_{\text{pcc}} < 0.90$ pu (under-voltage relay)
Episode truncates at 17,280 ticks (24 hours at 5 s).
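The termination and truncation rules above reduce to a few comparisons. The helper below is a hypothetical sketch; the order in which the faults are checked and the returned labels are assumptions made for illustration.

```python
MAX_TICKS = 24 * 3600 // 5  # 24 hours at 5 s per tick


def termination_reason(temp_a_c, temp_b_c, freq_hz, v_pcc_pu, f_nom_hz=60.0):
    """Return the first hard-constraint violation, or None if the step is safe."""
    if temp_a_c > 35.0 or temp_b_c > 35.0:
        return "thermal_fault"
    if abs(freq_hz - f_nom_hz) > 0.5:
        return "frequency_fault"
    if v_pcc_pu < 0.90:
        return "voltage_fault"
    return None


def is_truncated(tick):
    """True once the episode has run its full 24-hour horizon."""
    return tick >= MAX_TICKS
```

Keeping termination (a fault) separate from truncation (the time limit) matches the Gymnasium convention of reporting the two as distinct flags.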
C2G-Bench exposes exactly two Gymnasium environments, `C2GFastEnv` and `C2GMacroEnv`, both registered under `gym.make()`. Everything below is not an environment: the six physics engines are internal simulation components with no `reset()`/`step()` or `observation_space`/`action_space` API. They are called exclusively by the two environments and are never exposed to an RL agent directly. If you want to interact with a physics engine in isolation (e.g. for unit testing or analysis), instantiate it directly from `c2g_env.physics.*`.
Six independent physics/data modules, all with exact-exponential or analytical solutions (unconditionally stable):
| Simulator | File | Description |
|---|---|---|
| Workload Orchestrator | `workload.py` | Fuses Alibaba batch (2023), DLRM (2025), and GenAI (2026) traces into P_base + P_flex at 5-min resolution. FIFO queue model: unserved batch work defers rather than drops; exposes `backlog_kw` and `avg_delay_steps` (Little's Law) per step |
| Thermal Twin | `thermal.py` | Exact exponential ODE integration for dual-zone cooling (Zone A: HPE Cray EX liquid, Zone B: HPE ProLiant air) |
| Electrical Chain | `electrical.py` | Non-linear UPS/PDU/XFMR loss curves + PUE calculation |
| BESS | `bess.py` | 15 MWh / 5 MW Li-ion NMC (pure-Python backend + optional PySAM) with C-rate η, SOC derating, capacity fade |
| Macro-Grid | `macro_grid.py` | AR(1) RegD signal + LMP proxy; calibrated for 6 global markets |
| Weather | `weather.py` | NOAA ISD-Lite real data or calibrated synthetic (6 climate profiles) |
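To show what "exact-exponential" and "unconditionally stable" mean in practice, here is a minimal first-order sketch. It assumes a single lumped zone relaxing toward an equilibrium temperature `temp_eq_c` with time constant τ ≈ 12.7 min; the real `thermal.py` models two coupled zones, so treat the function and its parameters as illustrative.

```python
import math

TAU_S = 12.7 * 60.0  # thermal time constant quoted in the text, in seconds


def thermal_step(temp_c, temp_eq_c, dt_s=5.0, tau_s=TAU_S):
    """Exact solution of dT/dt = (T_eq - T) / tau over one step of length dt.

    Because the update is the closed-form solution rather than a finite
    difference, it cannot overshoot or oscillate for any step size.
    """
    decay = math.exp(-dt_s / tau_s)
    return temp_eq_c + (temp_c - temp_eq_c) * decay
```

Two 5-second steps give the same result as one 10-second step, and an arbitrarily large step simply lands on the equilibrium temperature. An explicit Euler update would diverge once `dt_s` exceeds `2 * tau_s`.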
| Dataset | Source | Markets/Zones | Resolution | Files |
|---|---|---|---|---|
| Workload traces | Alibaba cluster traces | batch, DLRM, GenAI, spot | 5-min | 4 CSVs |
| Energy load | EIA, SMARD.de, AEMO | NYISO (11 zones), PJM, CAISO, ERCOT, ENTSO-E DE, AEMO NSW | 5-min (resampled) | 16 CSVs |
| Weather | NOAA ISD-Lite | NYC, DCA, SJC, DFW, FRA, BKT | Hourly | 7 CSVs |
The benchmark fuses three Alibaba production traces to model the IT load of the 250 MW facility. Each trace has a distinct statistical character, hardware zone assignment, and role in the control problem.
| File | Source | Duration | Zone | Role | Controllable? |
|---|---|---|---|---|---|
| `batch_v2023.csv` | Alibaba GPU v2023 (`openb_pod_list_default.csv`) | 33 days | A (GPU liquid-cooled) | P_flex: schedulable batch jobs | ✅ DVFS throttle `action[0]` defers work into FIFO queue |
| `dlrm_v2025.csv` | Alibaba GPU v2025 (`disaggregated_DLRM_trace.csv`) | 30 days | B (CPU air-cooled) | P_base: rigid DLRM inference serving | ❌ Must be served regardless of grid state |
| `genai_v2026.csv` | Alibaba v2026 GenAI (`qps.csv` and `pod_gpu_duty_cycle_anon.csv`) | 1 day (tiled) | A (GPU liquid-cooled) | P_base: rigid GenAI inference, spike-prone | ❌ Must be served; spikes set `obs[9]=1` |
`spot_v2026.csv` is bundled but excluded from the current release: it requires an arrival-based preemptible scheduler not yet implemented.
To reproduce these processed CSVs from the raw Alibaba data, see `preprocessing/workload_traces/`. All three files are loaded at startup by `c2g_env.physics.workload.WorkloadSimulator`.
All three utilisation signals are translated to rack-level electrical power via the non-linear server power model (Fan et al., ISCA 2007):
| Stream | Racks | Idle power | Peak power | Exponent | Utilisation normaliser |
|---|---|---|---|---|---|
| Batch (Zone A flex) | 1 200 | 8 kW/rack | 25 kW/rack | 1.4 (GPU superlinear) | gpu_milli_request / 12 620 |
| GenAI (Zone A base) | 800 | 8 kW/rack | 25 kW/rack | 1.4 | avg_gpu_duty_cycle / 100 |
| DLRM (Zone B base) | 2 500 | 4 kW/rack | 16 kW/rack | 1.2 (CPU inference) | active_gpu_count / 227 |
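The mapping from utilisation to power can be sketched as below. The functional form, idle power plus a power-law share of the dynamic range, is an assumption: it is one common reading of a Fan-style non-linear model, and it was chosen here because it reproduces the peak figures in this section (30 MW for batch, 40 MW for DLRM, and about 8.3 MW for GenAI at its 24.4% maximum duty cycle). The actual `workload.py` may differ.

```python
def rack_power_kw(u, p_idle_kw, p_max_kw, exponent):
    """Assumed form: P(u) = P_idle + (P_max - P_idle) * u ** exponent."""
    u = min(max(u, 0.0), 1.0)  # utilisation is clamped to [0, 1]
    return p_idle_kw + (p_max_kw - p_idle_kw) * u ** exponent


# (racks, idle kW/rack, peak kW/rack, exponent), taken from the table above.
STREAMS = {
    "batch": (1200, 8.0, 25.0, 1.4),
    "genai": (800, 8.0, 25.0, 1.4),
    "dlrm": (2500, 4.0, 16.0, 1.2),
}


def stream_power_mw(name, u):
    """Total stream power in MW at normalised utilisation u."""
    racks, p_idle, p_max, exponent = STREAMS[name]
    return racks * rack_power_kw(u, p_idle, p_max, exponent) / 1000.0
```

Because idle power is a large share of each rack's peak, even a fully throttled batch stream still draws 1,200 × 8 kW = 9.6 MW under this model.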
Resulting power envelope (30-day mean at default scenario):
| Stream | Mean power | Max power | Share of total IT |
|---|---|---|---|
| DLRM P_base (Zone B) | ~21.7 MW | 40.0 MW | 56% |
| Batch P_flex (Zone A) | ~10.1 MW | 30.0 MW | 26% |
| GenAI P_base (Zone A) | ~6.8 MW | 8.3 MW | 18% |
| Total IT | ~38.5 MW | — | 100% |
74% of IT power is rigid (P_base); the agent's primary controllable lever is batch throttling, which covers only the remaining 26%.
batch_v2023.csv — Schedulable Batch (P_flex)
- Column: `gpu_milli_request` (sum of GPU milli-cores requested per 5-min tick)
- Statistics: 78% of ticks have zero arrivals; mean utilisation ≈ 0.043; max = 12 620 gpu-milli
- Nature: Highly bursty. Jobs arrive sporadically with durations from 1–2 825 ticks (5 min to 9.8 days). Unserved work accumulates in a FIFO queue (tracked as `backlog_kw` and `avg_delay_steps` via Little's Law).
- Agent implication: DVFS throttle (`action[0]`) directly gates the batch service rate. Throttling below 1.0 reduces thermal load and peak grid draw at the cost of growing backlog. Reward term 1 (throughput) penalises low `action[0]`.
Utilisation distributions: batch is 78% zero (bursty); DLRM is near-Gaussian (always-on); GenAI is multimodal low-duty.
dlrm_v2025.csv — DLRM Inference (P_base, Zone B)
- Columns: `active_gpu_count`, `active_cpu_cores`, `active_mem_gib`
- Statistics: Always non-zero (min=1 GPU); mean ≈ 101 GPUs; near-Gaussian distribution with a clear two-shift diurnal pattern.
- Nature: Continuous, predictable. DLRM (Deep Learning Recommendation Model) serving is the backbone of Zone B: it never drops below idle power. The 30-day trace captures weekday/weekend cycling clearly.
- Agent implication: Contributes the largest fixed baseload (~21.7 MW). The only thermal handle for Zone B is HVAC effort (`action[2]`); DLRM itself cannot be throttled.
genai_v2026.csv — GenAI Serving (P_base, Zone A)
- Columns: `total_qps`, `avg_gpu_duty_cycle`, `active_genai_pods`
- Statistics: 288 ticks (1 day) tiled cyclically; duty cycle mean ≈ 6.7%, max 24.4%; spike rate ≈ 25%
- Nature: Multimodal: most time near-idle, with sharp afternoon QPS bursts. Ticks where `avg_gpu_duty_cycle > P75 = 12.19%` are flagged as spikes (`obs[9] = 1`). GenAI runs on the same Zone A GPU racks as batch but with strict SLA priority.
- Agent implication: Spikes increase Zone A temperature rapidly (liquid cooling response time τ ≈ 13 min). The safety shield terminates episodes if `T_A > 35°C`. During spikes the agent must reduce batch load (`action[0]`) and possibly increase pump speed (`action[1]`) to prevent thermal fault.
Left: GenAI duty cycle with spike threshold (red dashes) and spike ticks (red dots). Right: spike probability peaks in afternoon hours.
Stacked IT power (MW) over 30 days and 1-week zoom. The DLRM base (orange) dominates; batch flex (blue) provides the agent's only demand-side handle.
ACF up to 24 hours. DLRM is highly persistent (slow decay with 24-hour periodicity). Batch decorrelates fastest — it is the hardest to predict. GenAI reveals its 1-day tile boundary.
The DLRM trace has the highest autocorrelation (predictable → MPC/rule-based works well for Zone B). Batch is the most volatile (decorrelates within ~2 hours), making it the prime target for RL.
Simulated backlog over 7 days at three throttle levels. At 50% throttle the queue stabilises near zero — the mean arrival rate is well within half-capacity. A completely off agent (throttle=0.3) accumulates ~10e3 MW equivalent backlog in 7 days.
This reveals a key benchmark insight: the batch queue is stable under mild throttle (≥ 40%) because the mean arrival rate (10.1 MW) is only 34% of full capacity (30 MW). The agent does not need to fully commit compute to clear the queue; it has real headroom to throttle for grid regulation.
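The stability argument above can be reproduced with a toy FIFO queue. The sketch assumes a constant 10.1 MW arrival rate and the 30 MW full capacity quoted in this paragraph; the real trace is bursty, so the absolute backlog numbers will not match the simulated figure. The function name and return values are hypothetical.

```python
def simulate_backlog(arrivals_mw, throttle, p_flex_max_mw=30.0):
    """Run a FIFO queue whose service rate is throttle * p_flex_max_mw.

    Returns (final_backlog, avg_delay_steps), where the delay comes from
    Little's Law: W = L / lambda, with L the mean backlog and lambda the
    mean served throughput per step.
    """
    capacity = throttle * p_flex_max_mw
    backlog, backlog_sum, served_sum = 0.0, 0.0, 0.0
    for arrival in arrivals_mw:
        backlog += arrival                 # new work joins the queue
        served = min(backlog, capacity)    # serve up to the throttled capacity
        backlog -= served
        backlog_sum += backlog
        served_sum += served
    steps = len(arrivals_mw)
    mean_backlog = backlog_sum / steps
    mean_throughput = served_sum / steps
    delay = mean_backlog / mean_throughput if mean_throughput > 0 else float("inf")
    return backlog, delay


WEEK_TICKS = 7 * 288                      # 7 days of 5-minute ticks
week_of_arrivals = [10.1] * WEEK_TICKS    # constant-rate simplification
```

At `throttle=0.5` the capacity (15 MW) exceeds the arrival rate and the queue never builds. At `throttle=0.3` the capacity (9 MW) falls short by 1.1 MW per tick and the backlog grows linearly for the whole week.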
See notebooks/11_workload_deep_dive.ipynb for full interactive analysis.
| Market Key | Region | Grid Operator | Energy Source | Weather Station |
|---|---|---|---|---|
| `nyiso_nyc` | New York City | NYISO | NYISO OASIS | NYC (Central Park) |
| `pjm_dom` | Northern Virginia | PJM | EIA API | DCA (Reagan Natl) |
| `caiso_pgae` | Bay Area / San Jose | CAISO | EIA API | SJC (Mineta Intl) |
| `ercot_north` | Dallas–Fort Worth | ERCOT | EIA API | DFW (DFW Intl) |
| `entso_de` | Frankfurt, Germany | ENTSO-E / EPEX | SMARD.de | FRA (Frankfurt) |
| `aemo_nsw` | Sydney, Australia | AEMO / NEM | AEMO CSVs | BKT (Bankstown) |
C2G-Bench ships four progressively harder 24-hour scenarios (17,280 ticks at 5 s each). Every scenario is fully deterministic when a fixed seed is set and can be combined with any of the six energy markets via a single Hydra override.

```bash
# Run any scenario × any market
uv run python baselines/train_ppo.py scenario=scenario_b market=ercot_north
```

All scenarios share the same underlying simulator stack and reward weights:
| Parameter | Value | Meaning |
|---|---|---|
| Episode length | 17,280 ticks | 24 h × 3,600 s h⁻¹ ÷ 5 s tick⁻¹ |
| IT capacity | 250 MW | Rigid (GenAI/DLRM) + flexible (Alibaba batch) |
| BESS | 15 MWh / 5 MW | NMC Li-ion, C-rate derating + capacity fade |
| Cooling zones | Zone A (liquid, HPE Cray EX) · Zone B (air, HPE ProLiant) | |
| Thermal hard limit | 35 °C | Silicon hard limit → immediate termination |
| Thermal warning ($T_{\text{warn}}$) | 33 °C | Soft threshold → thermal penalty begins |
| Frequency UFLS | ±0.5 Hz | Under/over-frequency relay → termination |
| Voltage UV relay | 0.90 pu | Under-voltage → termination |
"Can the agent learn to coordinate four physical levers under normal grid conditions?"
The entry-level scenario. Ambient temperature is comfortable (25 °C, NYISO NYC summer), BESS starts at 50 % SOC, and the regulation signal has standard amplitude. No faults are injected. This is the recommended starting point for algorithm development and ablation studies.
| Parameter | Value |
|---|---|
| Market | NYISO NYC |
| Ambient | 25 °C (weather-driven) |
| Committed MW (max) | 30 MW |
| BESS SOC₀ | 50 % |
| GenAI spike scale | 1.0× (nominal) |
| Grid stress scale | 1.0× (nominal) |
| Cooling fault | None |
Primary challenge: Learning the basic DVFS ↔ cooling ↔ BESS synergy to track the regulation signal while keeping temperatures below the thermal limits.
Termination risk: Low. An untrained random agent survives ≈ 40 % of the episode on average.
"A viral model launch + a grid under-frequency event hit simultaneously. The agent must shed flexible load without starving the BESS."
This scenario models a Northern Virginia (PJM DOM) summer day when a new GPT-class model goes viral. GenAI serving load spikes to 1.8× nominal, consuming headroom that the agent would otherwise use for regulation. At the same time, the grid issues a sustained under-frequency signal, demanding active discharge. The agent must resolve the conflict between IT throughput and grid support.
| Parameter | Value |
|---|---|
| Market | PJM DOM |
| Ambient | 30 °C (static) |
| Committed MW (max) | 40 MW |
| BESS SOC₀ | 55 % |
| GenAI spike scale | 1.8× |
| Grid stress scale | 1.5× |
| Cooling fault | None |
Primary challenge: IT vs. grid conflict. The GenAI rigid load is non-throttleable, so the agent must use BESS discharge and batch-job throttling simultaneously, but throttling reduces the throughput reward (term 1).
Termination risk: Medium–High. Frequency faults are likely if the agent ignores the regulation signal. Thermal faults are possible if cooling is under-prioritised during spikes.
"Dallas in August: 40 °C ambient, a 30 MW commitment, and a cooling system pushed to its physical limits."
This scenario targets ERCOT North (DFW) during a peak-summer heat wave. The 40 °C ambient temperature drives the cooling COP down by ≈ 30 %, meaning the pump must work harder to achieve the same heat rejection. The committed MW is raised to 30 MW, increasing the power swings the agent must track. GenAI load is nominal, but the thermal margin to the 35 °C hard limit is tight.
| Parameter | Value |
|---|---|
| Market | ERCOT North |
| Ambient | 40 °C (static) |
| Committed MW (max) | 60 MW |
| BESS SOC₀ | 60 % |
| GenAI spike scale | 1.0× (nominal) |
| Grid stress scale | 1.3× |
| Cooling fault | None |
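Primary challenge: Thermal constraint binding. The thermal penalty ($\gamma = 5.0$ per °C above the 33 °C warning threshold) dominates the reward as soon as the hotter zone crosses that threshold.
Termination risk: Very High. A naive agent that ignores the pump lever will hit the 35 °C thermal trip and end the episode early.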
"Western Sydney summer: the BESS starts nearly empty, the pump is failing, and the grid is stressed."
This scenario represents a compounding failure in AEMO NSW. The BESS begins at only 15 % SOC (near the 10 % hard floor), leaving almost no discharge capacity for regulation. A simulated CDU pump degradation reduces cooling efficiency to 60 % of nominal, tightening the thermal margin. GenAI and grid stress are both elevated. The agent must simultaneously ration the BESS, compensate for degraded cooling, and track the regulation signal — with essentially no buffer.
| Parameter | Value |
|---|---|
| Market | AEMO NSW |
| Ambient | 32 °C (static) |
| Committed MW (max) | 40 MW |
| BESS SOC₀ | 15 % |
| GenAI spike scale | 1.2× |
| Grid stress scale | 1.2× |
| Cooling fault | Pump degradation (60 % efficiency) |
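Primary challenge: Resource scarcity under compound failure. The BESS SOC penalty applies below 12 % SOC, so a 15 % starting charge leaves only a narrow band of usable discharge before it activates.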
Termination risk: Extreme. This is the hardest scenario in the benchmark. A random agent terminates within ≈ 5 % of the episode on average.
All four scenarios can be combined with all six markets, yielding 24 distinct evaluation configurations. Market selection changes the LMP profile, weather driver, and grid-stress statistics, while scenario selection changes the hardware stress and initial conditions:
| | `nyiso_nyc` | `pjm_dom` | `caiso_pgae` | `ercot_north` | `entso_de` | `aemo_nsw` |
|---|---|---|---|---|---|---|
| `default` | ★ default | | | | | |
| `scenario_a` | | ★ default | | | | |
| `scenario_b` | | | | ★ default | | |
| `scenario_c` | | | | | | ★ default |
★ = default market for that scenario. Any other cell is a valid cross-market stress test.
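The full 4 × 6 grid can be enumerated programmatically. The sketch below only builds command strings from the scenario and market keys listed in this section, following the override pattern used in the examples; the helper name is hypothetical and nothing is executed.

```python
from itertools import product

SCENARIOS = ["default", "scenario_a", "scenario_b", "scenario_c"]
MARKETS = ["nyiso_nyc", "pjm_dom", "caiso_pgae",
           "ercot_north", "entso_de", "aemo_nsw"]


def sweep_commands(seed=1):
    """Build one training command per (scenario, market) pair."""
    return [
        f"uv run python baselines/train_ppo.py "
        f"scenario={scenario} market={market} experiment.seed={seed}"
        for scenario, market in product(SCENARIOS, MARKETS)
    ]


commands = sweep_commands()
```

Printing `commands` gives a ready-made job list for a shell loop or a cluster scheduler, one line per evaluation configuration.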

```bash
# Example: Thermal Squeeze under European low-carbon prices
uv run python baselines/train_ppo.py scenario=scenario_b market=entso_de experiment.seed=1
```

C2G-Macro/
├── pyproject.toml # uv/hatchling build + all dependencies
├── uv.lock # Reproducible dependency lock
├── README.md
│
├── c2g_env/ # The Core RL Environment
│ ├── __init__.py # Exports C2GFastEnv, C2GMacroEnv
│ ├── env_low_level.py # 5 s physics step — C2GFastEnv (18-D obs, 4-D act)
│ ├── env_high_level.py # 15-min market step — C2GMacroEnv (19-D obs, 2-D act)
│ ├── ENVIRONMENTS.md # 📖 Full environment & simulator reference (equations, params)
│ ├── config.yaml # Centralised env configuration
│ ├── experiments/
│ │ ├── __init__.py # Exports ActionAblationFastEnv
│ │ └── action_ablation_env.py # C2GFastEnv subclass for action-level ablation studies
│ └── physics/
│ ├── workload.py # Alibaba trace fusion (batch/DLRM/GenAI)
│ ├── thermal.py # Exact-exponential ODEs, dual-zone cooling
│ ├── electrical.py # Non-linear UPS/PDU/XFMR loss + PUE
│ ├── bess.py # 15 MWh NMC BESS (pure-Python + PySAM)
│ ├── macro_grid.py # AR(1) RegD + LMP proxy, 6 market presets
│ └── weather.py # NOAA ISD real data + synthetic climate, 6 presets
│
├── data/
│ └── processed/
│ ├── workload_traces/ # batch_v2023, dlrm_v2025, genai_v2026, spot_v2026
│ ├── energy/ # 16 CSVs: 11 NYISO zones + PJM/CAISO/ERCOT/ENTSO-E/AEMO
│ └── weather/ # 7 station CSVs: NYC, DCA, SJC, DFW, FRA, BKT, LONGIL + merged
│
├── conf/ # Hydra configuration tree
│ ├── config.yaml # Top-level defaults (scenario, algo, market, logging)
│ ├── algo/ # 19 algo configs: ppo, sac, ppo_macro, sac_macro, cpo,
│ │ # ppo_lagrangian, cbf_ppo, hj_ppo, mpcsf_ppo, ha_c2g,
│ │ # cbm_only, cbm_gate, cbm_shield, rule_macro_ppo, pid,
│ │ # mpc_fast, mpc_macro, milp, shield_reward_shaping
│ ├── scenario/ # default, scenario_a, scenario_b, scenario_c
│ ├── market/ # nyiso_nyc, pjm_dom, caiso_pgae, ercot_north, entso_de, aemo_nsw
│ └── logging/ # tensorboard.yaml
│
├── baselines/ # NeurIPS Evaluation Agents
│ ├── _hydra_compat.py # Hydra 1.3.x compatibility patch for Python ≥ 3.14
│ ├── metrics_callback.py # C2GMetricsCallback — per-episode CSV + TensorBoard
│ │
│ │ # ── Classical Controllers ───────────────────────────────────────────
│ ├── rule_based_mpc.py # Threshold controller for C2GFastEnv (SB3-compatible)
│ ├── rule_based_macro.py # Macro-level rule-based controller for C2GMacroEnv
│ ├── bang_bang.py # Bang-bang / hysteresis controller (floor baseline)
│ ├── pid_controller.py # Multi-loop PID controller with anti-windup
│ │
│ │ # ── RL Training Scripts ─────────────────────────────────────────────
│ ├── train_sac.py # SB3 SAC (off-policy, auto entropy)
│ ├── train_hierarchical.py # Two-phase sequential HRL pipeline (PPO inner)
│ ├── train_hierarchical_sac.py # Two-phase HRL with SAC inner policy
│ ├── train_rule_macro_sac.py # Rule-based macro + SAC inner policy
│ ├── train_lowsac_highrandom.py # SAC lower + random macro (ablation)
│ ├── train_llm_agents.py # LLM-guided agent training
│ │
│ └── safety/ # HA safety methods + shielded training scripts (see §11)
│
├── evaluation/ # Benchmark auditing & analysis
│ ├── run_benchmark.py # Standard benchmark: runs agents on all 4 scenarios
│ │ # Outputs: CSV with cumulative power metrics at
│ │ # evaluation/results/{algo}_{scenario}_{agent_type}_{ablation}.csv
│ ├── run_ha_benchmark.py # HA safety benchmark: 11-metric evaluation set
│ │ # Same cumulative power metrics as run_benchmark.py
│ ├── generate_plots.py # Publication-ready PDF/PNG figures
│ ├── generate_ha_plots.py # HA-specific: Pareto frontier, radar, violin plots
│ ├── plot_episode_traces.py # Per-episode trace analysis with ablation filtering
│ ├── failure_analysis.py # Failure-case categorisation for HA benchmark
│ └── statistical_analysis.py # Bootstrap CIs, Welch's t-test, Cohen's d, LaTeX tables
│
├── scripts/ # Data download & training utilities
│ ├── download_weather.py # Open-Meteo ERA5 → 6 weather CSVs
│ ├── download_energy.py # EIA + SMARD + AEMO → 5 energy CSVs
│ └── run_sweep.sh # Full training sweep (25 phases, ~270 jobs)
│
├── preprocessing/ # Raw → processed data pipelines
│ ├── workload_traces/ # process_v2023.py, process_v2025.py, process_v2026_genai.py
│ ├── energy/ # process_energy.py (NYISO zone load)
│ └── weather/ # download_noaa_isd.py
│
├── notebooks/ # 11 Jupyter notebooks for exploration & visualisation
│ ├── 01_workload.ipynb # Alibaba trace analysis
│ ├── 02_thermal.ipynb # Thermal model step response & steady-state
│ ├── 03_electrical_bess.ipynb # Electrical chain + BESS cycling
│ ├── 04_macro_grid.ipynb # RegD signal + LMP proxy
│ ├── 05_environments.ipynb # Gym API demo, scenario comparison
│ ├── 06_weather.ipynb # Weather data: 6 markets, real vs. synthetic
│ ├── 07_energy_markets.ipynb # Energy load: 6 markets, LDC, diurnal patterns
│ ├── 08_frequency_voltage.ipynb # Grid frequency & PCC voltage safety signals
│ ├── 09_evaluation_scenarios.ipynb # Scenario deep dive: params, rollouts, risk, reward
│ ├── 10_baselines_visualization.ipynb # Baseline agent comparison & visualisation
│ └── 11_workload_deep_dive.ipynb # Workload queue dynamics & trace statistics
│
├── tests/ # 531 tests (pytest)
│ ├── test_workload.py # 24 tests
│ ├── test_thermal.py # 32 tests
│ ├── test_electrical.py # 27 tests
│ ├── test_macro_grid.py # 30 tests
│ ├── test_weather.py # 23 tests
│ ├── test_gym_api.py # 72 tests (API compliance both envs)
│ ├── test_baselines.py # 18 tests
│ ├── test_new_baselines.py # 50 tests (classical + gradient-free baselines)
│ ├── test_frequency_voltage.py # 31 tests (freq/voltage safety signals)
│ ├── test_hierarchical.py # 22 tests (HRL, macro agents)
│ ├── test_safety_shield.py # 24 tests (Simplex shield, wrappers)
│ ├── test_ha_safety.py # 70 tests (3-tier HA safety methods)
│ ├── test_critical_bug_fixes.py # 50 tests (regression tests)
│ ├── test_ablation.py # 18 tests (action ablation env)
│ ├── test_readme_smoke.py # 13 tests (README code snippet validation)
│ ├── test_datalogging.py # 7 tests (transition logging schema + 5 smoke tests)
│ # Hardware vs macro column validation
│ # 5 CLI smoke tests: rule_macro, rule_based,
│ # rule_based+BESS_ablation, ha_rule_based (variants)
│
└── figures/ # Root-level figures (TensorBoard screenshot, etc.)
- Python 3.11 (exact; `==3.11.*` in `pyproject.toml`)
- uv, a fast Python package manager
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```bash
git clone <repo-url>
cd C2G-Macro
uv sync
uv sync --extra dev  # pytest, ruff, mypy
```

```bash
uv run pytest tests/ -q
# 531 passed
```

```bash
# PPO: default scenario, 300k steps
uv run python baselines/train_ppo.py
# PPO: GenAI Crisis + PJM market
uv run python baselines/train_ppo.py scenario=scenario_a market=pjm_dom
# SAC: Thermal Squeeze
uv run python baselines/train_sac.py algo=sac scenario=scenario_b
# Hydra multirun: all scenarios × 3 seeds
uv run python baselines/train_ppo.py --multirun \
    scenario=default,scenario_a,scenario_b,scenario_c \
    experiment.seed=1,2,3
# Hierarchical RL: sequential two-phase pipeline
uv run python baselines/train_hierarchical.py
# Safety-shielded PPO (provable constraint satisfaction)
uv run python baselines/safety/train_shielded_ppo.py scenario=default
# Constrained RL: PPO-Lagrangian
uv run python baselines/safety/train_ppo_lagrangian.py scenario=default
# CPO: Constrained Policy Optimization
uv run python baselines/safety/train_cpo.py scenario=default
# CBF-shielded PPO (QP-based action projection)
uv run python baselines/safety/train_cbf_ppo.py scenario=default
# Full HA-C2G neuro-symbolic 3-layer architecture
uv run python baselines/safety/train_ha_c2g.py scenario=default
```

```bash
# Dry-run first: prints all 48 jobs without executing anything
bash scripts/run_sweep.sh --dry-run
# Full sweep (default: 4 parallel jobs):
bash scripts/run_sweep.sh
# Use more parallelism (208 cores available; 16 is safe):
MAX_PARALLEL=16 bash scripts/run_sweep.sh
```

The sweep runs in 25 phases:
| Phase | Jobs | What runs |
|---|---|---|
| 1 | 24 | Rule-Based + Random evaluation only (no training, ~5 min) |
| 2 | 12 | PPO training (300k steps) + evaluation |
| 3 | 12 | SAC training (200k steps) + evaluation |
| 4 | 12 | Macro Rule-Based evaluation |
| 5 | 12 | PPO-Macro training (100k steps) + evaluation |
| 6 | 12 | HRL sequential training (300k + 100k) + evaluation |
| 7 | 36 | Bang-Bang, PID, MPC evaluation (no training) |
| 8 | 24 | MPC-Macro & MILP evaluation (no training) |
| 9 | 12 | PPO-Lagrangian training (300k) + evaluation |
| 10 | 12 | CBF-PPO training (300k) + evaluation |
| 11 | 12 | HJ-PPO training (300k) + evaluation |
| 12 | 12 | MPC-SF-PPO training (300k) + evaluation |
| 13 | 12 | CPO training (300k) + evaluation |
| 14 | 12 | Shield-Reward-Shaping training (300k) + evaluation |
| 15 | 12 | HA-C2G neuro-symbolic training (300k) + evaluation |
| 16 | 12 | CBM-Only ablation training (300k) |
| 19 | 12 | CBM+Gate ablation training (300k) |
| 20 | 12 | CBM+Shield ablation training (300k) |
| 21 | 1 | HA Benchmark evaluation (11 metrics, 5 episodes) |
| 22 | 1 | Summary table + LaTeX rows |
| 23 | 1 | Multi-seed HA benchmark (10 seeds × 5 episodes) |
| 24 | 1 | Statistical analysis (CIs + significance tests) |
| 25 | 1 | Failure-case analysis |
Results are written to results/sweep_results.csv (one row per run, upserted on re-runs) and results/sweep_summary.csv (mean ± std across seeds).
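To illustrate the mean ± std aggregation across seeds, here is a minimal sketch using only the standard library. The column names (`algo`, `scenario`, `seed`, `mean_reward`) and the toy rows are assumptions made for this example, not the actual schema of `results/sweep_results.csv`, and the use of the sample standard deviation is likewise an assumption.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Toy rows standing in for results/sweep_results.csv.
# Column names and values are illustrative assumptions only.
SAMPLE = """algo,scenario,seed,mean_reward
ppo,default,1,1.0
ppo,default,2,2.0
ppo,default,3,3.0
sac,default,1,4.0
sac,default,2,6.0
"""


def summarise(rows, metric="mean_reward"):
    """Group rows by (algo, scenario) and return (mean, sample std) per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["algo"], row["scenario"])].append(float(row[metric]))
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in groups.items()
    }


summary = summarise(csv.DictReader(io.StringIO(SAMPLE)))

if __name__ == "__main__":
    for (algo, scenario), (m, s) in summary.items():
        print(f"{algo:>4} {scenario:<10} {m:.2f} ± {s:.2f}")
```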
Use the evaluation runners when you want targeted experiments instead of the full sweep.
The --fixed-action setting allows granular control experiments by pinning selected actuators to analyst-chosen setpoints.
Unless an `--output` path is provided, results are saved by default to:
`evaluation/results/{algo}_{scenario}_{agent_type}_{ablation}.csv`
For example, `ppo_scenario_b_hardware_BESS_0.5.csv` stores evaluation results for the hardware PPO agent with a fixed BESS ablation. Here `agent_type` denotes the transition-logging and output suffix for the evaluated controller, and can be `hardware`, `macro`, or `hardware_ha`.
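The naming template can be reproduced with a few lines of Python. This is a sketch of the pattern only: `default_output_path` is a hypothetical helper, and how the runner names files when no ablation is active is not covered here.

```python
from pathlib import PurePosixPath

# Agent types named in this README.
AGENT_TYPES = {"hardware", "macro", "hardware_ha"}


def default_output_path(algo: str, scenario: str, agent_type: str, ablation: str) -> PurePosixPath:
    """Hypothetical helper mirroring {algo}_{scenario}_{agent_type}_{ablation}.csv."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unknown agent_type: {agent_type!r}")
    name = f"{algo}_{scenario}_{agent_type}_{ablation}.csv"
    return PurePosixPath("evaluation/results") / name


path = default_output_path("ppo", "scenario_b", "hardware", "BESS_0.5")

if __name__ == "__main__":
    print(path)
```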
`evaluation/run_benchmark.py` runs any combination of agents across all four evaluation scenarios and writes per-episode metrics to CSV.

```bash
# Classical hardware controllers (no trained models needed)
uv run evaluation/run_benchmark.py --agents rule_based bang_bang pid random
# SAC low-level agent (requires a trained model)
uv run evaluation/run_benchmark.py --agents sac --scenarios default scenario_b \
    --hw-model-dir trained_models/sac_default_s100
# Hierarchical combos: rule-based macro + hardware controller
uv run evaluation/run_benchmark.py --agents rule_macro+sac rule_macro+rule_based \
    rule_macro+pid rule_macro+bang_bang rule_macro+random \
    --hw-model-dir trained_models/sac_default_s100
# RL macro (Phase 2) + frozen SAC low-level
uv run evaluation/run_benchmark.py --agents sac_macro+sac \
    --macro-model-dir trained_models/sac_macro_default_s100 \
    --hw-model-dir trained_models/sac_default_s100
# LLM macro + hardware controller (requires a running vLLM server)
uv run evaluation/run_benchmark.py --agents llm_policy_macro+sac \
    --hw-model-dir trained_models/sac_default_s100 \
    --llm-api-base http://localhost:8000/v1
# With transition logging (per-step CSV traces)
uv run evaluation/run_benchmark.py --agents rule_macro+sac --record_transitions \
    --hw-model-dir trained_models/sac_default_s100
```

SAC agents automatically load the model from `trained_models/<algo>_<scenario>_s<seed>/final_model.zip`. Use `--hw-model-dir` or `--macro-model-dir` to override.
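As a rough illustration of the model lookup convention, the sketch below resolves a model path from the algorithm, scenario, and seed, with an optional directory override. `resolve_model_path` is a hypothetical helper, and the assumption that an override directory also contains `final_model.zip` is ours, not something the runner documents.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_model_path(algo: str, scenario: str, seed: int,
                       override_dir: Optional[str] = None) -> PurePosixPath:
    """Hypothetical helper: default path unless an override directory is given."""
    if override_dir is not None:
        # Assumption: the override directory holds a final_model.zip as well.
        return PurePosixPath(override_dir) / "final_model.zip"
    return PurePosixPath("trained_models") / f"{algo}_{scenario}_s{seed}" / "final_model.zip"


default_path = resolve_model_path("sac", "default", 100)
custom_path = resolve_model_path("sac", "scenario_b", 100, override_dir="my_models/run7")

if __name__ == "__main__":
    print(default_path)
    print(custom_path)
```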
Agents used in the paper:
| Agent | Type | Description |
|---|---|---|
| `random` | hardware | Uniform random baseline (lower bound) |
| `bang_bang` | hardware | Hysteresis on/off controller |
| `pid` | hardware | Multi-loop PID with anti-windup |
| `rule_based` | hardware | Threshold heuristic controller (`baselines/rule_based_mpc.py`) |
| `sac` | hardware | Trained SAC low-level controller (Phase 1) |
| `rule_macro` | macro | Rule-based macro bidding controller |
| `sac_macro` | macro | Trained SAC macro controller (Phase 2) |
| `llm_policy_macro` | macro | LLM macro controller (Qwen3-32B, ICRL) |
| `<macro>+<hardware>` | combo | Macro agent paired with hardware agent, e.g. `rule_macro+sac`, `llm_policy_macro+pid` |
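The `<macro>+<hardware>` naming scheme is easy to validate before launching a run. The following sketch is illustrative only: `parse_agent` is a hypothetical helper, and the two name sets simply copy the agents listed in the table.

```python
# Agent names copied from the table; the helper itself is hypothetical.
HARDWARE_AGENTS = {"random", "bang_bang", "pid", "rule_based", "sac"}
MACRO_AGENTS = {"rule_macro", "sac_macro", "llm_policy_macro"}


def parse_agent(name: str):
    """Return (macro, hardware); a part is None when that level is not used."""
    if "+" in name:
        macro, hardware = name.split("+", 1)
        if macro not in MACRO_AGENTS or hardware not in HARDWARE_AGENTS:
            raise ValueError(f"expected <macro>+<hardware>, got {name!r}")
        return macro, hardware
    if name in MACRO_AGENTS:
        return name, None
    if name in HARDWARE_AGENTS:
        return None, name
    raise ValueError(f"unknown agent: {name!r}")


if __name__ == "__main__":
    for agent in ("pid", "sac_macro", "rule_macro+sac"):
        print(agent, "->", parse_agent(agent))
```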
Fixed-action ablations (Appendix M):
Pin actuators to fixed setpoints to isolate each lever's contribution:
```bash
# Disable BESS (throttle + cooling only)
uv run evaluation/run_benchmark.py --agents rule_macro+sac \
    --fixed-action bess_dispatch=0.0 \
    --hw-model-dir trained_models/sac_default_s100
# Disable BESS and fix cooling (throttle only)
uv run evaluation/run_benchmark.py --agents rule_macro+sac \
    --fixed-action bess_dispatch=0.0 \
    --fixed-action pump_speed_A=0.7 \
    --fixed-action hvac_effort=0.7 \
    --hw-model-dir trained_models/sac_default_s100
```

Action bounds: `throttle_batch ∈ [0, 1]`, `pump_speed_A ∈ [0, 1]`, `hvac_effort ∈ [0, 1]`, `bess_dispatch ∈ [-1, 1]`. Values are validated and clipped to these ranges.
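One plausible way to parse repeatable `name=value` overrides and clip them to the action bounds is sketched here with `argparse`. This is not the runner's actual implementation: `parse_fixed_actions` is a hypothetical helper, and treating unknown actuator names as errors is an assumption about what "validated" means.

```python
import argparse

# Bounds as stated in this README.
ACTION_BOUNDS = {
    "throttle_batch": (0.0, 1.0),
    "pump_speed_A": (0.0, 1.0),
    "hvac_effort": (0.0, 1.0),
    "bess_dispatch": (-1.0, 1.0),
}


def parse_fixed_actions(items):
    """Turn ['name=value', ...] into {name: clipped float}."""
    fixed = {}
    for item in items or []:
        name, sep, raw = item.partition("=")
        if not sep or name not in ACTION_BOUNDS:
            raise ValueError(f"expected <name>=<value> with a known actuator, got {item!r}")
        low, high = ACTION_BOUNDS[name]
        fixed[name] = min(max(float(raw), low), high)  # clip into [low, high]
    return fixed


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of the flag into a list.
parser.add_argument("--fixed-action", action="append", default=[], metavar="NAME=VALUE")
args = parser.parse_args(
    ["--fixed-action", "bess_dispatch=-2", "--fixed-action", "pump_speed_A=0.7"]
)
fixed = parse_fixed_actions(args.fixed_action)

if __name__ == "__main__":
    print(fixed)
```

In this sketch an out-of-range value such as `bess_dispatch=-2` is clipped to the nearest bound rather than rejected.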
Key CLI arguments:
| Argument | Default | Description |
|---|---|---|
| `--agents` | `rule_based bang_bang pid random` | One or more agent names (see table above) |
| `--scenarios` | all 4 | Subset of `default scenario_a scenario_b scenario_c` |
| `--n_episodes` | `5` | Episodes per agent × scenario combination |
| `--seed` | `100` | Starting RNG seed; episode i uses seed + i |
| `--hw-model-dir` | `None` | Model directory for the hardware/inner SAC agent |
| `--macro-model-dir` | `None` | Model directory for the macro-level SAC agent |
| `--output` | auto-generated | Output CSV path; defaults to `evaluation/results/<algo>_<scenario>_<agent_type>_<ablation>.csv` |
| `--record_transitions` / `--no-record_transitions` | disabled | Write per-step state/action/reward traces to `runs/<agent>_<scenario>_<type>/episode*.csv` |
| `--append` | `False` | Append rows to an existing CSV instead of overwriting |
| `--fixed-action <name>=<value>` | none | Pin an actuator to a fixed setpoint (repeatable) |
| `--llm-api-base` | `http://localhost:8000/v1` | vLLM / OpenAI-compatible server URL |
| `--llm-template-path` | `conf/chat_templates/run_benchmark_rbc+ICRL.yaml` | YAML prompt templates for LLM agents |
| `--llm-max-new-tokens` | `8192` | Maximum tokens per LLM generation step |
| `--llm-temperature` | `0.0` | Sampling temperature (0 = greedy) |
| `--llm-no-thinking` | off | Disable `<think>` reasoning blocks |
| `--llm-context-num-steps` | `10` | ICRL rolling buffer size in past steps (paper uses 5; 0 = disabled) |
| `--llm-context-stride` | `1` | Store every K-th step in the ICRL buffer |
| `--llm-icrl-mode` | `autonomous` | ICRL instruction mode: `autonomous` (paper default), `preset`, or `exploit` |
Output metrics:
For hardware agents, each row in the output CSV contains:
| Column | Description |
|---|---|
| `mean_reward` | Mean step reward over the episode |
| `total_reward` | Sum of step rewards |
| `tracking_rmse` | RMSE of ΔP_demanded − ΔP_actual (kW) |
| `thermal_viol_rate` | Fraction of ticks with temperature > T_warn (33 °C) |
| `throughput_ratio` | Mean p_flex_served / p_flex_nom |
| `bess_degradation` | Cumulative capacity fade × 10⁴ |
| `episode_length` | Ticks completed (< 17 280 indicates early termination) |
| `survival_rate` | Fraction of episodes surviving to 24 h |
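A few of these columns follow directly from per-step traces. The sketch below shows one way to compute them; it is an illustration under stated assumptions, not the benchmark's code. The function names are hypothetical, the 5 s tick comes from the fast environment's physics step, and which zone temperature is compared against T_warn is left to the caller.

```python
import math

TICK_SECONDS = 5                                  # physics step of the fast environment
FULL_EPISODE_TICKS = 24 * 3600 // TICK_SECONDS    # 24 h at 5 s per tick = 17 280 ticks
T_WARN_C = 33.0


def tracking_rmse(dp_demanded_kw, dp_actual_kw):
    """Root-mean-square of (demanded - actual) power deviation, in kW."""
    sq = [(d - a) ** 2 for d, a in zip(dp_demanded_kw, dp_actual_kw)]
    return math.sqrt(sum(sq) / len(sq))


def thermal_viol_rate(temps_c, t_warn=T_WARN_C):
    """Fraction of ticks whose temperature is strictly above t_warn."""
    return sum(t > t_warn for t in temps_c) / len(temps_c)


def survival_rate(episode_lengths):
    """Fraction of episodes that reached the full 24 h horizon."""
    return sum(n >= FULL_EPISODE_TICKS for n in episode_lengths) / len(episode_lengths)


if __name__ == "__main__":
    print(FULL_EPISODE_TICKS)
    print(tracking_rmse([10.0, 10.0], [7.0, 14.0]))
    print(thermal_viol_rate([30.0, 34.0, 33.0, 36.0]))
    print(survival_rate([17280, 864, 17280, 17280]))
```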
For macro agents, additional columns include:
| Column | Description |
|---|---|
| `bid_acceptance_rate` | Fraction of 15-min bids accepted by the grid |
| `total_reg_revenue` | Cumulative regulation revenue (USD) |
| `mean_perf_score` | Mean FERC performance score |
| `mean_committed_mw` | Mean accepted MW commitment per interval |
For hierarchical combo agents (macro+hardware), results are split into *_macro.csv and *_hardware.csv automatically, with a separate hardware-schema row for the inner controller enabling direct comparison with standalone hardware results.
When --record_transitions is enabled, per-step logs are written under runs/<agent>_<scenario>_<agent_type>/episode*.csv.
```bash
uv run evaluation/run_ha_benchmark.py --agents simplex_ppo cbf_ppo hj_ppo
uv run evaluation/run_ha_benchmark.py --agents ha_c2g --scenarios default scenario_c --n_episodes 5
uv run evaluation/run_ha_benchmark.py --agents cbf_ppo --record_transitions
uv run evaluation/run_ha_benchmark.py --agents cbf_ppo --no-record_transitions
uv run evaluation/run_ha_benchmark.py --fixed-action bess_dispatch=0.0
uv run evaluation/run_ha_benchmark.py \
    --fixed-action hvac_effort=0.9 \
    --fixed-action bess_dispatch=0.0
```

Key options:
- `--agents`: HA agents to evaluate
- `--scenarios`: scenarios to run
- `--n_episodes`: number of episodes per agent/scenario
- `--seed`: starting seed
- `--model_dir`: optional override for trained model directory
- `--output`: output CSV path
- `--record_transitions` / `--no-record_transitions`: enable or disable per-step transition logging
- `--fixed-action action=value`: assign a fixed value to an action

Notes:
- These settings also support granular experiments in high-assurance studies: you can evaluate whether a safety method still works when specific actuators are pinned to fixed operating points.
- The same continuous low-level action ranges apply here: `throttle_batch ∈ [0, 1]`, `pump_speed_A ∈ [0, 1]`, `hvac_effort ∈ [0, 1]`, and `bess_dispatch ∈ [-1, 1]`.
- Fixed-action overrides are applied inside the low-level environment before dynamics are applied.
- When enabled, transition logs are written under `runs/<agent>_<scenario>_ha/episode*.csv`.
After generating transition logs (via `--record_transitions` in the benchmark runners), you can visualize per-step state, action, observation, and reward traces as aggregated statistics (mean ± 99% CI across episodes). The runners write per-step episode CSV files under `runs/<algo>_<scenario>_<agent_type>/` (e.g., `episode0__HVAC_disabled_BESS_0.csv`).

Plot episode statistics:

```bash
# Basic usage (no ablation)
uv run evaluation/plot_episode_traces.py --algoname bang_bang --scenario default --agent-type hardware
# With ablation filters (plots only episodes matching specific disabled/fixed actions)
uv run evaluation/plot_episode_traces.py \
    --algoname bang_bang \
    --fixed-action pump_speed_A=0.25 \
    --scenario default \
    --agent-type macro
```

Outputs:
- JPEG: `figures/<algo>_<scenario>_<agent_type>[__ABLATION_SUFFIX].jpeg`
- PDF: `figures/<algo>_<scenario>_<agent_type>[__ABLATION_SUFFIX].pdf`

Each figure contains one subplot per state/reward column and shows the mean line (solid) with a 99% confidence band (shaded area) computed across all matching episodes.
- State variables (blue), with 0–1 reference bounds shown as dashed lines
- Cumulative reward components (red)
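For intuition on how a per-step mean with a 99% band can be obtained across episodes, here is a minimal standard-library sketch. It uses a normal approximation for the interval, which is an assumption; the plotting script may compute its band differently (for example with a t-distribution or a bootstrap), and with only a handful of episodes a normal band is on the narrow side.

```python
from statistics import NormalDist, mean, stdev


def mean_ci(values, confidence=0.99):
    """Mean and CI half-width (normal approximation) for one time step."""
    n = len(values)
    m = mean(values)
    if n < 2:
        return m, 0.0
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    return m, z * stdev(values) / n ** 0.5


def band(episodes, confidence=0.99):
    """episodes: equal-length per-step series, one list per episode."""
    return [mean_ci(step_values, confidence) for step_values in zip(*episodes)]


if __name__ == "__main__":
    traces = [[1.0, 2.0], [3.0, 2.0]]  # two toy episodes, two steps each
    for step, (m, half) in enumerate(band(traces)):
        print(step, m, half)
```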
```bash
uv run python scripts/download_weather.py --year 2024
uv run python scripts/download_energy.py --year 2024
```

All training scripts log scalar metrics (episode reward, episode length, thermal/tracking/SOC penalties, shield interventions) to TensorBoard. Logs are written to the Hydra output directory under `tensorboard/`.

```bash
# Point TensorBoard at the outputs directory to compare all runs:
uv run tensorboard --logdir outputs/
# Or at a specific run:
uv run tensorboard --logdir outputs/ppo_default/seed_42/2026-04-08_21-00-00/tensorboard/
```

Then open http://localhost:6006 in your browser.

```bash
uv run jupyter lab notebooks/
```

Note: The optional `nrel-pysam` BESS backend requires `uv pip install nrel-pysam`. The environment automatically falls back to the pure-Python `_SimpleBESSModel` if absent.
C2G-Bench provides a comprehensive 3-tier high-assurance (HA) safety framework for grid-interactive data center control. All tiers enforce the same 5 hard constraints (C1–C5) and are evaluated with an 11-metric set (6 standard + 5 HA-specific).
| ID | Constraint | Threshold | Physical Meaning |
|---|---|---|---|
| C1 | Zone A temperature | 35 °C (margin 1 °C) | Server room A thermal limit |
| C2 | Zone B temperature | 35 °C (margin 1 °C) | Server room B thermal limit |
| C3 | SOC ∈ [SOC_min, SOC_max] | [0.10, 0.95] (guard 0.03) | BESS operational envelope |
| C4 | Frequency deviation $\lvert \Delta f \rvert$ | < 0.5 Hz | Grid frequency deviation limit |
| C5 | PCC voltage | 0.92 pu trigger | Under-voltage relay threshold |
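As a rough sketch of what checking C1–C5 against a single plant state could look like, consider the snippet below. The `PlantState` fields and `check_constraints` are hypothetical names, the inequality directions and strictness are assumptions, and the margins (1 °C, SOC guard 0.03) are deliberately left out because how each shield applies them is not specified here.

```python
from dataclasses import dataclass


@dataclass
class PlantState:
    """Hypothetical snapshot of the quantities the five constraints refer to."""
    temp_a_c: float      # zone A temperature, °C
    temp_b_c: float      # zone B temperature, °C
    soc: float           # BESS state of charge, 0..1
    delta_f_hz: float    # grid frequency deviation, Hz
    v_pcc_pu: float      # PCC voltage, per unit


def check_constraints(s, t_max=35.0, soc_min=0.10, soc_max=0.95,
                      df_max=0.5, v_min=0.92):
    """Return {constraint id: True if satisfied} using the table's thresholds."""
    return {
        "C1": s.temp_a_c <= t_max,
        "C2": s.temp_b_c <= t_max,
        "C3": soc_min <= s.soc <= soc_max,
        "C4": abs(s.delta_f_hz) < df_max,
        "C5": s.v_pcc_pu >= v_min,
    }


if __name__ == "__main__":
    nominal = PlantState(30.0, 31.0, 0.5, 0.1, 1.0)
    stressed = PlantState(36.0, 31.0, 0.05, -0.6, 0.90)
    print(check_constraints(nominal))
    print(check_constraints(stressed))
```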
| Method | Shield | Permissiveness | Cost | File |
|---|---|---|---|---|
| Simplex [Sha 2001] | O(1) analytic worst-case bounds | Conservative | Negligible | baselines/safety/safety_shield.py |
| CBF [Ames 2019] | QP projection into barrier-safe set | Moderate | Low | baselines/safety/cbf_shield.py |
| HJ Reachability | Offline BRS + runtime override | Moderate | Offline high, runtime low | baselines/safety/hj_shield.py |
| MPC Safety Filter | Receding-horizon constrained NLP | Most permissive | Highest online | baselines/safety/mpc_safety_filter.py |
```python
from c2g_env import C2GFastEnv
from baselines.safety.safety_shield import SafetyShield, ShieldedAgent, ShieldedEnv

base_env = C2GFastEnv(scenario="default")
obs, _ = base_env.reset(seed=0)
raw_action = base_env.action_space.sample()  # stand-in for any agent's action

# 1. Standalone filter: works with ANY agent
shield = SafetyShield()
safe_action, was_modified, info = shield.filter(raw_action, obs)

# 2. Gymnasium wrapper: agent trains inside the safe manifold
env = ShieldedEnv(C2GFastEnv(scenario="default"))

# 3. SB3-compatible agent wrapper for evaluation
#    (trained_agent is a previously trained SB3 model)
safe_agent = ShieldedAgent(trained_agent, env)
```

```bash
# Simplex-shielded PPO
uv run python baselines/safety/train_shielded_ppo.py scenario=default experiment.seed=42
# CBF-shielded PPO (QP-based, more permissive than Simplex)
uv run python baselines/safety/train_cbf_ppo.py scenario=default
# HJ reachability-shielded PPO (offline BRS computation)
uv run python baselines/safety/train_hj_ppo.py scenario=default
# MPC safety filter PPO (receding-horizon, most permissive)
uv run python baselines/safety/train_mpcsf_ppo.py scenario=default
```

| Method | Mechanism | File |
|---|---|---|
| PPO-Lagrangian | Adaptive Lagrange multipliers for 3 cost types | baselines/safety/train_ppo_lagrangian.py |
| CPO [Achiam 2017] | Trust-region with conjugate gradient + line search | baselines/safety/train_cpo.py |
| Shield Reward Shaping | Fixed quadratic distance-to-boundary penalties | baselines/safety/train_shield_reward_shaping.py |
The full HA-C2G pipeline is a 3-layer neuro-symbolic architecture:
- Layer 1, Concept Bottleneck Model (`baselines/safety/concept_bottleneck.py`): maps raw 17-D obs → ~10 interpretable safety concepts (thermal margins, SOC health, etc.)
- Layer 2, Safe Projection Gate (`baselines/safety/safe_projection.py`): concept-guided differentiable projection that blends policy actions toward safe priors based on learned pass-through gates; applied consistently during training and evaluation for `ha_c2g` and `cbm_gate`
- Layer 3, Physics Rule Shield: in-the-loop Simplex shield with shield-penalty reward
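To make the idea of a pass-through gate concrete, the sketch below blends a policy action toward a safe prior with a per-actuator gate value. This is only one simple reading of "blending" (an elementwise convex combination); the learned projection in `safe_projection.py` may take a different form, and `gated_blend` and its arguments are hypothetical.

```python
def gated_blend(policy_action, safe_prior, gates):
    """Elementwise a_out = g * a_policy + (1 - g) * a_prior, with g clamped to [0, 1].

    g = 1 passes the policy action through unchanged; g = 0 falls back to the prior.
    """
    blended = []
    for a, p, g in zip(policy_action, safe_prior, gates):
        g = min(max(g, 0.0), 1.0)
        blended.append(g * a + (1.0 - g) * p)
    return blended


if __name__ == "__main__":
    policy = [1.0, 0.2, 0.9, -1.0]   # throttle, pump, HVAC, BESS (illustrative values)
    prior = [0.5, 0.8, 0.8, 0.0]     # illustrative conservative setpoints
    gates = [1.0, 0.5, 0.0, 0.25]    # illustrative gate outputs
    print(gated_blend(policy, prior, gates))
```

Because the blend is differentiable in both the action and the gate, a formulation like this can sit inside the policy's computation graph during training.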
Ablation studies isolate each layer's contribution:
| Variant | CBM | Gate | Shield | File |
|---|---|---|---|---|
| HA-C2G (full) | ✅ | ✅ | ✅ | baselines/safety/train_ha_c2g.py |
| CBM-Only | ✅ | ❌ | ❌ | baselines/safety/train_cbm_only.py |
| CBM+Gate | ✅ | ✅ | ❌ | baselines/safety/train_cbm_gate.py |
| CBM+Shield | ✅ | ❌ | ✅ | baselines/safety/train_cbm_shield.py |
Proof trees (baselines/safety/proof_tree.py) generate per-timestep hierarchical audit logs documenting which safety rules passed/failed and the sensor readings grounding each decision.
| Category | Metric | Description |
|---|---|---|
| Standard | `mean_reward` | Mean episode reward |
| Standard | `tracking_rmse` | RegD tracking RMSE |
| Standard | `thermal_viol_rate` | Fraction of ticks with thermal violation |
| Standard | `throughput_ratio` | Fraction of max IT capacity served |
| Standard | `bess_degradation` | Battery capacity fade over episode |
| Standard | `survival_rate` | Fraction of episodes surviving to 24 h |
| HA | `hard_violation_rate` | Rate of C1–C5 constraint violations |
| HA | `shield_intervention_rate` | How often the shield overrides the agent |
| HA | `constraint_margin` | Mean distance from nearest constraint boundary |
| HA | `worst_case_margin` | Minimum margin across all constraints |
| HA | `computational_overhead_ms` | Per-step shield compute time |
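Four of the HA metrics can be derived from per-step margins and shield flags. The following is a hedged sketch rather than the benchmark's definition: `ha_step_metrics` is a hypothetical helper, and it assumes margins are signed (negative means violated) and already normalised so that different constraints are comparable.

```python
def ha_step_metrics(margins_per_step, shield_modified):
    """Aggregate per-step signed margins and shield flags into HA metrics.

    margins_per_step: list of {constraint id: signed margin}, negative = violated.
    shield_modified:  list of booleans, True when the shield changed the action.
    """
    nearest = [min(step.values()) for step in margins_per_step]
    n = len(nearest)
    return {
        "hard_violation_rate": sum(m < 0 for m in nearest) / n,
        "shield_intervention_rate": sum(bool(f) for f in shield_modified) / n,
        "constraint_margin": sum(nearest) / n,
        "worst_case_margin": min(nearest),
    }


if __name__ == "__main__":
    margins = [{"C1": 2.0, "C3": 0.5}, {"C1": -1.0, "C3": 0.3}]  # toy values
    flags = [True, False]
    print(ha_step_metrics(margins, flags))
```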
```bash
# HA benchmark evaluation (11 metrics across all HA agents)
uv run evaluation/run_ha_benchmark.py --agents simplex_ppo cbf_ppo hj_ppo mpcsf_ppo ha_c2g
# HA-specific plots (Pareto frontier, radar, violin, LaTeX table)
uv run evaluation/generate_ha_plots.py
# Failure-case analysis (where/why/how often agents fail)
uv run evaluation/failure_analysis.py
# Statistical analysis (bootstrap CIs, Welch's t-test, Cohen's d)
uv run evaluation/statistical_analysis.py
```

- Renewable Integration: Data centers absorb excess wind/solar, preventing curtailment.
- Grid Stability: The DC acts as a "shock absorber" for the transmission grid, reducing reliance on fossil-fuel peaker plants.
- Cyber-Physical Benchmark: The first high-fidelity, multi-market testbed for hierarchical RL on real infrastructure physics.
- Six Global Markets: NYISO, PJM, CAISO, ERCOT, ENTSO-E, AEMO (largest DC hubs on Earth).
- DOE Genesis Alignment: 250 MW–1 GW scale matches the US national AI infrastructure program.
```bibtex
@inproceedings{c2gbench2026,
  title     = {{C2G-Bench}: A Cyber-Physical Benchmark for Grid-Interactive
               Hyperscale Data Centres},
  author    = {Anonymous},
  booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
  year      = {2026},
}
```

All figures are generated by the notebooks in `notebooks/` and can be reproduced by running `uv run jupyter lab notebooks/`.
Left-to-right, top-to-bottom: Fused workload time-series (rigid GenAI/DLRM + flexible batch + spot); power histogram per trace type; GenAI spike characterisation showing burst magnitude and inter-arrival distribution; DVFS throttle curve mapping throttle level to actual batch power reduction.
Step response (cold-start to thermal equilibrium); steady-state map (temperature vs. pump speed); COP degradation with ambient temperature (ERCOT 40 °C peak visible); HVAC parameter sweep; fault injection showing temperature excursion under 60% pump efficiency (Scenario C).
Power breakdown across the facility electrical chain (IT → UPS → PDU → transformer); PUE surface showing how ambient temperature and load interact; UPS non-linear efficiency curve; BESS charge/discharge cycle with SOC tracking and capacity fade; round-trip efficiency vs. C-rate.
24-hour regulation signal (AR(1) calibrated per market); power spectral density and statistics; LMP proxy across 6 global markets showing diurnal and seasonal patterns; ACF plot confirming AR(1) calibration quality.
C2GFastEnv 24-hour rollout (temperature, SOC, power, reward traces); reward component breakdown (tracking error, thermal penalty, SOC penalty, freq/voltage penalties); step-reward distribution; observation space coverage showing all 16 dimensions are exercised; C2GMacroEnv rollout at 15-min resolution; cross-scenario reward comparison.
Annual temperature profiles for all 6 markets (NYC, DCA, SJC, DFW, FRA, BKT); annual distribution; diurnal patterns by season; synthetic vs. real NOAA ISD validation; implied COP showing how weather drives cooling cost; normalised histograms; southern hemisphere seasonal inversion (AEMO NSW summer = January).
Annual grid load (NYISO 11-zone, PJM DOM, CAISO PG&E, ERCOT North, ENTSO-E DE, AEMO NSW); diurnal patterns; load duration curves + LMP distribution; grid stress indicator calibration used by macro_grid.py; joint weather–energy distribution (ambient temperature vs. LMP — key for thermal-economic co-optimisation).
Parameter overview (6 bar charts: T_amb, committed MW, BESS SOC₀, GenAI scale, grid stress, cooling efficiency); radar chart showing the overall stress fingerprint of each scenario; 2-hour rollout traces across all 4 scenarios for 5 physical signals; termination risk under 30 random-policy episodes; cumulative reward gap between scenarios; 24-configuration grid of all Scenario × Market pairings.
Zone temperature comparison across all 4 scenarios showing the thermal headroom difference driven by T_amb and committed MW settings.

















































