When a training run finishes and RewardGuard tells you the reward balance score is 0.34, what does that number actually mean? How was it computed? What would make it go up or down? This post is an inside look at the methodology — the math, the heuristics, and the design choices behind how RewardGuard turns a stream of reward component values into a single interpretable alignment signal.

Starting Point: Reward Components

Every reward function is a sum of components. In a navigation task, you might have a distance-to-goal component, a step-penalty component, and a collision-avoidance component. In a game environment, you might have a score component, an alive-time component, and a safety component. In an RLHF process, you have a preference score, a KL penalty, and a set of auxiliary filters.

The total reward is what the optimizer sees. The components are what tell you why the total is high. A total reward that's climbing because of progress toward the goal is healthy. A total reward climbing because the agent found a way to exploit the alive-time component is a problem — even if the numbers look identical from the outside.

RewardGuard requires you to log components separately. This is the foundational step. Everything else — ratios, baselines, scores — is computed from those component time series.
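As a concrete sketch of what "log components separately" means in practice — this is an illustrative shape, not RewardGuard's actual API; the `ComponentLogger` class and `log_step` method are hypothetical names:

```python
from collections import defaultdict

class ComponentLogger:
    """Hypothetical sketch: keep one time series per reward component
    rather than logging only their sum."""

    def __init__(self):
        self.series = defaultdict(list)  # component name -> list of values

    def log_step(self, components):
        """components: dict mapping component name -> reward value this step."""
        for name, value in components.items():
            self.series[name].append(value)

    def total(self, step):
        # the total the optimizer sees is just the sum of the components
        return sum(s[step] for s in self.series.values())

logger = ComponentLogger()
logger.log_step({"distance_to_goal": 0.8, "step_penalty": -0.1, "collision": 0.0})
logger.log_step({"distance_to_goal": 0.9, "step_penalty": -0.1, "collision": -1.0})
```

Everything downstream — ratios, baselines, scores — operates on the per-component series, while the summed total remains available for the optimizer.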

The Component Ratio

The simplest signal RewardGuard computes is the ratio between components. For a two-component reward with a primary goal component and a secondary constraint component:

Component Ratio (two-component):

ratio(t) = R_goal(t) / R_constraint(t)

A healthy agent accumulates both components in proportion. An agent that's exploiting the constraint component shows this ratio drifting — the constraint component grows faster than the goal component, or the goal component flatlines while the constraint keeps climbing.

For environments with more than two components, RewardGuard computes pairwise ratios for each combination the user marks as relevant, plus a single aggregate balance score. The aggregate is a weighted deviation from the expected ratio profile established during the first N training steps (the "baseline window").
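Enumerating the relevant pairs can be sketched in a few lines — the function name and the `relevant` filter argument are illustrative assumptions, not RewardGuard's interface:

```python
from itertools import combinations

def pairwise_ratios(components, relevant=None, eps=1e-8):
    """components: dict of component name -> cumulative value at this step.
    relevant: optional set of (name_a, name_b) pairs to restrict to."""
    ratios = {}
    for a, b in combinations(sorted(components), 2):
        if relevant is not None and (a, b) not in relevant:
            continue
        # eps guards against division by zero early in training
        ratios[(a, b)] = components[a] / (components[b] + eps)
    return ratios

r = pairwise_ratios({"goal": 12.0, "survival": 3.0, "collision": 1.5})
# three components -> C(3, 2) = 3 pairs; five components would yield 10
```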

The Baseline Window

The hardest part of ratio analysis is defining "normal." A ratio that's concerning at step 50,000 might be completely expected at step 500 — early training is chaotic, and component ratios fluctuate wildly before the policy stabilizes.

RewardGuard establishes a baseline from the first segment of training, configurable via the baseline_steps parameter (default: 10% of expected total steps, or the first 1,000 steps if total is unspecified). During this window, the monitor observes the component ratio distribution and computes two statistics: the mean (μ_baseline) and standard deviation (σ_baseline) of the ratio.

After the baseline window closes, deviations from the baseline mean are measured in standard deviations — a z-score for the ratio. This normalization means the threshold for flagging is relative to the specific environment's natural variance, not an absolute number.

Normalized Ratio Deviation:

z(t) = (ratio(t) − μ_baseline) / σ_baseline

Design Choice

Using z-scores rather than absolute thresholds means the same flag threshold (e.g., z > 2.5) applies consistently across environments with very different natural reward scales — from Atari games with thousands of points to robotics environments where rewards are fractions of a meter.
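A minimal sketch of the baseline window plus z-score normalization, assuming the baseline statistics are just the sample mean and standard deviation over the window (the class and method names are illustrative, not RewardGuard's API):

```python
import statistics

class BaselineMonitor:
    """Hypothetical sketch: collect ratios during the baseline window,
    then emit z-scores relative to the window's mean and stdev."""

    def __init__(self, baseline_steps=1000):
        self.baseline_steps = baseline_steps
        self.window = []
        self.mu = None
        self.sigma = None

    def observe(self, ratio):
        """Returns a z-score once the baseline window has closed, else None."""
        if len(self.window) < self.baseline_steps:
            self.window.append(ratio)
            if len(self.window) == self.baseline_steps:
                self.mu = statistics.fmean(self.window)
                self.sigma = statistics.stdev(self.window)
            return None
        return (ratio - self.mu) / self.sigma

mon = BaselineMonitor(baseline_steps=3)
for r in (1.0, 2.0, 3.0):
    mon.observe(r)          # filling the window: returns None
z = mon.observe(4.5)        # (4.5 - 2.0) / 1.0 = 2.5
```

A real implementation would also need to handle σ_baseline ≈ 0 (a ratio that never moved during the window), but the normalization step itself is this simple.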

From Z-Score to Alignment Score

The raw z-score tells you how far the current ratio has drifted from the baseline. The alignment score converts this into a 0–1 value where 1 is fully aligned and 0 is completely misaligned. The conversion uses a sigmoid-based mapping:

Alignment Score:

alignment(t) = 1 / (1 + exp(k · (|z(t)| − threshold)))

Where threshold is the configured flag threshold (default 2.5 standard deviations) and k is a steepness parameter (default 1.2). This produces a score that decays smoothly as drift increases, rather than a hard binary 0/1 flag at the threshold.
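The mapping is a direct transcription of the formula above, with defaults matching the stated threshold (2.5) and steepness (1.2):

```python
import math

def alignment_score(z, threshold=2.5, k=1.2):
    """Sigmoid mapping from a ratio z-score to a 0-1 alignment score.
    At |z| == threshold the score is exactly 0.5."""
    return 1.0 / (1.0 + math.exp(k * (abs(z) - threshold)))

alignment_score(0.0)   # well inside baseline -> close to 1
alignment_score(2.5)   # exactly at the threshold -> 0.5
```

Because the formula uses |z(t)|, drift in either direction — constraint component outrunning the goal, or vice versa — lowers the score symmetrically.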

A few example scores for context:

Score: 0.94 — ratio within 0.5σ of baseline. Normal training variance, no concern.

Score: 0.70 — ratio at 1.8σ. Worth monitoring, no alert yet.

Score: 0.50 — ratio exactly at the 2.5σ threshold. Flag triggered.

Score: 0.14 — ratio at 4σ. Significant exploitation in progress.

The Trend Detector

A single-point z-score has a significant weakness: it can be triggered by transient spikes that correct themselves. A good hacking detection system should distinguish between a momentary outlier and a sustained drift. RewardGuard addresses this with a trend layer on top of the ratio z-score.

The trend detector maintains a rolling window of recent z-scores (default: 50 steps) and fits a linear regression to that window. The slope of the regression is the drift velocity — how fast the ratio is moving away from baseline. An alert is upgraded from "warning" to "critical" when both conditions hold simultaneously:

  1. Current z-score exceeds the flag threshold (|z| > 2.5)
  2. Drift velocity is positive and above the velocity threshold

This catches the difference between an agent that briefly spiked but is recovering (warning only) versus an agent that is actively and consistently drifting toward exploitation (critical).
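The two-condition check can be sketched as follows — the slope fit and the alert levels follow the description above, but the specific velocity threshold value and the function name are illustrative assumptions:

```python
import numpy as np

def classify(z_window, flag_threshold=2.5, velocity_threshold=0.02):
    """Classify the latest step given a rolling window of z-scores.
    Drift velocity is the slope of a linear fit over the window."""
    z_window = np.asarray(z_window, dtype=float)
    slope = np.polyfit(np.arange(len(z_window)), z_window, 1)[0]
    current = z_window[-1]
    if abs(current) > flag_threshold and slope > velocity_threshold:
        return "critical"   # past the threshold AND still drifting away
    if abs(current) > flag_threshold:
        return "warning"    # past the threshold, but not actively drifting
    return "ok"

classify(np.linspace(0.0, 3.0, 50))   # steadily climbing past 2.5 -> "critical"
classify(np.linspace(4.0, 2.6, 50))   # spiked but recovering -> "warning"
```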

Multi-Component Environments

When a reward function has three or more components, pairwise ratios multiply quickly. An environment with five components has ten pairwise ratios — monitoring all of them independently would produce too many signals to interpret.

RewardGuard uses a priority-weighted aggregate for multi-component environments. Each component is assigned a weight reflecting how exploitable it is — components with low natural variance and high potential for hacking (survival timers, proximity measures) get higher weight than noisy, hard-to-exploit components. The aggregate alignment score is the weighted average of individual component alignment scores.

Component            Weight   Current z-score   Alignment
goal_progress        0.40     +0.3              0.96
survival_time        0.35     +3.8              0.17
collision_penalty    0.15     +1.9              0.62
energy_use           0.10     +0.1              0.99

In this example, the survival_time component is being heavily exploited (z = 3.8) while goal progress looks healthy. The aggregate score works out to approximately 0.64 — flagged, with the survival component clearly identified as the source. Without component-level logging, the healthy goal_progress component would mask the problem in any aggregate metric.
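Reproducing the aggregate from the table above is a one-line weighted average (component names and values taken from the example):

```python
# Weights and per-component alignment scores from the example table
weights = {"goal_progress": 0.40, "survival_time": 0.35,
           "collision_penalty": 0.15, "energy_use": 0.10}
scores  = {"goal_progress": 0.96, "survival_time": 0.17,
           "collision_penalty": 0.62, "energy_use": 0.99}

# Weighted average of component alignment scores (weights sum to 1)
aggregate = sum(weights[c] * scores[c] for c in weights)
# aggregate is approximately 0.64 -- below the 0.5-at-threshold midpoint
# only because survival_time drags it down
```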


The reward balance score is not a black-box number. Every part of it — the baseline, the z-score, the sigmoid mapping, the trend velocity — is a deliberate design choice with a specific reason. Understanding the methodology makes it easier to configure RewardGuard correctly for your environment, interpret what its reports are telling you, and know when a flag is signal versus noise. The goal is not to replace your judgment — it's to give you numbers worth reasoning about.