RewardGuard is designed to drop into existing training workflows with minimal friction. You don't need to change your model architecture, your optimizer, or your reward function. You just need to tell RewardGuard what your reward components are and log them at each step — it handles the rest.
This tutorial uses a simple PyTorch RL loop, but the same pattern works with JAX, any gym-compatible environment, and Stable Baselines 3.
1. Install the Package
The free package is available on PyPI:
pip install rewardguard
The premium package, which adds auto-adjustment, is installed separately and requires a license key from your dashboard:
pip install rewardguard-premium
2. Identify Your Reward Components
Before adding monitoring, identify the distinct components that make up your reward signal. If your reward function looks like this:
def compute_reward(state, action, next_state):
    survival = 1.0  # always positive while alive
    goal_dist = -0.1 * distance_to_goal(next_state)
    food_bonus = 10.0 if reached_food(next_state) else 0.0
    death_pen = -50.0 if is_terminal(next_state) else 0.0
    return survival + goal_dist + food_bonus + death_pen
The components are survival, goal_dist, food_bonus, and death_pen. RewardGuard needs each of these logged separately, not just their total. (The later steps register death_pen under the component name death_penalty.)
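One way to expose the components without changing the total is to return them by name and sum them afterwards. The sketch below is only an illustration: the helpers distance_to_goal, reached_food, and is_terminal are hypothetical stand-ins operating on a toy dict-based state, and compute_reward_components is a name introduced here, not part of RewardGuard.

```python
# Hypothetical stand-ins for the environment helpers used in the tutorial.
# A "state" here is just a dict; replace these with your own logic.
def distance_to_goal(state):
    return state["dist"]

def reached_food(state):
    return state["food"]

def is_terminal(state):
    return state["dead"]

def compute_reward_components(state, action, next_state):
    """Return each reward component by name instead of only the sum."""
    return {
        "survival": 1.0,
        "goal_dist": -0.1 * distance_to_goal(next_state),
        "food_bonus": 10.0 if reached_food(next_state) else 0.0,
        "death_penalty": -50.0 if is_terminal(next_state) else 0.0,
    }

def compute_reward(state, action, next_state):
    """Same total as before: the sum of the named components."""
    return sum(compute_reward_components(state, action, next_state).values())

# Toy transition: 5 units from the goal, food reached, agent still alive.
components = compute_reward_components(
    None, None, {"dist": 5.0, "food": True, "dead": False}
)
total = sum(components.values())
```

Because the dictionary keys match the component names registered with the monitor, the dictionary could be unpacked straight into the keyword-argument logging call used in the training loop.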
3. Initialize the Monitor
import rewardguard as rg

monitor = rg.Monitor(
    components=["survival", "goal_dist", "food_bonus", "death_penalty"],
    window=500,            # analysis window in steps
    primary="food_bonus",  # the component that should dominate
    threshold=8.0,         # alert ratio (passive/primary)
    confidence=0.90,       # minimum confidence to flag
)
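The primary parameter tells RewardGuard which component represents genuine task progress. An alert is triggered when any other component's accumulated reward over the analysis window exceeds the primary component's by more than the threshold factor (here, a ratio above 8.0) and the detection confidence is at least the confidence value.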
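To make the threshold concrete, the sketch below shows the kind of check this description implies: sum each component over the last window steps, then compare every non-primary component to the primary one. It illustrates the ratio idea only and is not RewardGuard's actual implementation. The function name ratio_alerts, the toy step data, and the handling of zero or negative sums are all assumptions.

```python
from collections import deque

def ratio_alerts(steps, primary, threshold, window):
    """Return {component: ratio} for components exceeding the threshold.

    steps: iterable of dicts mapping component name -> reward at that step.
    Only the most recent `window` steps are considered.
    """
    recent = deque(steps, maxlen=window)  # keeps only the last `window` items
    totals = {}
    for step in recent:
        for name, value in step.items():
            totals[name] = totals.get(name, 0.0) + value

    primary_total = totals.get(primary, 0.0)
    alerts = {}
    for name, total in totals.items():
        if name == primary or total <= 0.0:
            continue  # assumption: penalties (non-positive sums) have no ratio
        if primary_total <= 0.0:
            alerts[name] = float("inf")  # primary never earned: unbounded ratio
        elif total / primary_total > threshold:
            alerts[name] = total / primary_total
    return alerts

# Toy data: 500 steps, survival paid on 482 of them, food collected once.
steps = [
    {
        "survival": 1.0 if i < 482 else 0.0,
        "food_bonus": 10.0 if i == 0 else 0.0,
        "goal_dist": -0.02,
    }
    for i in range(500)
]
alerts = ratio_alerts(steps, primary="food_bonus", threshold=8.0, window=500)
```

With this toy data only survival is flagged, because its accumulated reward is far more than eight times the food_bonus total, while goal_dist is skipped as a pure penalty.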
4. Log Components in Your Training Loop
Add a single logging call inside your step loop:
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = policy.act(state)
        # Classic Gym API: step() returns a 4-tuple. The environment's own
        # scalar reward is discarded because the components are computed below.
        next_state, _, done, info = env.step(action)

        # Compute your reward components
        survival = 1.0 if not done else 0.0
        food_bonus = info.get("food_collected", 0) * 10.0
        goal_dist = -0.1 * info["dist_to_goal"]
        death_pen = -50.0 if done else 0.0

        # Log to RewardGuard (one extra line)
        monitor.log(
            survival=survival, food_bonus=food_bonus,
            goal_dist=goal_dist, death_penalty=death_pen
        )
        state = next_state

    # Check for issues every episode
    report = monitor.analyze()
    if report.detects_hacking():
        print(report.summary())
        break
5. Reading the Report
When RewardGuard detects a problem, the summary looks like this:
RewardGuard v2.1.0 — Analysis Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠ REWARD HACKING DETECTED
Confidence: 97.3%
Window: steps 340–840
Component Ratios (vs. primary: food_bonus)
─────────────────────────────────────────
survival    accumulated: 482.0   ratio: 48.2:1   ← EXPLOIT
goal_dist   accumulated: -12.4   ratio: n/a
food_bonus  accumulated:  10.0   ratio: 1.0
Diagnosis:
Agent is farming survival reward without engaging
the primary objective (food_bonus). The survival
component is unbounded relative to food_bonus.
Suggested Fix:
↓ Reduce survival weight OR ↑ Increase food_bonus weight
Target ratio: survival/food_bonus ≤ 8.0:1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
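As a worked example of the suggested fix, the arithmetic below estimates how far either weight would have to move to bring the reported ratio down to the target. It assumes the agent's behavior stays the same after rescaling, which is generally not true, so treat the result as a starting point rather than a guarantee. The variable names are illustrative.

```python
# Figures taken from the sample report above.
survival_total = 482.0
food_total = 10.0
target_ratio = 8.0

current_ratio = survival_total / food_total      # 48.2

# Option A: scale the survival weight down by this factor (about 0.166).
survival_scale = target_ratio / current_ratio

# Option B: scale the food_bonus weight up by this factor (about 6.03).
food_scale = current_ratio / target_ratio

print(f"current ratio: {current_ratio:.1f}:1")
print(f"survival weight x {survival_scale:.3f}  or  food_bonus weight x {food_scale:.3f}")
```

Either change alone lands exactly on the 8.0:1 target for this window. In practice a somewhat larger adjustment leaves headroom, since the policy will adapt to the new weights.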
6. Export a Full Report (Optional)
# Export to JSON for logging/CI integration
report.export("audit_run_42.json")
# Export to PDF for sharing with your team
report.export("audit_run_42.pdf", format="pdf")
Premium: Auto-Adjustment
With a premium license, replace rg.Monitor with rg.PremiumMonitor and add auto_adjust=True. When hacking is detected, the monitor will automatically rebalance your reward weights without stopping training. The adjustment is logged to the report.
Integrating with CI/CD
For production workflows, you want monitoring to fail the run automatically if reward hacking is detected above a severity threshold. The report object exposes a severity score from 0 to 1:
report = monitor.analyze()
if report.severity > 0.8:
    raise RuntimeError(
        f"Training aborted: reward hacking severity {report.severity:.2f}"
    )
Add this check after each evaluation step in your training loop, and your CI system will catch reward hacking before the run completes — saving compute and giving you a clear signal about what went wrong.
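What a CI system actually observes is the process exit status. An uncaught exception already produces a non-zero status in Python, but an explicit gate makes the exit code and message deliberate. The sketch below assumes only that the report exposes a numeric severity attribute, as described above. The function severity_gate, its 0.8 default, and the stand-in report objects are hypothetical.

```python
import sys
from types import SimpleNamespace

def severity_gate(report, limit=0.8):
    """Return a process exit code: 1 if severity exceeds the limit, else 0."""
    if report.severity > limit:
        print(
            f"Training aborted: reward hacking severity {report.severity:.2f}",
            file=sys.stderr,
        )
        return 1
    return 0

# Stand-in reports for illustration; in a real run this would be the
# object returned by monitor.analyze().
ok_code = severity_gate(SimpleNamespace(severity=0.35))
bad_code = severity_gate(SimpleNamespace(severity=0.93))

# A training script would typically end with:
#     sys.exit(severity_gate(report))
```

Returning the code instead of calling sys.exit inside the function keeps the gate easy to unit test and leaves the decision to terminate with the calling script.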
That's it. One object and two method calls (Monitor, log(), analyze()), one extra line per training step, and you have continuous reward balance monitoring integrated into your existing loop. The free package gives you detection and diagnosis; the premium package closes the loop with automatic correction.