The Problem We're Solving
Reinforcement learning is one of the most powerful techniques in modern AI — but it has a fundamental vulnerability. When a model's reward function doesn't perfectly capture what you actually want, the model learns to exploit the gap. It "hacks" the reward.
This isn't a niche problem. It happens across robotics, game-playing agents, language model fine-tuning with RLHF, and recommendation systems. The harder problem isn't getting high rewards — it's making sure those rewards mean something real.
"A system will optimize for exactly what you measure — and find creative ways to score well that have nothing to do with what you actually wanted."
What We Built
RewardGuard is an AI alignment toolkit that gives ML teams visibility into what their reward functions are actually doing during training. Our free open-source package detects reward hacking, identifies misalignment patterns, and surfaces actionable warnings before problems compound.
Our premium package goes further: it automatically adjusts reward parameters in response to detected issues, keeping training on track without requiring manual intervention every time something drifts.
Our Approach
We believe the right place to catch alignment problems is during training, not after deployment. RewardGuard instruments the training loop, watching for the statistical signatures that precede reward hacking: sudden reward spikes, diverging sub-reward components, and reward curves that decouple from actual task performance.
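Two of these signatures can be monitored with simple statistics: a spike detector that flags a reward far outside its recent history, and a correlation check between reward and an independent task metric, which drifts toward zero when the reward decouples from real performance. The sketch below illustrates the idea only; the function names, window size, and threshold are illustrative, not RewardGuard's actual API.

```python
from statistics import mean, stdev

def detect_spike(rewards, window=20, z_thresh=4.0):
    """Flag a sudden reward spike: the latest reward sits far
    above the mean of the recent window, measured in std devs."""
    if len(rewards) <= window:
        return False  # not enough history yet
    hist = rewards[-window - 1:-1]
    mu, sigma = mean(hist), stdev(hist)
    return sigma > 0 and (rewards[-1] - mu) / sigma > z_thresh

def reward_task_correlation(rewards, task_scores):
    """Pearson correlation between training reward and an
    independently measured task metric. A value drifting toward
    zero suggests the reward has decoupled from task performance."""
    n = len(rewards)
    mr, mt = mean(rewards), mean(task_scores)
    cov = sum((r - mr) * (t - mt) for r, t in zip(rewards, task_scores)) / n
    sr = (sum((r - mr) ** 2 for r in rewards) / n) ** 0.5
    st = (sum((t - mt) ** 2 for t in task_scores) / n) ** 0.5
    if sr == 0 or st == 0:
        return 0.0  # one series is constant; correlation undefined
    return cov / (sr * st)
```

In practice a monitor like this would run inside the training loop, checked every N steps, with the task metric coming from a held-out evaluation rather than the reward model itself.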
The free package is open source because we think the alignment community benefits from shared, peer-reviewed tooling. The premium tier funds continued development and adds the automation layer that production teams need.
Our Commitment
We're committed to keeping the core analysis tools open and accessible. As the field advances, we'll keep the free package up to date with the latest research. Premium customers fund that work and get early access to new capabilities.
Start using RewardGuard today
The free package is available on PyPI. No account required.