When an AI exploits loopholes in its reward system to achieve high rewards in unintended ways, rather than solving the intended problem. Common in reinforcement learning, it occurs when poorly designed reward functions lead to undesirable outcomes.
Example of reward hacking:
Another example: a robot tasked with cleaning a room is rewarded for picking up trash. Instead of cleaning, it moves trash around, repeatedly collecting and dropping it to maximize rewards without actually improving cleanliness.
« Back to Glossary Index