In our 5th RL Debates presentation, Anne argued that reward-based learning is not always driven by RL computations. Sometimes it’s working memory combined with outcome-insensitive habit formation, mimicking RL without computing values.

Anne’s core argument is that RL, as traditionally modeled, fails to capture the actual cognitive processes underlying reward-based learning. She presented compelling evidence that human learning in instrumental tasks relies on two distinct, parallel systems: a fast but capacity-limited working memory (WM) system that is sensitive to outcome valence, and a slower, outcome-insensitive associative system that strengthens stimulus-action bonds regardless of whether outcomes are positive or negative.

Critically, neither system alone implements RL computations. The WM system is too capacity-limited and forgetful, while the habit system ignores outcome valence entirely. Yet together they produce RL-like behavior: WM bootstraps good action selection, which the habit system then reinforces through repetition. Anne’s models show that this multi-system account is needed to explain a key empirical signature: under high cognitive load (large “set sizes”), participants show no sensitivity to negative outcomes and do not avoid previously unrewarded actions, contradicting standard RL predictions.
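
To make the argument concrete, here is a minimal, hypothetical sketch of that two-system idea (not Anne’s actual model or code): a capacity-limited, valence-sensitive WM store combined with an outcome-insensitive habit weight in a softmax choice. All names and parameter values (wm_capacity, habit_lr, beta, the eviction rule) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(set_size, wm_capacity=3, trials_per_stim=12, habit_lr=0.1, beta=8.0):
    """Toy two-system learner: valence-sensitive WM + outcome-insensitive habit."""
    n_actions = 3
    correct = rng.integers(n_actions, size=set_size)  # hidden correct action per stimulus
    wm = {}                                           # limited store: stimulus -> (last action, last outcome)
    habit = np.zeros((set_size, n_actions))           # association weights, updated regardless of outcome
    last = {}                                         # last trial per stimulus (for analysis only)
    repeat_after_error = []

    for _ in range(trials_per_stim * set_size):
        s = rng.integers(set_size)

        # Action preferences: habit weights, plus a WM bonus/penalty if the item is held.
        prefs = habit[s].copy()
        if s in wm:
            a_prev, r_prev = wm[s]
            prefs[a_prev] += 1.0 if r_prev else -1.0  # WM uses outcome valence
        p = np.exp(beta * (prefs - prefs.max()))
        a = rng.choice(n_actions, p=p / p.sum())
        r = int(a == correct[s])

        # Signature of interest: repeating an action that just went unrewarded.
        if s in last and last[s][1] == 0:
            repeat_after_error.append(int(a == last[s][0]))
        last[s] = (a, r)

        # WM update: refresh this item; evict the stalest one if over capacity.
        wm.pop(s, None)
        wm[s] = (a, r)
        if len(wm) > wm_capacity:
            wm.pop(next(iter(wm)))

        # Habit update: strengthen the chosen link whether or not it was rewarded.
        habit[s, a] += habit_lr

    return np.mean(repeat_after_error)

for n in (2, 6):
    print(f"set size {n}: P(repeat unrewarded action) = {simulate(n):.2f}")
```

At a small set size WM covers every stimulus, so an unrewarded action is penalized the next time that stimulus appears; at a larger set size items fall out of WM and choice is carried by habit weights that were strengthened regardless of reward, reproducing the insensitivity to negative outcomes described above.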

Watch the full meeting here:
