richard sutton on oak at rlc 2025
Richard Sutton (home page, wikipedia) gave a keynote talk titled The OaK Architecture: A Vision of SuperIntelligence from Experience at the 2025 Reinforcement Learning Conference (RLC) held at the University of Alberta. This talk was fun. It was informal and unpolished and felt like someone trying to explain ideas they are excited about but haven’t totally worked out yet. Below is a summary that uses some quotes from the slides.
Sutton starts off by stating his perspective that superintelligent agents are coming, that they will be good for the world, that the path to creating them runs through reinforcement learning, and that the biggest bottleneck is inadequate learning algorithms.
He outlines his role as one who has tried to think deeply about intelligence for fifty years and references The Alberta Plan for AI Research (paper, slides) he developed along with Mike Bowling and Patrick Pilarski. The OaK acronym stands for “Options and Knowledge” and is a component of The Alberta Plan, constituting a vision for an overall agent architecture.
In OaK, an option is a pair $(\pi, \gamma)$ where $\pi$ is a policy (a way of behaving) and $\gamma$ is a termination condition (a way of terminating the behavior). Slightly more technically, a policy $\pi$ maps states to a probability distribution over actions and a termination condition $\gamma$ maps states to 0 or 1. Knowledge is what the agent learns when different options are followed until termination.
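A minimal sketch of this pair as code may help. The class below is my own illustration, not anything from the talk: `pi` returns a distribution over actions, `gamma` returns 1 to terminate, and `rollout` follows the policy until the termination condition fires.

```python
import random

class Option:
    """Illustrative sketch of an option as the pair (pi, gamma)."""
    def __init__(self, pi, gamma):
        self.pi = pi        # pi(state) -> {action: probability}
        self.gamma = gamma  # gamma(state) -> 1 to terminate, 0 to continue

    def rollout(self, state, step):
        """Follow the policy until the termination condition fires.
        `step` is the environment's transition function (an assumption here)."""
        trajectory = [state]
        while self.gamma(state) == 0:
            probs = self.pi(state)
            action = random.choices(list(probs), weights=list(probs.values()))[0]
            state = step(state, action)
            trajectory.append(state)
        return trajectory

# Toy usage: an option on integer states that walks right until reaching 3.
walk_right = Option(pi=lambda s: {"right": 1.0},
                    gamma=lambda s: 1 if s >= 3 else 0)
print(walk_right.rollout(0, step=lambda s, a: s + 1))  # [0, 1, 2, 3]
```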
The agent has three main design goals,
- domain general: contains nothing specific to any world
- experiential: grows from runtime experience, not from a special training phase
- open ended: unlimited in sophistication (other than by computation resources)
Sutton focuses on the importance of continuous learning at run time versus design time. He references The Big World Hypothesis and The Bitter Lesson in advocating for systems designed with only the meta-methods for learning, as opposed to any specific domain knowledge.
> We want AI agents that can discover like we can, not which contain what we have discovered. (The Bitter Lesson)
After going into more details on the big world hypothesis, Sutton introduces the reward hypothesis,
> all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward). (Settling the Reward Hypothesis)
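The "cumulative sum of a received scalar signal" in the hypothesis is just the return. A one-line version, with the standard discounted variant included for reference (the discount factor is a textbook convention, not something from this quote):

```python
def cumulative_return(rewards, discount=1.0):
    """Sum of scalar rewards along a trajectory.
    discount=1.0 gives the plain cumulative sum in the hypothesis statement;
    discount < 1 gives the usual discounted return G = sum_t discount^t * r_t."""
    return sum(discount ** t * r for t, r in enumerate(rewards))

print(cumulative_return([1.0, 0.0, 2.0]))        # 3.0
print(cumulative_return([1.0, 0.0, 2.0], 0.5))   # 1.0 + 0.0 + 0.5 = 1.5
```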
He stresses that his goal is simplicity and that he wants a system that can learn generally using a scalar reward. Next he presents a series of architectures that lead to OaK,
- model-free: Basic RL. The agent constructs an approximate policy and/or value function, both a function of the world’s state.
- non-Markov: Better. The agent constructs its own state representation (e.g. as a feature vector)
- model-based: Better. The agent constructs a transition model of the world and uses it to plan a better policy and/or value function
- OaK:
- The agent poses auxiliary subproblems for attaining individual features
  - These enable the discovery of higher and higher levels of abstraction, limited only by computational resources
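To ground the "model-free" baseline that the later architectures improve on, here is a single tabular Q-learning update, the textbook example of constructing a value function without any transition model. This is a standard illustration, not code from the talk:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped
    target r + gamma * max over a' of Q[s', a'].  Model-free in Sutton's
    sense: the agent never builds a transition model, only values.
    Q is a dict mapping (state, action) -> value."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

Q = q_update({}, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
print(Q[(0, "right")])  # 0.1, i.e. alpha * (1.0 + 0.9 * 0.0 - 0.0)
```

The model-based row in the list replaces the bootstrapped target with planning over a learned transition model; OaK then adds self-posed subproblems on top of that.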
Next, an outline of the OaK architecture is presented as a set of steps to be performed in parallel,
- ✔ Learn the policy and value function for maximizing reward
- ✔ Generate new state features from existing state features
- ✔ Rank-order the features
- ✔ Create subproblems, one for each highly-ranked feature
- ✔ Learn solutions to subproblems (options and sub-value functions)
- ✔ Learn transition models of the options and actions
- ✔ Plan with the transition models
- ✔ Maintain meta-data on the utility of everything; curate
The checkmarks indicate Sutton’s estimation of how much progress has been made on each,
- ✔ Would be done if we could do continual deep learning and meta-learning
- ✔ Lots of ideas, but no specific proposal
- ✔ Seems easy but can’t be done until the rest are done
- ✔ Seems done
The rest of the talk goes into more technical detail on these points, with the majority of it focused on how the OaK system might go about creating and solving its own subproblems. It was refreshing to watch a researcher present a grand, ambitious plan without relying on hype, while being realistic about which parts still need work. I’ve collected some of the references below, but there are more to be found in the talk.