Hacker News | jonathanlight's comments

Hi HN — I'm one of the authors.

This paper looks at a problem that comes up in RL post-training of large models: the training data mixture (or curriculum) is often manually tuned and static, even though the policy keeps changing during training.

We propose Actor-Curator, a framework where a learned "curator" adaptively selects training problems while the actor policy is being optimized. The curator is trained to maximize a policy-improvement objective, effectively learning which data is most useful for improving the policy at each stage of training.

Conceptually it’s a co-adaptive system:

- the actor learns the policy
- the curator learns the training curriculum
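To make the co-adaptive loop concrete, here is a minimal toy sketch of the idea, not the paper's actual method: the curator is modeled as a softmax bandit over data sources, rewarded by the measured policy improvement each batch produces, while the actor is reduced to a scalar "skill" for illustration. All names (`Curator`, `actor_update`, the per-source gains) are hypothetical.

```python
# Toy sketch of an actor-curator loop (hypothetical, not the paper's algorithm).
import math
import random

random.seed(0)

class Curator:
    """Softmax bandit over data sources, rewarded by policy improvement."""
    def __init__(self, n_sources, lr=0.5):
        self.prefs = [0.0] * n_sources
        self.lr = lr

    def probs(self):
        m = max(self.prefs)
        exps = [math.exp(p - m) for p in self.prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def pick(self):
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return i
        return len(self.prefs) - 1

    def update(self, source, improvement):
        # Reinforce sources whose batches actually improved the policy.
        self.prefs[source] += self.lr * improvement

# Toy "actor": a scalar skill level; each source has a different
# marginal usefulness that shrinks as the actor gets stronger.
def actor_update(skill, source):
    gain = [0.3, 0.1, 0.02][source] * max(0.0, 1.5 - skill)
    return skill + gain, gain  # new skill, measured improvement

skill, curator = 0.0, Curator(n_sources=3)
for step in range(200):
    src = curator.pick()                      # curator selects training data
    skill, improvement = actor_update(skill, src)  # actor trains on it
    curator.update(src, improvement)          # curator learns from the result
```

In this sketch the curator's reward is exactly the policy-improvement signal, so it concentrates probability on whichever source is most useful at the current stage of training; the real framework replaces the bandit and the scalar actor with learned models.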

Happy to answer questions or discuss!

