Hacker News | jonathanlight's comments

Hi HN — I'm one of the authors.

This paper looks at a problem that comes up in RL post-training of large models: the training data mixture (or curriculum) is often manually tuned and static, even though the policy keeps changing during training.

We propose Actor-Curator, a framework where a learned "curator" adaptively selects training problems while the actor policy is being optimized. The curator is trained to maximize a policy-improvement objective, effectively learning which data is most useful for improving the policy at each stage of training.

Conceptually it’s a co-adaptive system:

- the actor learns the policy
- the curator learns the training curriculum
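To make the co-adaptive loop concrete, here is a minimal toy sketch of the idea, not the paper's actual method: the curator is modeled as a softmax bandit over data sources, rewarded by the measured policy improvement each batch produces, while the actor is reduced to a scalar "skill" for illustration. All names (`Curator`, `actor_update`, the per-source gains) are hypothetical.

```python
# Toy sketch of an actor-curator loop (hypothetical, not the paper's algorithm).
import math
import random

random.seed(0)

class Curator:
    """Softmax bandit over data sources, rewarded by policy improvement."""
    def __init__(self, n_sources, lr=0.5):
        self.prefs = [0.0] * n_sources
        self.lr = lr

    def probs(self):
        m = max(self.prefs)
        exps = [math.exp(p - m) for p in self.prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def pick(self):
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return i
        return len(self.prefs) - 1

    def update(self, source, improvement):
        # Reinforce sources whose batches actually improved the policy.
        self.prefs[source] += self.lr * improvement

# Toy "actor": a scalar skill level; each source has a different
# marginal usefulness that shrinks as the actor gets stronger.
def actor_update(skill, source):
    gain = [0.3, 0.1, 0.02][source] * max(0.0, 1.5 - skill)
    return skill + gain, gain  # new skill, measured improvement

skill, curator = 0.0, Curator(n_sources=3)
for step in range(200):
    src = curator.pick()                      # curator selects training data
    skill, improvement = actor_update(skill, src)  # actor trains on it
    curator.update(src, improvement)          # curator learns from the result
```

In this sketch the curator's reward is exactly the policy-improvement signal, so it concentrates probability on whichever source is most useful at the current stage of training; the real framework replaces the bandit and the scalar actor with learned models.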

Happy to answer questions or discuss!

