We are working to apply the ideas of R1 to computer use. The main challenge is building reliable neural reward models, because hard, verifiable rewards are not available at scale for GUI interactions.
Our team is currently deep in the weeds collecting reasoning-annotation data for GUIs to train a reliable reward model.
We would love any thoughts, feedback, or collaboration!