Function \( \mathrm{ComputeReward}(x,c,y) \):
if \(t \in x\) then
\(r_{\text{acc}} \leftarrow 1\) if \(y\) is wrong, else \(0\)
return \( \alpha r_{\text{acc}} + (1-\alpha)\,\mathbb{1}\{V(c,y)=\mathrm{True}\} \)
else
\(r_{\text{acc}} \leftarrow 1\) if \(y\) is correct, else \(0\)
return \(r_{\text{acc}}\)
for \(e \leftarrow 1\) to \(E_r\) do
Sample \(x \sim \mathcal{D}_{\mathrm{RL}}\); generate \((c,y)\sim\pi_\theta(x)\);
\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);
Update \(\theta\) with GRPO
end for
for \(e \leftarrow 1\) to \(E_c\) do
Sample \(x \sim \mathcal{D}_{\mathrm{CF}}\); generate \((c,y)\sim\pi_\theta(x)\);
\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);
Update \(\theta\) with GRPO
end for