GRPO Advantage Visualizer
Number of completions (n)
Reward slider min
Reward slider max
Epsilon (stability)
Reset rewards to 0
Randomize rewards (Uniform)
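Z-score variant: \(A_i = \dfrac{r_i - \bar{r}}{\sigma_r + \varepsilon}\), where \(\bar{r}\) is the mean and \(\sigma_r\) the standard deviation of the \(n\) rewards in the group. When \(\sigma_r = 0\) (all rewards are equal), all advantages are set to 0.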
Mean-centered variant: \(A_i = r_i - \bar{r}\) (no variance normalization).
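The two variants above can be sketched in a few lines of Python. This is a minimal illustration, not the visualizer's actual code: the function names `zscore_advantages` and `mean_centered_advantages` are hypothetical, the default `eps` value is arbitrary, and the use of the population standard deviation (dividing by \(n\)) is an assumption, since the page does not say which estimator it uses.

```python
import math


def zscore_advantages(rewards, eps=1e-6):
    """Z-score variant: A_i = (r_i - mean) / (std + eps).

    Assumption: std is the population standard deviation (divide by n).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    if std == 0.0:
        # All rewards are identical, so every advantage is set to 0.
        return [0.0] * n
    return [(r - mean) / (std + eps) for r in rewards]


def mean_centered_advantages(rewards):
    """Mean-centered variant: A_i = r_i - mean (no variance normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # example rewards for n = 4 completions
    print(zscore_advantages(group))
    print(mean_centered_advantages(group))  # [0.5, -0.5, -0.5, 0.5]
```

In the z-score variant the numerator is already zero whenever all rewards are equal, so the explicit \(\sigma_r = 0\) branch mainly makes that case explicit; \(\varepsilon\) keeps the division stable when \(\sigma_r\) is very small but nonzero.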