For simple, single-turn tasks, the RL objective is to maximize the expected reward $R(s, a)$ for generating a response $a$ to a prompt $s$ under a policy $\pi_\theta$:
$$ J(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot|s)}\left[ R(s, a) \right] $$
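As a rough illustration, here is a minimal sketch of optimizing this objective with a Monte Carlo (REINFORCE-style) surrogate. The `policy.sample` and `reward_fn` interfaces are hypothetical placeholders, not an existing API:

```python
def reinforce_surrogate(policy, reward_fn, prompts, num_samples=4):
    """Monte Carlo estimate of the single-turn objective J(theta) = E[R(s, a)].

    Assumed (hypothetical) interfaces:
      - policy.sample(prompt) -> (response, log_prob), where log_prob is the
        differentiable sum of token log-probabilities under pi_theta.
      - reward_fn(prompt, response) -> scalar reward R(s, a).
    """
    surrogate, n = 0.0, 0
    for prompt in prompts:                               # s ~ D
        for _ in range(num_samples):
            response, log_prob = policy.sample(prompt)   # a ~ pi_theta(.|s)
            reward = reward_fn(prompt, response)         # R(s, a)
            # REINFORCE: the gradient of reward * log_prob w.r.t. theta is an
            # unbiased estimate of grad J(theta).
            surrogate = surrogate + reward * log_prob
            n += 1
    return surrogate / n   # maximize this, e.g. loss = -reinforce_surrogate(...)
```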
However, agent tasks involve multi-turn interactions and environmental randomness. We model this as a Markov Decision Process (MDP), $\mathcal{M} = (S, A, P, r)$, where for LLMs both states $S$ and actions $A$ are token sequences. At each turn, the agent's policy $\pi_\theta$ chooses an action $a_t \sim \pi_\theta(\cdot|s_t, \text{history})$, and the environment transitions to a new state $s_{t+1} \sim P(\cdot|s_t, a_t)$ while emitting a reward $r(s_t, a_t)$. The objective then becomes maximizing the expected cumulative reward over the entire interaction:
$$ J_{\text{Interactive}}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, a_t \sim \pi_\theta(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[ \sum_t r(s_t, a_t) \right] $$
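A sketch of how such a multi-turn rollout accumulates $\sum_t r(s_t, a_t)$, assuming a Gym-style `env.reset()`/`env.step()` wrapper and a hypothetical `policy.act(state, history)` method (both are illustrative assumptions, not a specific library):

```python
def rollout(policy, env, max_turns=8):
    """Run one multi-turn episode and accumulate its cumulative reward.

    Assumed (hypothetical) interfaces:
      - env.reset() -> initial state s_0 (a token sequence).
      - env.step(action) -> (next_state, reward, done), i.e. s_{t+1} ~ P(.|s_t, a_t)
        plus the scalar reward r(s_t, a_t).
      - policy.act(state, history) -> action a_t ~ pi_theta(.|s_t, history).
    """
    state = env.reset()                              # s_0 ~ D
    history, total_reward = [], 0.0
    for _ in range(max_turns):
        action = policy.act(state, history)          # a_t ~ pi_theta(.|s_t, history)
        next_state, reward, done = env.step(action)  # s_{t+1} ~ P(.|s_t, a_t)
        total_reward += reward                       # accumulate r(s_t, a_t)
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history, total_reward                     # sum_t r(s_t, a_t)
```

The collected `history` is what a policy-gradient update would later score against the episode's total reward.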
This objective trains agents to pursue long-term goals in dynamic environments.
How do we use the rewards $r(s_t, a_t)$ to effectively update the LLM policy $\pi_\theta$? Raw rewards can be noisy and produce high-variance gradient estimates. Modern RL relies on a few key concepts to keep learning stable: