For simple, single-turn tasks, the RL objective is to maximize the expected reward $R(s, a)$ for generating a response $a$ to a prompt $s$ under a policy $\pi_\theta$:
$$ J(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot|s)}\left[ R(s, a) \right] $$
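As a rough illustration, here is a minimal sketch of optimizing this objective with a Monte Carlo (REINFORCE-style) surrogate. The `policy.sample` and `reward_fn` interfaces are hypothetical placeholders, not an existing API:

```python
def reinforce_surrogate(policy, reward_fn, prompts, num_samples=4):
    """Monte Carlo estimate of the single-turn objective J(theta) = E[R(s, a)].

    Assumed (hypothetical) interfaces:
      - policy.sample(prompt) -> (response, log_prob), where log_prob is the
        differentiable sum of token log-probabilities under pi_theta.
      - reward_fn(prompt, response) -> scalar reward R(s, a).
    """
    surrogate, n = 0.0, 0
    for prompt in prompts:                               # s ~ D
        for _ in range(num_samples):
            response, log_prob = policy.sample(prompt)   # a ~ pi_theta(.|s)
            reward = reward_fn(prompt, response)         # R(s, a)
            # REINFORCE: the gradient of reward * log_prob w.r.t. theta is an
            # unbiased estimate of grad J(theta).
            surrogate = surrogate + reward * log_prob
            n += 1
    return surrogate / n   # maximize this, e.g. loss = -reinforce_surrogate(...)
```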
However, agent tasks involve multi-turn interactions and environmental randomness. We model this as a Markov Decision Process (MDP), $\mathcal{M} = (S, A, P, r)$, where for LLMs both states $S$ and actions $A$ are token sequences. At each turn, the agent's policy $\pi_\theta$ chooses an action $a_t \sim \pi_\theta(\cdot|s_t, \text{history})$, and the environment transitions to a new state $s_{t+1} \sim P(\cdot|s_t, a_t)$ while emitting a reward $r(s_t, a_t)$. The objective then becomes maximizing the expected cumulative reward over the entire interaction:
$$ J_{\text{Interactive}}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, a_t \sim \pi_\theta(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t, a_t)}\left[ \sum_t r(s_t, a_t) \right] $$
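A sketch of how such a multi-turn rollout accumulates $\sum_t r(s_t, a_t)$, assuming a Gym-style `env.reset()`/`env.step()` wrapper and a hypothetical `policy.act(state, history)` method (both are illustrative assumptions, not a specific library):

```python
def rollout(policy, env, max_turns=8):
    """Run one multi-turn episode and accumulate its cumulative reward.

    Assumed (hypothetical) interfaces:
      - env.reset() -> initial state s_0 (a token sequence).
      - env.step(action) -> (next_state, reward, done), i.e. s_{t+1} ~ P(.|s_t, a_t)
        plus the scalar reward r(s_t, a_t).
      - policy.act(state, history) -> action a_t ~ pi_theta(.|s_t, history).
    """
    state = env.reset()                              # s_0 ~ D
    history, total_reward = [], 0.0
    for _ in range(max_turns):
        action = policy.act(state, history)          # a_t ~ pi_theta(.|s_t, history)
        next_state, reward, done = env.step(action)  # s_{t+1} ~ P(.|s_t, a_t)
        total_reward += reward                       # accumulate r(s_t, a_t)
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history, total_reward                     # sum_t r(s_t, a_t)
```

The collected `history` is what a policy-gradient update would later score against the episode's total reward.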
This objective trains agents to pursue long-term goals in dynamic environments.
How do we use the rewards $r(s_t, a_t)$ to effectively update the LLM policy $\pi_\theta$? Raw rewards can be noisy and produce high-variance gradient estimates. Modern RL relies on a few key concepts to keep learning stable: