Northwestern MLL Lab

Code: https://github.com/RAGEN-AI/VAGEN

Blog: https://mll-lab.notion.site/vagen

Experiment Log: https://api.wandb.ai/links/ragen-V/nlb40e7l

— Mar 25, 2025

<aside> 💡

We introduce VAGEN, a multi-turn RL framework for training Vision-Language Model (VLM) agents. Recent RL frameworks for LLM agents, including RAGEN, Search-R1, Agent-R1, and OpenManus-RL, train agents with Reason-Interaction Chain Optimization (RICO), which treats all tokens in the trajectory equally in the training objective.

While RICO is effective for language-only agents, we find it inefficient for VLM agents, possibly due to two key limitations: (1) most VLMs are not pretrained to generate image tokens, causing a distribution shift when RL fine-tuning with RICO introduces multimodal outputs; (2) visual tasks suffer from state redundancy, as long-context image inputs carry excessive low-level detail, undermining RICO's state-learning objective.

Our approach addresses these challenges with the Turn-aware Reason-Interaction Chain Optimization (TRICO) algorithm. TRICO outperforms previous approaches on visual agentic tasks through: (1) selective token masking that focuses optimization on action-critical tokens, and (2) cross-turn credit assignment that uses step rewards and bi-level advantage estimation. Experiments show that both mechanisms yield significant improvements.

</aside>

VAGEN: Multi-turn VLM Agent RL Framework

Overview of the TRICO algorithm in VAGEN, which extends RICO with two key innovations: (1) selective token masking that focuses optimization on action-critical tokens through $M^{\text{loss}}$ and $M^{\text{adv}}$ masks, and (2) cross-turn credit assignment that uses different discount factors ($\gamma_{\text{turn}}$ and $\gamma_{\text{token}}$) for cross-turn and within-turn calculations, enabling more effective credit assignment in multi-turn visual agentic tasks.

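To make the masking component above concrete, here is a minimal sketch (not the VAGEN implementation; tensor names, shapes, and the clipping constant are assumptions) of how the $M^{\text{loss}}$ and $M^{\text{adv}}$ masks could enter a PPO-style token loss:

```python
import torch

# Illustrative sketch only; not the VAGEN/TRICO implementation.
def masked_ppo_loss(logprobs: torch.Tensor,      # (T,) new per-token log-probs
                    old_logprobs: torch.Tensor,  # (T,) per-token log-probs at rollout time
                    advantages: torch.Tensor,    # (T,) token-level advantages
                    loss_mask: torch.Tensor,     # (T,) M^loss: 1 for tokens kept in the loss
                    adv_mask: torch.Tensor,      # (T,) M^adv: 1 for tokens that receive advantage
                    clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped objective restricted to action-critical tokens."""
    advantages = advantages * adv_mask            # zero out advantages of non-critical tokens
    ratio = torch.exp(logprobs - old_logprobs)    # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Average only over the tokens selected by M^loss.
    return (per_token_loss * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```

Zeroing advantages with $M^{\text{adv}}$ keeps non-critical tokens from steering the policy gradient, while $M^{\text{loss}}$ excludes them from the loss average entirely.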

Methodology Overview

RICO extends traditional PPO by treating multi-turn agent-environment interactions as a unified token stream (highlighted in red). It processes the entire interaction trajectory as a single sequence, enabling end-to-end training of both reasoning and action selection with rewards propagated through all tokens.


1. Background of RICO: Reason-Interaction Chain Optimization

Recent advancements in reinforcement learning for language model agents have introduced RICO (Reason-Interaction Chain Optimization), which treats multi-turn interactions as a unified token stream. RICO processes the entire interaction trajectory $(s_0, a_0^T, s_1, a_1^T, \ldots, s_K, a_K^T)$ as a single sequence, where $s_k$ represents environment states and $a_k^T$ denotes agent responses with reasoning.
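
For illustration, flattening such a trajectory into one sequence could look like the sketch below; the `Turn` container and its field names are assumptions, and `tokenizer` stands for any HuggingFace-style tokenizer:

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only; not the VAGEN implementation.
@dataclass
class Turn:
    state_text: str     # rendered environment observation s_k
    response_text: str  # agent output a_k^T, e.g. "<think>...</think><ans>...</ans><eoa>"

def serialize_trajectory(turns: List[Turn], tokenizer) -> List[int]:
    """Concatenate all states and agent responses into one token stream."""
    token_ids: List[int] = []
    for turn in turns:
        token_ids += tokenizer.encode(turn.state_text, add_special_tokens=False)
        token_ids += tokenizer.encode(turn.response_text, add_special_tokens=False)
    return token_ids
```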

RICO structures agent responses as `<think>...</think><ans>...</ans><eoa>`, where the thinking component lets the agent reason explicitly before acting, while only the `<ans>` portion is executed in the environment. RICO is a general recipe that can be layered on top of existing RL algorithms for LLM training. Without loss of generality, the PPO variant of RICO applies the standard PPO objective uniformly across all tokens, with advantages computed by Generalized Advantage Estimation (GAE) under a single discount factor $\gamma$. The final environment reward is propagated back through the entire token sequence, supplemented by a KL-divergence penalty that keeps the policy close to a reference policy.
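
For intuition, the sketch below shows this style of credit assignment (variable names are assumptions; the KL term is omitted): the scalar environment reward is attached to the final token, and GAE with a single $\gamma$ propagates it backward through the whole flattened trajectory.

```python
import torch

# Illustrative sketch only; not the exact RICO/VAGEN code.
def gae_single_gamma(values: torch.Tensor,
                     final_reward: float,
                     gamma: float = 1.0,
                     lam: float = 0.95) -> torch.Tensor:
    """Token-level GAE with one discount factor, as in the PPO variant of RICO.

    values: (T,) value estimates, one per token of the flattened trajectory.
    Only the last token carries the environment reward; all earlier per-token
    rewards are zero (a KL penalty would be added to them in practice).
    """
    T = values.shape[0]
    rewards = torch.zeros(T)
    rewards[-1] = final_reward

    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```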

While effective for language-only agents, RICO treats all tokens equally, which becomes suboptimal for multimodal VLM agents, where action-critical tokens carry far more weight in decision-making than the redundant state tokens that dominate the context.

Algorithm details of TRICO. Selective token masking (highlighted in orange) focuses optimization on action-critical tokens. Cross-turn credit assignment (highlighted in purple) uses different discount factors for cross-turn and within-turn calculations, enabling more effective credit assignment in multi-turn visual agentic tasks.

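As a hedged sketch of the bi-level idea (a simplification, not the exact TRICO estimator; function and variable names are assumptions): advantages can first be computed across turns with $\gamma_{\text{turn}}$ from per-turn step rewards and turn-level values, and each turn's advantage can then be distributed over its tokens with $\gamma_{\text{token}}$.

```python
import torch
from typing import List

# Illustrative sketch only; a simplified bi-level scheme, not the exact TRICO estimator.
def gae(rewards: torch.Tensor, values: torch.Tensor, gamma: float, lam: float) -> torch.Tensor:
    """Standard GAE over a 1-D sequence of rewards and value estimates."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def bi_level_advantages(step_rewards: torch.Tensor,       # (K,) step reward per turn
                        turn_values: torch.Tensor,        # (K,) value estimate per turn
                        token_values: List[torch.Tensor], # per-turn (T_k,) token value estimates
                        gamma_turn: float = 0.95,
                        gamma_token: float = 1.0,
                        lam: float = 0.95) -> List[torch.Tensor]:
    """Cross-turn credit assignment: turn-level GAE, then within-turn GAE.

    Each turn's advantage is injected as the terminal reward for that turn's
    tokens, so credit flows across turns with gamma_turn and then within each
    turn with gamma_token.
    """
    turn_adv = gae(step_rewards, turn_values, gamma_turn, lam)   # (K,)
    token_adv = []
    for k, values_k in enumerate(token_values):
        rewards_k = torch.zeros_like(values_k)
        rewards_k[-1] = turn_adv[k]   # inject the turn's credit at its last token
        token_adv.append(gae(rewards_k, values_k, gamma_token, lam))
    return token_adv
```

Splitting the discounting this way lets credit decay across turns at a different rate than it decays across tokens within a single response.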

2. TRICO: Turn-aware Reason-Interaction Chain Optimization

In our early experiments, we find RICO inefficient for visual agent training due to two key limitations: (1) most VLMs do not learn to output image tokens during pretraining, creating a distribution shift when RL fine-tuning tries to learn multimodal rather than text-only outputs; (2) visual agentic tasks contain significant state redundancy, with long-context visual inputs and excessive low-level information, making RICO's state-learning objective inefficient.

To improve visual agent training with RL, we introduce TRICO (Turn-aware Reason-Interaction Chain Optimization), which extends RICO with two key innovations specifically designed for VLM agents: