Agentic & tool-use RL
Multi-turn trajectories and the 2026 frontier
Papers: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey — Zhang et al., 2025 · Search-R1, ReTool, ToRL / Tool-Star (2025)
Everything so far has optimized a single response: prompt in, one answer out, reward on that answer. But the systems people actually want in 2026 don’t just answer — they act. They search the web, run code, read the result, and decide what to do next, looping until the task is done. Training a model to be good at that is a genuinely different problem, and it’s the live frontier of post-training. This chapter defines the shift and explains why it’s hard.
From answer to trajectory
Agentic RL reframes the model from a text generator into a policy acting in an environment. Instead of producing one completion, the model produces a trajectory: a sequence of interleaved actions and observations. It writes a search query (action), receives results (observation), reasons over them, writes another query or calls a tool, observes again, and eventually emits a final answer. The reward usually lands only at the end (was the final answer correct?), and the training problem is to push the whole sequence of decisions toward trajectories that succeed.
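To make the action/observation loop concrete, here is a minimal sketch of one rollout. Everything in it is an illustrative assumption rather than any paper's implementation: the `Step` and `Trajectory` types, the `ANSWER:` prefix that marks a final answer, and the toy policy and search function standing in for a real model and a real environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    action: str       # text the policy emitted: a query, a tool call, or the final answer
    observation: str  # what the environment returned ("" for the final answer)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # outcome reward, set only when the episode ends


def rollout(policy: Callable[[str], str],
            env_step: Callable[[str], str],
            check_answer: Callable[[str], bool],
            question: str,
            max_turns: int = 8) -> Trajectory:
    """Act, observe, repeat; the reward arrives only with the final answer."""
    traj = Trajectory()
    context = question
    for _ in range(max_turns):
        action = policy(context)
        if action.startswith("ANSWER:"):  # hypothetical end-of-episode marker
            answer = action[len("ANSWER:"):].strip()
            traj.steps.append(Step(action, ""))
            traj.reward = 1.0 if check_answer(answer) else 0.0
            break
        observation = env_step(action)
        traj.steps.append(Step(action, observation))
        # The next decision is conditioned on everything seen so far.
        context += f"\n{action}\n{observation}"
    return traj  # if the turn budget runs out, the reward stays 0.0


# Toy stand-ins for a model and a search environment (assumptions).
def toy_policy(context: str) -> str:
    if "Paris" in context:
        return "ANSWER: Paris"
    return "SEARCH: capital of France"


def toy_search(action: str) -> str:
    return "Paris is the capital of France."


traj = rollout(toy_policy, toy_search, lambda a: a == "Paris",
               "What is the capital of France?")
```

Note that the first step earns no reward of its own here: the search only pays off because the final answer it enabled was correct.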
Tool-use RL is the most concrete instance: the model’s action space includes calling external tools (a calculator, a code interpreter, a retriever, an API), and the environment returns their outputs as observations the model must then incorporate. This is the engine behind a wave of 2025 systems:
- Search-R1 trains the model to interleave reasoning with live search queries, learning when to look something up and how to use what it finds — RL over a retrieval-augmented trajectory.
- ReTool does the same for a code interpreter: the model learns to write code, run it, read the output (including errors), and use the result to reach the answer.
- ToRL / Tool-Star generalize this to multiple tools, with the model learning which tool to reach for and how to chain calls.
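On the environment side, "calling a tool" comes down to parsing the model's output, dispatching to the right function, and handing the result back as text. The sketch below assumes a made-up `<tool>name: argument</tool>` call format and two toy tools; none of it is the interface used by Search-R1, ReTool, ToRL, or Tool-Star. The one design choice worth noticing is that failures come back as observations instead of crashing the rollout, so the model can read an error and react to it.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expr: str) -> str:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expr, mode="eval").body))


def search(query: str) -> str:
    """Stand-in for a retriever: a one-entry lookup table."""
    index = {"capital of france": "Paris is the capital of France."}
    return index.get(query.lower(), "no results")


TOOLS = {"calc": calculator, "search": search}

# Hypothetical call syntax: <tool>name: argument</tool>
CALL = re.compile(r"<tool>(\w+):(.*?)</tool>", re.DOTALL)


def run_tool_call(model_output: str) -> str:
    """Return the observation for the first tool call in the model's output."""
    match = CALL.search(model_output)
    if match is None:
        return "error: no tool call found"
    name, arg = match.group(1), match.group(2).strip()
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](arg)
    except Exception as exc:
        # Errors are observations too: the model sees them and can retry.
        return f"error: {type(exc).__name__}: {exc}"


ok = run_tool_call("I should compute this. <tool>calc: 6 * 7</tool>")
bad = run_tool_call("<tool>calc: 1 / 0</tool>")
```

Here `ok` is the string `"42"`, while `bad` is an error message the model would see in place of a result.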
Multi-turn RL is the umbrella: optimizing over many rounds of model–environment interaction rather than a single turn. Once there are many turns, a new question appears: how do you assign credit to individual turns? A turn-level reward scores intermediate steps (was this search query useful? did this tool call move things forward?) rather than waiting for the final outcome, giving the optimizer a denser signal to work with.
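A small numeric sketch shows what "denser" buys you. It computes the return-to-go for each turn (the reward at that turn plus everything after it), first with an outcome-only reward and then with turn-level rewards added. The reward values and the discount `gamma` are made-up numbers for illustration, not taken from any of the cited systems.

```python
def returns_to_go(turn_rewards, gamma=1.0):
    """Return G_t = r_t + gamma * G_{t+1} for every turn t."""
    out, running = [], 0.0
    for r in reversed(turn_rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# Four turns, correct final answer. Outcome-only: just the last turn is rewarded.
outcome_only = [0.0, 0.0, 0.0, 1.0]

# Same episode with hypothetical turn-level scores: a useful first search (+0.5),
# a wasted second tool call (-0.25), a neutral third turn, then the outcome.
turn_level = [0.5, -0.25, 0.0, 1.0]

g_outcome = returns_to_go(outcome_only)  # [1.0, 1.0, 1.0, 1.0]
g_turn = returns_to_go(turn_level)       # [1.25, 0.75, 1.0, 1.0]
```

With the outcome-only reward every turn looks equally good. With turn-level rewards the signal starts to separate turns: the return from the first turn onward is higher because it includes the useful search, and the return from the second turn onward is lower because it starts at the wasted call.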
Why this is genuinely hard
The reasoning-RL machinery from earlier chapters — GRPO, verifiable rewards, group baselines — mostly carries over. What breaks is everything around the reward.
- Credit assignment across turns. If a 20-step agent fails, which step was the mistake? The bad answer might trace to a search query made fifteen steps earlier. Advantage estimation over long horizons is exactly the regime where the group-mean baseline is weakest and a trained value function (the VAPO impulse) starts to look attractive again.
- Sparse, long-horizon rewards. A single binary “did the task succeed” signal at the end of a long trajectory is a very thin gradient to learn from — most rollouts fail early and teach little. This is the long-standing sparse-reward problem of RL, now with trajectories thousands of tokens long.
- Environment design. You can’t do agentic RL without an environment to act in — a sandboxed shell, a search index, a code runner — and it must be fast (you’ll run millions of rollouts), reproducible, and safe to let a half-trained model loose in. The environment is now part of the training stack, not an afterthought.
- Verifiable rewards for agents. Math and code gave clean verifiers. “Did the agent successfully book the trip / fix the bug / research the question” is far murkier: checking it is itself a hard problem, and a sloppy checker is an open invitation to reward hacking.
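The first two problems in the list can be seen in a few lines of arithmetic. The sketch below uses a simplified group-mean baseline (reward minus the group average; GRPO additionally divides by the group's standard deviation, which is omitted here) and copies each rollout's advantage onto all of its turns. The function names and the example numbers are illustrative assumptions.

```python
def group_advantages(rewards):
    """Advantage of each rollout: its outcome reward minus the group mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def per_turn_advantages(rewards, turns_per_rollout):
    """With only an outcome reward, every turn inherits its rollout's advantage."""
    return [[adv] * n
            for adv, n in zip(group_advantages(rewards), turns_per_rollout)]


# Sparse reward: every rollout in the group fails, so every advantage is zero
# and the group contributes no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# One success in four: the signal exists, but it is per rollout, not per turn.
one_success = group_advantages([0.0, 0.0, 1.0, 0.0])

# Credit assignment: the failed 20-turn rollout gets the same -0.25 on all
# 20 turns, whether the mistake was at turn 5 or turn 19.
spread = per_turn_advantages([0.0, 0.0, 1.0, 0.0], [3, 20, 5, 2])
```

Here `all_fail` is `[0.0, 0.0, 0.0, 0.0]` and `one_success` is `[-0.25, -0.25, 0.75, -0.25]`. A learned value function or turn-level rewards are two ways to break that uniform per-turn signal.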
Where the field is heading
The agentic-RL survey (Zhang et al., 2025) frames this as a coherent research program rather than a grab-bag of demos, and its through-line points at 2026. Expect the verifiable-reward idea to be pushed into ever-messier domains via better automatic graders and AI-feedback judges; expect environments (browsers, IDEs, operating systems) to become shared infrastructure the way datasets once did; and expect the critic-vs-critic-free pendulum to keep swinging as horizons lengthen. The unit of optimization has moved from the token (pre-training) to the response (RLHF and reasoning) to the trajectory (agents), and each move up that ladder has been the field’s main story for a few years at a time.
The final chapter steps back to reassemble the whole pipeline we’ve built, name the handful of levers that explain it, and point you at the papers in the order to read them.