Section 26

Agentic & tool-use RL

Multi-turn trajectories and the 2026 frontier

Papers: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey — Zhang et al., 2025 · Search-R1, ReTool, ToRL / Tool-Star (2025)

Everything so far has optimized a single response: prompt in, one answer out, reward on that answer. But the systems people actually want in 2026 don’t just answer — they act. They search the web, run code, read the result, and decide what to do next, looping until the task is done. Training a model to be good at that is a genuinely different problem, and it’s the live frontier of post-training. This chapter defines the shift and explains why it’s hard.

From answer to trajectory

Agentic RL reframes the model from a text generator into a policy acting in an environment. Instead of producing one completion, the model produces a trajectory: a sequence of interleaved actions and observations. It writes a search query (action), receives results (observation), reasons over them, writes another query or calls a tool, observes again, and eventually emits a final answer. The reward usually lands only at the end (was the final answer correct?), and the training problem is to push the whole sequence of decisions toward trajectories that succeed.
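To make the action/observation loop concrete, here is a minimal sketch of a single rollout. Everything in it is a hypothetical stand-in: ToyLookupEnv, the "search: " / "answer: " action prefixes, and the scripted policy are illustrative assumptions, not the interface of any of the systems discussed here. In real training the policy would be the language model and the environment a live tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (action, observation) pairs seen so far


@dataclass
class Trajectory:
    steps: History = field(default_factory=list)
    final_answer: str = ""
    reward: float = 0.0


class ToyLookupEnv:
    """Hypothetical environment: answers 'search: <key>' from a small dict."""

    def __init__(self, facts: Dict[str, str], target: str) -> None:
        self.facts = facts
        self.target = target

    def step(self, action: str) -> str:
        key = action[len("search: "):]
        return self.facts.get(key, "no results")

    def score(self, answer: str) -> float:
        # Outcome reward: 1.0 only if the final answer matches the target.
        return 1.0 if answer == self.target else 0.0


def rollout(policy: Callable[[History], str], env: ToyLookupEnv,
            max_turns: int = 8) -> Trajectory:
    traj = Trajectory()
    for _ in range(max_turns):
        action = policy(traj.steps)
        if action.startswith("answer: "):
            traj.final_answer = action[len("answer: "):]
            break
        observation = env.step(action)
        traj.steps.append((action, observation))
    # The reward lands only once, after the trajectory has ended.
    traj.reward = env.score(traj.final_answer)
    return traj


def scripted_policy(history: History) -> str:
    """Stand-in for the model: search once, then answer with the result."""
    if not history:
        return "search: capital of France"
    return "answer: " + history[-1][1]


env = ToyLookupEnv({"capital of France": "Paris"}, target="Paris")
traj = rollout(scripted_policy, env)
print(traj.steps, traj.final_answer, traj.reward)
```

Note that a policy which never emits an answer within max_turns ends with an empty final answer and a reward of 0.0, which is one simple way to handle trajectories that run out of budget.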

Tool-use RL is the most concrete instance: the model’s action space includes calling external tools — a calculator, a code interpreter, a retriever, an API — and the environment returns their outputs as observations the model must then incorporate. This is the engine behind a wave of 2025 systems:

Search-R1 trains the model to interleave reasoning with live search queries, learning when to look something up and how to use what it finds — RL over a retrieval-augmented trajectory.
ReTool does the same for a code interpreter: the model learns to write code, run it, read the output (including errors), and use the result to reach the answer.
ToRL / Tool-Star generalize this to multiple tools, with the model learning which tool to reach for and how to chain calls.
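A rough sketch of what "tools in the action space" can look like on the environment side is below. The tool names, the TOOLS registry, and call_tool are assumptions made up for illustration and are not taken from Search-R1, ReTool, ToRL, or Tool-Star. The one design point worth noticing is that tool failures come back as observations rather than crashing the rollout, so the model can learn to read and recover from errors.

```python
import ast
import operator
from typing import Callable, Dict

# Arithmetic operators the toy calculator is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculator(expr: str) -> str:
    return str(_eval(ast.parse(expr, mode="eval").body))


def retriever(query: str) -> str:
    # Hypothetical one-document "index" standing in for a real search backend.
    docs = {"boiling point of water": "100 C at 1 atm"}
    return docs.get(query, "no results")


TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": calculator,
    "retriever": retriever,
}


def call_tool(name: str, arg: str) -> str:
    """Run one tool call and always return a text observation."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"error: unknown tool '{name}'"
    try:
        return tool(arg)
    except Exception as exc:  # errors are observations, not crashes
        return f"error: {type(exc).__name__}: {exc}"


print(call_tool("calculator", "2 * (3 + 4)"))
print(call_tool("calculator", "1 / 0"))
print(call_tool("shell", "ls"))
```

Choosing which entry of the registry to call, and in what order, is exactly the decision the multi-tool systems are trained to make.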

Multi-turn RL is the umbrella: optimizing over many rounds of model–environment interaction rather than a single turn. And once there are many turns, a new question appears — how do you assign credit to individual turns? A turn-level reward scores intermediate steps (was this search query useful? did this tool call move things forward?) rather than waiting for the final outcome, giving the optimizer a denser signal to work with.
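The difference between outcome-only and turn-level rewards is easy to see in the per-turn return (the sum of rewards from that turn onward). The sketch below is one simple way to combine the two, assuming the final outcome reward is added to the last turn and using a discount factor gamma. The function name and the example reward values are invented for illustration.

```python
from typing import List, Sequence


def returns_to_go(turn_rewards: Sequence[float], final_reward: float,
                  gamma: float = 1.0) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1}, with the outcome reward on the last turn."""
    rewards = list(turn_rewards)
    if not rewards:
        return []
    rewards[-1] += final_reward
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Outcome-only: every turn sees the same signal, however useful it was.
print(returns_to_go([0.0, 0.0, 0.0], final_reward=1.0))    # [1.0, 1.0, 1.0]

# Turn-level: a useful first search (0.5) and a helpful last call (0.25)
# give the turns different returns even though the outcome is identical.
print(returns_to_go([0.5, 0.0, 0.25], final_reward=1.0))   # [1.75, 1.25, 1.25]
```

With outcome-only rewards and no discounting, the returns cannot distinguish a helpful turn from a wasted one. The turn-level rewards are what break that tie.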

Why this is genuinely hard

The reasoning-RL machinery from earlier chapters — GRPO, verifiable rewards, group baselines — mostly carries over. What breaks is everything around the reward.

Credit assignment across turns. If a 20-step agent fails, which step was the mistake? The bad answer might trace to a search query made fifteen steps earlier. Advantage estimation over long horizons is exactly the regime where the group-mean baseline is weakest and a trained value function (the VAPO impulse) starts to look attractive again.
Sparse, long-horizon rewards. A single binary “did the task succeed” signal at the end of a long trajectory gives the optimizer very little to learn from: most rollouts fail early and teach little. This is the long-standing sparse-reward problem of RL, now with trajectories thousands of tokens long.
Environment design. You can’t do agentic RL without an environment to act in — a sandboxed shell, a search index, a code runner — and it must be fast (you’ll run millions of rollouts), reproducible, and safe to let a half-trained model loose in. The environment is now part of the training stack, not an afterthought.
Verifiable rewards for agents. Math and code gave clean verifiers. “Did the agent successfully book the trip / fix the bug / research the question” is far murkier: checking it is itself a hard problem, and a sloppy checker is an open invitation to reward hacking.
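The first two problems show up directly in the arithmetic of a group-mean baseline. The sketch below uses a GRPO-style normalization (subtract the group mean, divide by the group standard deviation). The exact normalization varies between implementations, so treat the use of the population standard deviation and the eps value as assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """One advantage per rollout: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse rewards: if every rollout in the group fails, every advantage is
# zero and the group contributes no gradient at all.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
print(all_fail)

# One success out of four: the successful rollout gets a positive advantage
# and the failures a negative one.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in mixed])

# Credit assignment: with only a trajectory-level advantage, every turn of a
# rollout inherits the same number, whichever turn actually caused the outcome.
turns_per_rollout = [20, 20, 20, 20]  # hypothetical 20-step agents
per_turn = [[a] * n for a, n in zip(mixed, turns_per_rollout)]
print(len(per_turn[1]), len(set(per_turn[1])))
```

The last line prints 20 and 1: twenty turns, one distinct advantage value. A trained value function or turn-level rewards are two ways to give those twenty turns different signals.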

Where the field is heading

The agentic-RL survey (Zhang et al., 2025) frames this as a coherent research program rather than a grab-bag of demos, and its through-line points at 2026. Expect the verifiable-reward idea to be pushed into ever-messier domains via better automatic graders and AI-feedback judges; expect environments (browsers, IDEs, operating systems) to become shared infrastructure the way datasets once did; and expect the critic-vs-critic-free pendulum to keep swinging as horizons lengthen. The unit of optimization has moved from the token (pre-training) to the response (RLHF and reasoning) to the trajectory (agents) — and each move up that ladder has been the field’s main story for a few years at a time.

The final chapter steps back to reassemble the whole pipeline we’ve built, name the handful of levers that explain it, and point you at the papers in the order to read them.