Human intelligence involves metacognitive abilities such as self-regulation, recognizing one's limitations, and seeking assistance only when needed. While LLM agents excel in many domains, they often lack this awareness: overconfident agents risk catastrophic failures, while agents that seek help excessively hinder efficiency. A key challenge is enabling agents with a limited intervention budget to decide when to request assistance.
This involves balancing reward design and policy optimization: overly incentivizing help requests exhausts the budget prematurely, while under-incentivizing them leads to avoiding help altogether. Designing effective reward functions is non-trivial and can require costly iterations of training, evaluation, and adjustment. Similarly, generating annotated trajectories for supervised fine-tuning under budget constraints is resource-intensive: the trajectory space is exponentially large, and even human annotators can struggle to identify optimal intervention timing for each budget constraint.
In this paper, we propose an offline framework that trains a "helper" policy to request interventions, such as more powerful models or test-time compute, by combining LLM-based process reward models (PRMs) with tabular reinforcement learning. Using state transitions collected offline, we score optimal intervention timing with PRMs and train the helper model on these labeled trajectories. This offline approach significantly reduces costly intervention calls during training. Furthermore, integrating PRMs with tabular RL enhances robustness to off-policy data while avoiding the inefficiencies of deep RL. We empirically find that our method delivers optimal helper behavior.
Task

We use the Situated Instruction Following (SIF) task, which requires finding objects, interacting with humans, and performing household tasks in highly uncertain and dynamic environments. To the best of our knowledge, SIF is among the most suitable benchmarks for evaluating how well LLM-driven agents handle nuanced and uncertain instructions. The environment and the instructions become uncertain because the speaker’s intent is not always fully specified, and the human may dynamically alter the scene (e.g., placing an object in a different room or moving to a new location). Even advanced models like GPT-4o struggle with these tasks due to such inherent ambiguities.
We focus on two key strategies that can achieve reliable agent behaviors: self-regulation and requesting interventions.
Self-regulation involves the agent autonomously deciding to stop task execution when it cannot successfully complete the task. This prevents the agent from continuing in scenarios where failure is likely, thereby conserving resources and maintaining reliability. In contrast, requesting interventions refers to the agent making state-wise requests for assistance, receiving help at critical points within the task.
Self-Regulation: Use of PRMs and Limitations
We measure the difficulty of a state \(s\) as \(1 - p(s)\), where \(p(s)\) is the probability of success from \(s\) up to a terminal state. To decide when to self-regulate, we train a Process Reward Model (PRM) by rolling out trajectories of base actors. Concretely, a 3B LLaMA model with a scalar head is fine-tuned (SFT) on \((\text{state}, \text{outcome})\) pairs, where the outcome is binary success or failure for the trajectory originating at that state; the PRM is trained to learn \(p(s)\). Finally, we calibrate the PRM’s threshold using a held-out validation set.
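For illustration, here is a minimal sketch of this training step: a LLaMA backbone with a scalar head fine-tuned on \((\text{state}, \text{outcome})\) pairs with binary cross-entropy, so that the sigmoid of its output approximates \(p(s)\). The model identifier, data format, and hyperparameters below are assumptions for the sketch, not the exact configuration used in the paper.

```python
# Sketch of PRM training: LLaMA backbone + scalar head, binary cross-entropy on
# (state, outcome) pairs so that sigmoid(logit) approximates p(s).
# Model name, data, and hyperparameters are illustrative assumptions.
import torch
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B"  # assumed 3B backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1  # scalar head on top of the LM
)
model.config.pad_token_id = tokenizer.pad_token_id

# Each example: textual state description + binary outcome of the trajectory
# originating at that state (1.0 = eventual success, 0.0 = failure).
train_pairs = [
    ("State: agent in kitchen; instruction: 'bring the mug I used' ...", 1.0),
    ("State: agent lost the target after the human moved it ...", 0.0),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=1024, return_tensors="pt")
    enc["labels"] = torch.tensor(labels).unsqueeze(1)
    return enc

loader = DataLoader(train_pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = BCEWithLogitsLoss()

model.train()
for epoch in range(2):
    for batch in loader:
        labels = batch.pop("labels")
        logits = model(**batch).logits      # shape (B, 1)
        loss = loss_fn(logits, labels)      # PRM learns p(s)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# At inference time, sigmoid(logit) is the PRM score (≈ p(s)); difficulty is 1 - p(s).
```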
To evaluate self-regulation performance, we first identify the maximum \((1 - \text{PRM score})\) encountered in a trajectory before the final step. This value serves as a proxy for \(1 - p(s)\) and reflects the PRM’s estimation of the most difficult state in the trajectory. We then set a threshold on this maximum \((1 - \text{PRM score})\) to optimize overall accuracy (binary success/failure) on a held-out validation set. Table 1 reports the accuracy, precision, and recall metrics (labeling task success as 1) on the test set for two base actors, GPT-4o-mini and LLaMA, each with its own separately trained PRM. These results demonstrate high precision and recall, indicating that the PRM score is indeed a strong indicator of \(p(s)\).
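A minimal sketch of this evaluation rule, assuming per-state PRM scores are available for each trajectory (the variable names and threshold grid are illustrative):

```python
# Sketch of the self-regulation decision rule: take the maximum (1 - PRM score)
# over pre-final states as a proxy for the hardest state, then choose the
# threshold that maximizes binary success/failure accuracy on a validation set.
import numpy as np

def trajectory_difficulty(prm_scores):
    """Max (1 - PRM score) over all states before the final step."""
    return max(1.0 - s for s in prm_scores[:-1])

def calibrate_threshold(val_trajs, val_labels, grid=np.linspace(0, 1, 101)):
    """val_trajs: list of per-state PRM score lists; val_labels: 1 = success."""
    difficulties = np.array([trajectory_difficulty(t) for t in val_trajs])
    labels = np.array(val_labels)
    best_thr, best_acc = 0.5, -1.0
    for thr in grid:
        preds = (difficulties < thr).astype(int)  # below threshold -> predict success
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

def should_self_regulate(prm_scores_so_far, thr):
    """Stop task execution once any visited state looks too difficult."""
    return max(1.0 - s for s in prm_scores_so_far) >= thr
```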
Our Method for Requesting Interventions

The key components of our method are as follows (Fig. 4): (1) Transition Model Collection, (2) Dynamic Programming (DP) for Usage/Policy Iteration (Fig. 4a), (3) Reward Search (Fig. 4b), and (4) Final Training (Fig. 4c). We derive the usage/policy iteration algorithm, effectively equivalent to value iteration, by decomposing the value function into success and usage components. Unlike standard value iteration, usage/policy iteration is offline and integrates PRMs for robust classical RL.
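In our notation (which may differ from the exact formulation in the paper), this decomposition can be written as
\[
V^{\pi}(s) \;=\; V^{\pi}_{\text{succ}}(s) \;-\; r\, V^{\pi}_{\text{usage}}(s),
\]
where \(V^{\pi}_{\text{succ}}(s)\) is the expected task success from state \(s\) under the helper policy \(\pi\), \(V^{\pi}_{\text{usage}}(s)\) is the expected number of interventions requested from \(s\) onward, and \(r\) is the reward scale that trades off the two.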
This formulation gives us several advantages. First, the DP step is extremely fast, completing in minutes without requiring GPUs or additional intervention requests; we do not have to train the policy itself during reward search, only repeat the quick DP process. Second, the offline nature enables adaptability across budgets: intervention data needs to be collected only once, during transition model collection, and can be reused to obtain trajectories for different budgets. Third, using learned PRMs with tabular RL enhances robustness while avoiding the inefficiencies of deep RL.
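To make these steps concrete, the sketch below implements a simplified tabular version under assumptions of our own: a discrete offline transition model \(T\) with "act" and "intervene" actions, PRM-derived success probabilities at terminal states, and a scalar intervention penalty \(r\). It is not the paper's exact usage/policy iteration derivation, but it shows how a fast DP pass plus a sweep over \(r\) can yield policies with different expected usages \(\mathbb{E}[U]\).

```python
# Simplified tabular sketch of the DP and reward-search steps (assumed MDP
# structure, not the paper's exact algorithm): each non-terminal state offers
# "act" and "intervene"; T[(s, a)] lists (next_state, prob) pairs estimated
# offline; terminal_p[s] is the PRM success probability at terminal states.
from collections import defaultdict

ACTIONS = ("act", "intervene")

def value_iteration(states, T, terminal_p, r, n_iters=200):
    """Greedy policy maximizing success minus a penalty r per intervention."""
    V = defaultdict(float)
    for s, p in terminal_p.items():
        V[s] = p                                   # terminal value = success prob
    for _ in range(n_iters):
        for s in states:
            if s in terminal_p:
                continue
            V[s] = max(-(r if a == "intervene" else 0.0)
                       + sum(p * V[ns] for ns, p in T[(s, a)]) for a in ACTIONS)
    policy = {}
    for s in states:
        if s in terminal_p:
            continue
        q = {a: -(r if a == "intervene" else 0.0)
                + sum(p * V[ns] for ns, p in T[(s, a)]) for a in ACTIONS}
        policy[s] = max(q, key=q.get)
    return policy

def evaluate(policy, states, T, terminal_p, n_iters=200):
    """Decompose the policy's value into success and usage components."""
    V_succ = defaultdict(float)   # expected success probability
    V_use = defaultdict(float)    # expected number of interventions, E[U]
    for s, p in terminal_p.items():
        V_succ[s] = p
    for _ in range(n_iters):
        for s in states:
            if s in terminal_p:
                continue
            a = policy[s]
            V_succ[s] = sum(p * V_succ[ns] for ns, p in T[(s, a)])
            V_use[s] = (a == "intervene") + sum(p * V_use[ns] for ns, p in T[(s, a)])
    return V_succ, V_use

def reward_search(states, T, terminal_p, start_states, budget,
                  grid=(0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5)):
    """Sweep the intervention penalty r; keep the best policy whose expected
    usage E[U] stays within the budget. Each sweep is just a fast DP pass."""
    best = None
    for r in grid:
        policy = value_iteration(states, T, terminal_p, r)
        V_succ, V_use = evaluate(policy, states, T, terminal_p)
        eu = sum(V_use[s] for s in start_states) / len(start_states)
        sr = sum(V_succ[s] for s in start_states) / len(start_states)
        if eu <= budget and (best is None or sr > best[1]):
            best = (r, sr, eu, policy)
    return best
```

Because each sweep is only a tabular DP over the already-collected transition model, trying many values of \(r\) takes minutes and requires no additional intervention calls; the selected trajectories can then be used to train the final helper policy.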
Please see the paper for a more detailed derivation.
Results
Table 2 compares our method to baselines in terms of success rate (\(\mathrm{SR}\)), path-length weighted success (\(\mathrm{SPL}\)), task execution length (\(L\)), observed intervention usage (\(U\)), and expected intervention usage (\(\mathbb{E}[U]\)). We train our approach with different reward scale values (high, mid, and low \(r\)), inducing varying intervention frequencies.
With just a fraction of the interventions used by a policy that always intervenes (7.8 and 4.5 calls on average), our method nearly matches that policy's performance. For example, on S_obj we achieve a \(62.5\%\) success rate using only \(1.0\) intervention on average, outperforming baselines with similar or higher usage. Moreover, \(\mathbb{E}[U]\) closely matches the observed usage, especially for smaller \(U\) (e.g., \(U = 0.4\) and \(\mathbb{E}[U] = 0.4\)). The two diverge more at higher values of \(r\), but \(\mathbb{E}[U]\) still provides a good estimate of the model's intervention usage, allowing us to select \(r\) from training data alone, without exhaustive training and evaluation.
Table 3 compares our method and baselines on the more challenging S_obj split, evaluating three intervention setups: a better model, MCTS, or both. In general, the trends from Table 2 hold here as well: our method calls interventions optimally, whether the intervention is MCTS alone or both combined, achieving higher performance than baselines while using fewer interventions (e.g., with only 0.5 MCTS calls on average, we match the success rate of a 30% PRM-calibration baseline that uses 2.5 calls).
Please see our paper for more results and discussion!
Paper and BibTeX
[Paper]
Citation: Min, S., Wu, Y., Sun, J., Kaufmann, M., Tajwar, F., Bisk, Y., & Salakhutdinov, R. (2025).
Self-Regulation and Requesting Interventions.
@misc{min2025selfregulationrequestinginterventions,
  title={Self-Regulation and Requesting Interventions},
  author={So Yeon Min and Yue Wu and Jimin Sun and Max Kaufmann and Fahim Tajwar and Yonatan Bisk and Ruslan Salakhutdinov},
  year={2025},
  eprint={2502.04576},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.04576},
}