Example tasks across Bridge V2, SIMPLER, and LIBERO.
We observe that action error consistently decreases as we scale the number of generated actions across multiple sampling approaches, assuming access to an oracle verifier. Repeatedly sampling actions from robot policies, applying Gaussian perturbation to a few sampled actions, and even randomly sampling action tokens all outperform single-attempt OpenVLA.
We also find that the relationship between action error and the number of samples generated through Gaussian perturbation follows an approximate power law across a range of VLA models, including CogACT, Octo, OpenVLA, and SpatialVLA.
For power law fitting, we model the logarithm of the action error e as a linear function of the logarithm of the number of samples k: log(e) ≈ log(a) + b * log(k), which is equivalent to the power law e ≈ a * k^b (with b < 0, since error decreases as more samples are drawn).
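As a concrete illustration of this fit, the sketch below performs the log-log least-squares regression described above; the `k` and `e` arrays are placeholder numbers for illustration, not measurements from the paper.

```python
import numpy as np

# Hypothetical measurements: number of generated samples k and the corresponding
# action error e achieved under an oracle verifier (placeholder values).
k = np.array([1, 2, 4, 8, 16, 32, 64])
e = np.array([0.42, 0.33, 0.27, 0.22, 0.18, 0.15, 0.13])

# Fit log(e) ≈ log(a) + b * log(k) by ordinary least squares in log-log space.
b, log_a = np.polyfit(np.log(k), np.log(e), deg=1)
a = np.exp(log_a)

print(f"fitted power law: e ≈ {a:.3f} * k^{b:.3f}")  # b < 0: error decreases with k

# Extrapolate to an unseen sample budget, e.g. k = 128.
print(f"predicted error at k = 128: {a * 128 ** b:.3f}")
```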
Stage 1: Training the Action Verifier: Given an imitation learning dataset, we sample N candidate actions per state from a generalist robot policy, and apply clustering to reduce them to K representative actions. We construct synthetic action comparisons and assign preferences based on the RMSE between each sampled action and the ground-truth action. This synthetic preference dataset is then used to fine-tune a VLM-based action verifier.
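A minimal sketch of this synthetic data-generation step is shown below. It assumes a 7-D action (translation, rotation, gripper), uses k-means for the clustering step, and replaces the generalist policy with a random stand-in sampler, so the interfaces and the exact pairing scheme are illustrative rather than the paper's implementation.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample_candidate_actions(num_samples: int) -> np.ndarray:
    """Stand-in for drawing N actions from a generalist robot policy.
    Each action is a 7-D vector: translation (3), rotation (3), gripper (1)."""
    return rng.normal(0.0, 0.1, size=(num_samples, 7))

def build_preference_pairs(gt_action: np.ndarray, n_samples: int = 64, k_clusters: int = 5):
    """Sample N candidates, reduce them to K representatives via clustering,
    and rank pairs of representatives by RMSE to the ground-truth action."""
    candidates = sample_candidate_actions(n_samples)

    # Cluster the N sampled actions down to K representative actions.
    kmeans = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit(candidates)
    representatives = kmeans.cluster_centers_

    # RMSE of each representative action w.r.t. the ground-truth action.
    rmse = np.sqrt(((representatives - gt_action) ** 2).mean(axis=1))

    # Every pair (i, j) becomes a synthetic comparison: the action with the
    # lower RMSE is labeled as preferred.
    pairs = []
    for i, j in itertools.combinations(range(k_clusters), 2):
        chosen, rejected = (i, j) if rmse[i] < rmse[j] else (j, i)
        pairs.append({"chosen": representatives[chosen], "rejected": representatives[rejected]})
    return pairs

gt = np.zeros(7)  # placeholder ground-truth action from the imitation-learning dataset
prefs = build_preference_pairs(gt)
print(f"{len(prefs)} synthetic action comparisons for one state")
```

The resulting comparisons would then be formatted as preference data for fine-tuning the VLM-based verifier.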
Stage 2: Scaling Test-Time Compute: At deployment, we sample N̂ initial actions from the generalist robot policy given the task instruction and observation. We fit a Gaussian distribution to the translation and rotation components of these actions and use majority voting to determine the gripper state. This yields an action proposal distribution from which we can draw K̂ candidate actions with negligible overhead. Finally, the fine-tuned VLM-based verifier evaluates these K̂ candidates and selects the optimal action.
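The sketch below walks through this selection procedure under simplifying assumptions: a full-covariance Gaussian over the six translation/rotation dimensions, a thresholded majority vote for the gripper, and a placeholder scoring function standing in for the fine-tuned VLM verifier (which in practice conditions on the instruction and observation).

```python
import numpy as np

rng = np.random.default_rng(0)

def verifier_score(candidates: np.ndarray) -> np.ndarray:
    """Stand-in for the fine-tuned VLM-based action verifier, which would score
    each candidate given the task instruction and the current observation."""
    return -np.linalg.norm(candidates[:, :6], axis=1)  # placeholder scoring

def select_action(initial_actions: np.ndarray, k_hat: int = 16) -> np.ndarray:
    """Test-time action selection (sketch).

    initial_actions: (N̂, 7) actions sampled from the VLA;
    columns 0-2 translation, 3-5 rotation, 6 gripper (0 = closed, 1 = open).
    """
    # Fit a Gaussian to the translation + rotation components of the N̂ samples.
    mean = initial_actions[:, :6].mean(axis=0)
    cov = np.cov(initial_actions[:, :6], rowvar=False) + 1e-6 * np.eye(6)

    # Majority voting determines the discrete gripper command.
    gripper = 1.0 if (initial_actions[:, 6] > 0.5).mean() >= 0.5 else 0.0

    # Cheaply draw K̂ candidate actions from the proposal distribution.
    proposals = rng.multivariate_normal(mean, cov, size=k_hat)
    candidates = np.hstack([proposals, np.full((k_hat, 1), gripper)])

    # The verifier ranks the candidates; execute the highest-scoring one.
    best = int(np.argmax(verifier_score(candidates)))
    return candidates[best]

initial = rng.normal(0.0, 0.05, size=(4, 7))  # e.g. N̂ = 4 actions from the VLA
print(select_action(initial))
```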
Scaling test-time compute leads to substantial improvements on OOD generalization tasks, achieving a 25% absolute improvement in average success rate.
RoboMonkey improves the precision of generalist robot policies in the SIMPLER environment, leading to a 9% higher average success rate on in-distribution tasks.
Fine-tuning both OpenVLA and the RoboMonkey action verifier on LIBERO-Long yields a 7% improvement in average success rate compared to fine-tuning OpenVLA alone.
Rollout comparisons across three tasks; in each case: OpenVLA ❌, V-GPS ❌, RoboMonkey ✅.
Repeated sampling can exploit KV cache optimizations and batch processing to achieve higher throughput than greedy decoding. We therefore extended SGLang to properly support OpenVLA. Our optimized implementation substantially outperforms the naive OpenVLA inference pipeline, achieving lower latency and significantly higher throughput across batch sizes.
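The sketch below is only a toy illustration of why batching helps, with gpt2 standing in for the VLA's language backbone and the serving-engine details (continuous batching, prefix/KV-cache reuse in SGLang) omitted: generating N sampled sequences in one batched call is substantially faster than issuing N sequential calls.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy illustration of batched repeated sampling. gpt2 stands in for the VLA's
# language backbone; the prompt stands in for the tokenized instruction + image.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

prompt = "What action should the robot take to pick up the carrot?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
gen_kwargs = dict(max_new_tokens=7,  # 7 tokens, mirroring a 7-DoF discretized action
                  do_sample=True, temperature=1.0,
                  pad_token_id=tokenizer.eos_token_id)

n = 16  # number of candidate action sequences to sample

# Sequential: n separate calls, each decoding a single sampled sequence.
start = time.perf_counter()
with torch.no_grad():
    for _ in range(n):
        model.generate(**inputs, num_return_sequences=1, **gen_kwargs)
sequential = time.perf_counter() - start

# Batched: one call producing all n sampled sequences at once.
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, num_return_sequences=n, **gen_kwargs)
batched = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s  batched: {batched:.2f}s  "
      f"speedup: {sequential / batched:.1f}x")
```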
Gaussian perturbation applied to a small set of actions is more efficient than naively sampling actions from robot policies when constructing action proposal distributions. RoboMonkey can sample and verify 16 candidate actions in 650 ms (or 1.5 Hz).
Average success rates across four tasks on SIMPLER as a function of synthetic dataset size. Scaling the dataset size (number of synthetic action comparisons) consistently improves the performance of the RoboMonkey verifier, leading to higher closed-loop success rates.
@InProceedings{pmlr-v305-kwok25a,
title = {RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models},
author = {Kwok, Jacky and Agia, Christopher and Sinha, Rohan and Foutter, Matt and Li, Shulu and Stoica, Ion and Mirhoseini, Azalia and Pavone, Marco},
booktitle = {Proceedings of The 9th Conference on Robot Learning},
pages = {3200--3217},
year = {2025},
editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
volume = {305},
series = {Proceedings of Machine Learning Research},
month = {27--30 Sep},
publisher = {PMLR},
pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/kwok25a/kwok25a.pdf},
url = {https://proceedings.mlr.press/v305/kwok25a.html},
abstract = {Vision-Language-Action (VLA) models, pre-trained on large-scale imitation learning datasets, have demonstrated remarkable capabilities in visuomotor control. However, these models exhibit diverse failure modes in unstructured real-world environments, limiting the widespread adoption of VLAs in robotics. Efforts to enhance the robustness and generalization of VLAs have gradually shifted from the pre-training to the post-training phase. Yet, the potential of scaling test-time compute remains underexplored. In this paper, we investigate test-time scaling for robotics through the lens of sampling and verification. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on this insight, we propose a synthetic data generation pipeline for training a Vision-Language Model (VLM)-based action verifier, and show that scaling the synthetic dataset consistently improves verification and downstream accuracy. We then introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbations and majority voting to construct an action proposal distribution, and then uses the VLM-based verifier to select the optimal action. Through extensive evaluations across simulated and real-world environments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% higher average success rate on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.}
}