Example tasks across Bridge V2, SIMPLER, and LIBERO.
We observe that action error consistently decreases as we scale the number of generated actions across multiple sampling approaches, given access to an oracle verifier. Repeatedly sampling actions from robot policies, applying Gaussian perturbation to a few sampled actions, and even randomly sampling action tokens all outperform single-attempt OpenVLA.
We also find that the relationship between action error and the number of samples generated through Gaussian perturbation follows an approximate power law across a range of VLA models, including CogACT, Octo, OpenVLA, and SpatialVLA.
For power-law fitting, we model the logarithm of the action error e as a linear function of the logarithm of the number of samples k: log(e) ≈ log(a) + b · log(k), where a and b are fitted coefficients.
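As a concrete illustration, such a fit reduces to a least-squares line in log-log space. The sketch below uses NumPy with made-up error values; the measurements and fitted coefficients reported above come from the actual evaluation data, not from this example.

```python
import numpy as np

# Illustrative (not actual) action-error measurements at increasing sample counts k.
k = np.array([1, 2, 4, 8, 16, 32, 64])
e = np.array([0.42, 0.31, 0.24, 0.18, 0.14, 0.11, 0.09])

# Fit log(e) ≈ log(a) + b * log(k) as a least-squares line in log-log space.
b, log_a = np.polyfit(np.log(k), np.log(e), deg=1)
a = np.exp(log_a)

print(f"fitted power law: e(k) ≈ {a:.3f} * k^{b:.3f}")
```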
Stage 1: Training the Action Verifier: Given an imitation learning dataset, we sample N candidate actions per state from a generalist robot policy, and apply clustering to reduce them to K representative actions. We construct synthetic action comparisons and assign preferences based on the RMSE between each sampled action and the ground-truth action. This synthetic preference dataset is then used to fine-tune a VLM-based action verifier.
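A minimal sketch of this preference-construction step is shown below, assuming 7-DoF end-effector actions and k-means clustering to obtain the K representatives; the helper name `build_preference_pairs` and the toy data are illustrative and do not reproduce the exact training pipeline.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def build_preference_pairs(sampled_actions, gt_action, num_clusters=5):
    """Cluster N sampled actions into K representatives and label pairwise
    preferences by RMSE to the ground-truth action (illustrative helper)."""
    # Reduce N candidate actions (e.g., 7-DoF: xyz, rpy, gripper) to K representatives.
    kmeans = KMeans(n_clusters=num_clusters, n_init="auto").fit(sampled_actions)
    representatives = kmeans.cluster_centers_

    # RMSE of each representative action against the ground-truth action.
    rmse = np.sqrt(((representatives - gt_action) ** 2).mean(axis=1))

    # Every pair of representatives becomes one synthetic comparison;
    # the lower-RMSE action is labeled as the preferred ("chosen") one.
    pairs = []
    for i, j in combinations(range(num_clusters), 2):
        chosen, rejected = (i, j) if rmse[i] < rmse[j] else (j, i)
        pairs.append({"chosen": representatives[chosen],
                      "rejected": representatives[rejected]})
    return pairs

# Toy usage with random 7-DoF actions standing in for policy samples.
rng = np.random.default_rng(0)
pairs = build_preference_pairs(rng.normal(size=(32, 7)), rng.normal(size=7))
print(len(pairs), "synthetic comparisons")
```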
Stage 2: Scaling Test-Time Compute: At deployment, we sample N̂ initial actions from the generalist robot policy based on the given task instruction and observation. We fit a Gaussian distribution to the translation and rotation components of these actions and use majority voting to determine the gripper state. This yields an action proposal distribution from which we can efficiently draw K̂ candidate actions with negligible overhead. Finally, we use the fine-tuned VLM-based verifier to evaluate these candidates and select the optimal action.
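The sketch below illustrates this propose-then-verify loop, assuming 7-DoF actions whose last dimension is a binary gripper command and a placeholder `verifier_score` callable standing in for the fine-tuned VLM verifier.

```python
import numpy as np

def build_proposal_and_select(initial_actions, verifier_score, num_candidates=16):
    """Sketch of Stage 2. Assumptions: actions are [x, y, z, roll, pitch, yaw, gripper];
    `verifier_score` is a stand-in for the fine-tuned VLM-based action verifier."""
    cont = initial_actions[:, :6]            # translation + rotation components
    grip = initial_actions[:, 6]             # gripper commands

    # Fit a Gaussian to the continuous components of the N̂ initial actions.
    mean, cov = cont.mean(axis=0), np.cov(cont, rowvar=False)

    # Majority vote decides the gripper state.
    gripper = 1.0 if (grip > 0.5).mean() > 0.5 else 0.0

    # Cheaply draw K̂ candidates from the action proposal distribution.
    rng = np.random.default_rng()
    candidates = rng.multivariate_normal(mean, cov, size=num_candidates)
    candidates = np.hstack([candidates, np.full((num_candidates, 1), gripper)])

    # Rank candidates with the verifier and return the highest-scoring action.
    scores = np.array([verifier_score(a) for a in candidates])
    return candidates[scores.argmax()]

# Toy usage: random initial actions and a dummy verifier that prefers small motions.
rng = np.random.default_rng(1)
best = build_proposal_and_select(rng.normal(size=(8, 7)),
                                 verifier_score=lambda a: -np.linalg.norm(a[:3]))
print(best)
```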
Scaling test-time compute leads to substantial improvements on out-of-distribution (OOD) generalization tasks, achieving a 25% absolute improvement.
RoboMonkey improves the precision of generalist robot policies in the SIMPLER environment, yielding an 8% higher average success rate on in-distribution tasks.
Fine-tuning both OpenVLA and the RoboMonkey action verifier yields a 7% improvement in average success rate on LIBERO-Long compared to fine-tuning OpenVLA alone.
Qualitative rollouts on four tasks: OpenVLA ❌, V-GPS ❌, RoboMonkey ✅.
Repeated sampling can exploit KV cache optimizations and batch processing to achieve higher throughput than greedy decoding. We therefore extended SGLang to support OpenVLA. Our optimized implementation substantially outperforms the naive OpenVLA inference pipeline, achieving lower latency and significantly higher throughput across batch sizes.
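To see why batching matters, the toy benchmark below compares n single-sample calls against one batched call using a stub policy with a fixed per-call "prefill" cost; `StubPolicy` is purely illustrative and is not the OpenVLA or SGLang interface.

```python
import time
import numpy as np

class StubPolicy:
    """Stand-in for a served VLA policy (not the actual OpenVLA/SGLang API).
    Each call pays a fixed "prefill" cost (image + instruction tokens) once,
    so drawing n actions in one batched call is cheaper than n separate calls."""
    PREFILL_S, DECODE_S = 0.05, 0.005

    def sample(self, n=1):
        time.sleep(self.PREFILL_S + n * self.DECODE_S)   # simulated latency
        return np.random.normal(size=(n, 7))             # n 7-DoF actions

policy, n = StubPolicy(), 16

t0 = time.time()
for _ in range(n):                    # naive: n independent single-sample calls
    policy.sample(1)
naive = time.time() - t0

t0 = time.time()
policy.sample(n)                      # batched: one call, shared prefill
batched = time.time() - t0

print(f"naive {naive:.2f}s vs batched {batched:.2f}s for {n} samples")
```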
Gaussian perturbation applied to a small set of actions is more efficient than naively sampling actions from robot policies when constructing action proposal distributions. RoboMonkey can sample and verify 16 candidate actions in 650 ms (or 1.5 Hz).
Average success rates across four tasks on SIMPLER as a function of synthetic dataset size. Scaling the dataset size (number of synthetic action comparisons) consistently improves the performance of the RoboMonkey verifier, leading to higher closed-loop success rates.
@article{kwok25robomonkey,
  title={RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models},
  author={Jacky Kwok and Christopher Agia and Rohan Sinha and Matt Foutter and Shulu Li and Ion Stoica and Azalia Mirhoseini and Marco Pavone},
  journal={arXiv preprint arXiv:2506.17811},
  year={2025}
}