RoboMonkey

Scaling Test-Time Sampling and Verification
for Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as a means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision-Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.

Inference-time Scaling Law


We observe that action error consistently decreases as we scale the number of generated actions across multiple sampling approaches, assuming access to an oracle verifier. Repeatedly sampling actions from robot policies, applying Gaussian perturbation to a few sampled actions, and even randomly sampling action tokens all outperform single-attempt OpenVLA. We also find that the relationship between action error and the number of samples generated through Gaussian perturbation follows an approximate power law across a range of VLA models, including CogACT, Octo, OpenVLA, and SpatialVLA. For power-law fitting, we model the logarithm of the action error e as a linear function of the logarithm of the number of samples k: log(e) ≈ log(a) + b · log(k), which is equivalent to e ≈ a · k^b.
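
As a concrete illustration, the sketch below fits such a power law by linear regression in log-log space. The sample budgets and error values here are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch: fitting a power law e ≈ a * k^b by linear regression in log-log space.
# The sample budgets `ks` and best-of-k errors `errors` below are hypothetical; in practice
# they would come from evaluating a VLA with an oracle verifier at each sampling budget.
import numpy as np

ks = np.array([1, 2, 4, 8, 16, 32, 64])                          # number of sampled actions k
errors = np.array([0.31, 0.24, 0.19, 0.15, 0.12, 0.10, 0.085])   # hypothetical action errors

# Fit log(e) = log(a) + b * log(k) with ordinary least squares.
b, log_a = np.polyfit(np.log(ks), np.log(errors), deg=1)
a = np.exp(log_a)
print(f"fitted power law: e ≈ {a:.3f} * k^{b:.3f}")

# Extrapolate the fitted curve to a larger sampling budget.
k_new = 128
print(f"predicted error at k={k_new}: {a * k_new ** b:.4f}")
```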

Approach

Stage 1: Training the Action Verifier: Given an imitation learning dataset, we sample N candidate actions per state from a generalist robot policy, and apply clustering to reduce them to K representative actions. We construct synthetic action comparisons and assign preferences based on the RMSE between each sampled action and the ground-truth action. This synthetic preference dataset is then used to fine-tune a VLM-based action verifier.
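
A minimal sketch of this data-generation step is shown below. It assumes a policy object with a `sample_actions` interface and 7-DoF end-effector actions; the function names, the choice of K-means for clustering, and the hyperparameters are illustrative rather than the exact pipeline used in the paper.

```python
# Sketch of synthetic preference generation for one state of an imitation learning dataset.
# `policy.sample_actions`, N, and K are illustrative placeholders, not the paper's actual API.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def build_preferences(policy, image, instruction, gt_action, N=64, K=8):
    # Sample N candidate actions for this state from the generalist robot policy.
    candidates = np.stack([policy.sample_actions(image, instruction) for _ in range(N)])

    # Cluster the candidates and keep K representative actions (cluster centers).
    reps = KMeans(n_clusters=K, n_init="auto").fit(candidates).cluster_centers_

    # Score each representative action by RMSE to the ground-truth action.
    rmse = np.sqrt(((reps - gt_action) ** 2).mean(axis=1))

    # Emit pairwise comparisons: the action with lower RMSE is preferred.
    pairs = []
    for i, j in combinations(range(K), 2):
        chosen, rejected = (i, j) if rmse[i] < rmse[j] else (j, i)
        pairs.append({"chosen": reps[chosen], "rejected": reps[rejected]})
    return pairs
```

Aggregating these comparisons over many states yields the synthetic preference dataset used to fine-tune the VLM-based verifier.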

Stage 2: Scaling Test-Time Compute: At deployment, we sample N̂ initial actions from the generalist robot policy based on the given task instruction and observation. We fit a Gaussian distribution to the translation and rotation components of these actions, and use majority voting to determine the gripper state. This creates an action proposal distribution from which we can efficiently sample K̂ candidate actions with negligible overhead. Finally, we use the fine-tuned VLM-based verifier to score these candidates and select the optimal action.
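
The sketch below outlines this procedure under simplifying assumptions: 7-DoF actions laid out as [dx, dy, dz, droll, dpitch, dyaw, gripper] with a binary gripper dimension, and illustrative `policy` / `verifier` interfaces. It is meant to convey the structure of the method, not its exact implementation.

```python
# Sketch of RoboMonkey's test-time sampling and verification loop for one timestep.
# `policy.sample_actions` and `verifier.score` are illustrative stand-ins for the VLA
# and the VLM-based verifier; thresholds and dimensions are assumptions.
import numpy as np

def select_action(policy, verifier, image, instruction, N_hat=8, K_hat=16):
    # 1) Draw a small set of initial actions from the generalist policy.
    init = np.stack([policy.sample_actions(image, instruction) for _ in range(N_hat)])

    # 2) Fit a Gaussian to the translation + rotation components (first 6 dims).
    mean = init[:, :6].mean(axis=0)
    cov = np.cov(init[:, :6], rowvar=False) + 1e-6 * np.eye(6)   # regularize covariance

    # 3) Majority-vote the discrete gripper command (assumed binary open/close).
    gripper = 1.0 if (init[:, 6] > 0.5).mean() >= 0.5 else 0.0

    # 4) Cheaply sample K_hat candidates from the action proposal distribution.
    pose = np.random.multivariate_normal(mean, cov, size=K_hat)
    candidates = np.hstack([pose, np.full((K_hat, 1), gripper)])

    # 5) Let the VLM-based verifier score the candidates; execute the best one.
    scores = [verifier.score(image, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```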

Experiments

Task Suites

Example tasks across Bridge V2, SIMPLER, and LIBERO.

① Bridge V2

Results

Scaling test-time compute leads to substantial improvements on out-of-distribution (OOD) generalization tasks, achieving a 25% absolute improvement.

② SIMPLER

Results

RoboMonkey improves the precision of generalist robot policies in the SIMPLER environment, leading to an 8% higher average success rate on in-distribution tasks.

③ LIBERO-LONG

Results

Fine-tuning both OpenVLA and the RoboMonkey action verifier results in a 7% improvement in average success rate on LIBERO-Long compared to fine-tuning OpenVLA alone.

Real-World Case Studies

Imprecise Grasping

OpenVLA ❌

V-GPS ❌

RoboMonkey ✅

Task Progression Failure

OpenVLA ❌

V-GPS ❌

RoboMonkey ✅

Collision

OpenVLA ❌

V-GPS ❌

RoboMonkey ✅

Stall in Place

OpenVLA ❌

V-GPS ❌

RoboMonkey ✅

How does RoboMonkey enable practical deployment of test-time scaling?

① VLA Serving Engine

Repeated sampling can exploit KV cache optimizations and batch processing to achieve higher throughput than greedy decoding. We therefore extended SGLang to properly support OpenVLA. Our optimized implementation substantially outperforms the naive OpenVLA inference pipeline, achieving lower latency and significantly higher throughput across batch sizes.
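
For intuition, the sketch below shows how repeated sampling might be issued as a single batched request so that the shared prompt is prefilled once and its KV cache is reused across all decoded samples. The endpoint, payload schema, and response format are hypothetical placeholders and do not reflect SGLang's or RoboMonkey's actual API.

```python
# Illustrative sketch only: batching repeated samples in one request to a serving engine.
# The URL, payload fields, and response keys are hypothetical, not a real API.
import requests

def sample_actions_batched(image_b64, instruction, n_samples=16,
                           url="http://localhost:30000/generate"):
    payload = {
        "image": image_b64,        # single observation, prefilled once on the server
        "prompt": instruction,     # shared prompt -> shared KV cache across samples
        "n": n_samples,            # decode n action-token sequences in one batch
        "temperature": 1.0,        # stochastic sampling rather than greedy decoding
    }
    resp = requests.post(url, json=payload, timeout=5.0)
    resp.raise_for_status()
    return resp.json()["actions"]  # hypothetical: n decoded 7-DoF actions
```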

② Gaussian Perturbation

Gaussian perturbation applied to a small set of sampled actions is more efficient than naively drawing all candidate actions from the robot policy when constructing the action proposal distribution. RoboMonkey can sample and verify 16 candidate actions in 650 ms (about 1.5 Hz).

How does scaling the synthetic training dataset impact downstream success rate?

Average success rates across four tasks on SIMPLER as a function of synthetic dataset size. Scaling the dataset size (number of synthetic action comparisons) consistently improves the performance of the RoboMonkey verifier, leading to higher closed-loop success rates.

BibTeX

@article{kwok25robomonkey,
    title={RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models},
    author={Jacky Kwok and Christopher Agia and Rohan Sinha and Matt Foutter and Shulu Li and Ion Stoica and Azalia Mirhoseini and Marco Pavone},
    journal={arXiv preprint arXiv:2506.17811},
    year={2025},
}