TEXEDO: Test-Time Scaling for Controller-aware Language-conditioned Humanoid Motion Generation

Jianuo Cao*1,2, Yuxin Chen*2, Yuzhen Song2,3, Masayoshi Tomizuka2, Chenran Li2, Ran (Thomas) Tian2
1Nanjing University    2University of California, Berkeley    3Southern University of Science and Technology
Abstract

Text-conditioned motion generation has become a promising interface for programming humanoid robots, but current generators are often trained on human motion datasets retargeted to robot morphology. While such data provides rich semantic and kinematic priors, it does not capture the nuances of the whole-body tracking controller, including balance, contact, actuation limits, and controller-specific failure modes. Therefore, generated motions can be semantically plausible yet difficult or impossible for the robot to execute. We propose TEXEDO, a test-time scaling framework for humanoid motion generation that improves motion quality without requiring a stronger underlying generator. Given a text prompt, our method samples multiple motions from the pre-trained text-conditioned generator and selects the best executable and task-aligned motion. The reward model combines a dynamic feasibility verifier, distilled from whole-body tracking rollouts to predict physical executability, with a semantic alignment verifier that measures text-motion alignment in a learned co-embedding space. Our pipeline treats dynamic feasibility as a hard constraint and semantic alignment as the selection objective within the feasible set. Across large-scale simulation studies and real-world deployment on a Unitree G1, we show that our test-time scaling strategy consistently improves both tracking fidelity and text alignment, demonstrating that grounded verification is an effective path toward deployable language-guided humanoid motion generation.
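The selection rule described in the abstract (feasibility as a hard constraint, alignment as the objective within the feasible set) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` type, the threshold `tau`, the callables `sample_motion`, `score_dyn`, and `score_text`, and the fallback used when no candidate is feasible are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Candidate:
    motion: Any      # placeholder for a generated motion sequence
    r_dyn: float     # dynamic feasibility score (higher = more executable)
    r_text: float    # text-motion alignment score (higher = better aligned)


def select_motion(candidates: Sequence[Candidate], tau: float = 0.5) -> Candidate:
    """Return the best-aligned candidate among those that pass the feasibility gate.

    `tau` is a hypothetical threshold on the feasibility score. If nothing
    passes, this sketch falls back to the most feasible candidate; the source
    does not specify that case.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    feasible = [c for c in candidates if c.r_dyn >= tau]
    if feasible:
        # Hard constraint satisfied: rank by semantic alignment only.
        return max(feasible, key=lambda c: c.r_text)
    # Assumed fallback: no feasible candidate, so take the least infeasible one.
    return max(candidates, key=lambda c: c.r_dyn)


def best_of_n(
    prompt: str,
    sample_motion: Callable[[str], Any],
    score_dyn: Callable[[Any], float],
    score_text: Callable[[str, Any], float],
    n: int = 32,
    tau: float = 0.5,
) -> Candidate:
    """Sample n motions for a prompt, score each with both verifiers, and select one."""
    candidates: List[Candidate] = []
    for _ in range(n):
        motion = sample_motion(prompt)
        candidates.append(
            Candidate(motion, score_dyn(motion), score_text(prompt, motion))
        )
    return select_motion(candidates, tau)
```

Note that the feasibility score takes only the motion as input while the alignment score also takes the prompt, which matches the description of the two verifiers; everything else about their interfaces is hypothetical.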

TEXEDO pipeline overview
Results
Fidelity validation for verifiers
Dual verifier design. Rdyn estimates dynamic feasibility from the motion alone and Rtext measures text–motion alignment in a learned embedding space.
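One common way to score alignment in a shared embedding space is cosine similarity between a text embedding and a motion embedding. The sketch below assumes that choice; the source states only that Rtext operates in a learned embedding space and does not name the similarity function or the encoders, so the vectors here stand in for the outputs of hypothetical text and motion encoders.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1].

    Assumed stand-in for a text-motion alignment score; returns 0.0 when
    either vector has zero norm so the score is always defined.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A score like this depends on both the prompt and the motion, whereas a feasibility score depends on the motion alone, which is why the two can disagree on the same candidate.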
Best-of-N curves for dynamic feasibility and semantic alignment.
TEXEDO balances the complementary strengths of Rdyn and Rtext.
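| Strategy | VLM-Judge ↑ | Succ ↑ | E_mpjpe | E_acc | E_vel | Q* ↑ |
|---|---|---|---|---|---|---|
| Base (N=1) | 5.722 | 0.873 | 44.34 | 6.09 | 11.82 | 0.829 |
| Rdyn-only (N=32) | 4.924 | 0.990 | 38.15 | 3.30 | 5.91 | 0.945 |
| Rtext-only (N=32) | 6.110 | 0.885 | 42.97 | 4.90 | 10.03 | 0.847 |
| TEXEDO (N=32) | 6.054 | 0.984 | 39.09 | 4.26 | 7.78 | 0.926 |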
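The balance claimed for this table can be checked from the reported numbers. The short script below copies the three N=32 rows and, for each metric, computes where TEXEDO sits between the two single-verifier strategies; the `position` helper and its 0-to-1 convention are introduced here for illustration only.

```python
# Rows copied from the table above (N=32 strategies only).
METRICS = ["vlm_judge", "succ", "e_mpjpe", "e_acc", "e_vel", "q_star"]
ROWS = {
    "rdyn_only":  [4.924, 0.990, 38.15, 3.30, 5.91, 0.945],
    "rtext_only": [6.110, 0.885, 42.97, 4.90, 10.03, 0.847],
    "texedo":     [6.054, 0.984, 39.09, 4.26, 7.78, 0.926],
}


def value(strategy: str, metric: str) -> float:
    """Look up one table cell by strategy and metric name."""
    return ROWS[strategy][METRICS.index(metric)]


def lies_between(metric: str) -> bool:
    """True if TEXEDO's value falls between the two single-verifier values."""
    a, b = value("rdyn_only", metric), value("rtext_only", metric)
    return min(a, b) <= value("texedo", metric) <= max(a, b)


def position(metric: str) -> float:
    """0.0 means TEXEDO equals Rdyn-only; 1.0 means it equals Rtext-only."""
    a, b = value("rdyn_only", metric), value("rtext_only", metric)
    return (value("texedo", metric) - a) / (b - a)


for m in METRICS:
    print(f"{m:10s} between={lies_between(m)} position={position(m):.2f}")
```

On every reported metric TEXEDO falls between the two single-verifier strategies. Its VLM-Judge score sits close to the Rtext-only end, while its success rate sits close to the Rdyn-only end.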
Zero-shot transfer to Kimodo.
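| Method | VLM-Judge ↑ | Succ ↑ | E_mpjpe | E_acc | E_vel | Q* ↑ |
|---|---|---|---|---|---|---|
| Kimodo (Base, N=1) | 4.823 | 0.937 | 38.62 | 1.84 | 5.33 | 0.918 |
| Kimodo + Rdyn-only (N=32) | 5.020 | 0.955 | 35.47 | 1.42 | 4.28 | 0.937 |
| Kimodo + Rtext-only (N=32) | 5.717 | 0.942 | 38.02 | 1.87 | 5.43 | 0.919 |
| Kimodo + TEXEDO (N=32) | 5.381 | 0.954 | 37.49 | 1.70 | 4.97 | 0.935 |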
Out-of-distribution generalization on BONES-SEED prompts.
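| Setting | Method | VLM-Judge ↑ | Succ ↑ | E_mpjpe | E_acc | E_vel | Q* ↑ |
|---|---|---|---|---|---|---|---|
| ID | Base (N=1) | 5.722 | 0.873 | 44.34 | 6.09 | 11.82 | 0.829 |
| ID | TEXEDO (N=32) | 6.054 | 0.984 | 39.09 | 4.26 | 7.78 | 0.926 |
| OOD | Base (N=1) | 4.116 | 0.860 | 36.22 | 8.36 | 15.71 | 0.804 |
| OOD | TEXEDO (N=32) | 4.401 | 0.980 | 26.64 | 3.65 | 3.88 | 0.942 |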
Real-Robot Deployment
Citation
@misc{cao2026texedotesttime,
      title={TEXEDO: Test Time Scaling for Controller-aware Language-conditioned Humanoid Motion Generation},
      author={Jianuo Cao and Yuxin Chen and Yuzhen Song and Masayoshi Tomizuka and Chenran Li and Thomas Tian},
      year={2026},
      eprint={2606.22998},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.22998},
}