Simulation-first benchmark for humanoid control

HumanoidArena:
Benchmarking Egocentric Hierarchical Whole-body Learning.

Taowen Wang¹*, Zikang Xie¹*, Bin Yang¹*, Yunheng Wang¹, Zizhao Yuan¹, Yuetong Fang¹, Yixiao Feng¹, Yichi Wang², Xingyu Chen¹, Haodong Chen³, Qiwei Wu¹, Weisheng Xu¹, Lihan Chen⁴, Lusong Li⁵, Zecui Zeng⁵, Renjing Xu¹

¹ The Hong Kong University of Science and Technology (Guangzhou); ² Beijing University of Technology; ³ Harbin Institute of Technology, Shenzhen; ⁴ Shenzhen MSU-BIT University; ⁵ JD Explore Academy. * Equal contribution

HumanoidArena is a simulation-first benchmark for egocentric hierarchical whole-body learning. It studies seven leg-critical HOI/HSI tasks where success depends on coordinated perception, foot placement, balance, posture adjustment, and whole-body motion, while exposing both perturbation-conditioned generalization and GMT-conditioned transfer.


Task suite: 7 leg-critical HOI/HSI tasks
Evaluation axes: Visual, Semantic, Execution, Cross-GMT
Core stack: TWIST2, SONIC, IsaacLab, LeRobot

Overview figure

Abstract

Paper abstract.

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision-making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary for task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. We benchmark representative imitation-learning and VLA-style policies under this shared interface. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.


Pipeline

From teleop to benchmark.

1. Shared capture and retargeting

The operator receives the humanoid's egocentric video stream through a PICO headset. Human motion is captured and retargeted through GMR into a shared 35D robot-space reference signal for real-time consumption.

Third-person

Sim ego view

2. Backend-specific action interpretation

Inside Isaac Lab, backend-specific action providers interpret that shared signal into executable G1 targets using either TWIST2 or SONIC, exposing different low-level tracking dynamics under one common upstream interface.
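One way to picture "one common upstream interface, different backends" is a small provider abstraction. The sketch below is only illustrative: `ActionProvider`, `PlaceholderProvider`, `make_provider`, and the `scale` field are hypothetical names, and the placeholder arithmetic stands in for the real TWIST2 or SONIC tracking logic, which is not described here. Only the 35D reference size comes from the text.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

REFERENCE_DIM = 35  # shared robot-space reference size stated in the text


class ActionProvider(Protocol):
    """Hypothetical interface every backend-specific provider would satisfy."""

    name: str

    def interpret(self, reference: Sequence[float]) -> List[float]:
        ...


@dataclass
class PlaceholderProvider:
    """Stand-in backend. A real provider would wrap the TWIST2 or SONIC tracker."""

    name: str
    scale: float  # placeholder for backend-specific processing, not a real parameter

    def interpret(self, reference: Sequence[float]) -> List[float]:
        # Every backend consumes the same 35D signal, so validate it once here.
        if len(reference) != REFERENCE_DIM:
            raise ValueError(f"expected {REFERENCE_DIM} values, got {len(reference)}")
        return [self.scale * value for value in reference]


def make_provider(backend: str) -> ActionProvider:
    """Select a provider by name so upstream code stays backend-agnostic."""
    providers = {
        "twist2": PlaceholderProvider(name="twist2", scale=1.0),
        "sonic": PlaceholderProvider(name="sonic", scale=0.5),
    }
    if backend not in providers:
        raise ValueError(f"unknown backend: {backend!r}")
    return providers[backend]
```

The point of the design is that the teleop loop only ever calls `interpret` on the shared signal, so swapping the tracker changes the low-level dynamics without touching capture or retargeting.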

3. Recording and canonicalization

A recording manager serializes episodes into NPZ archives with egocentric observations, state, action, and replayable trajectory data. Demonstrations are normalized into a 64D canonical state and 40D intermediate whole-body action.
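As a rough illustration of what such an archive could look like, the following sketch writes and reads one episode with NumPy. The array keys (`state`, `action`, `ego_rgb`) and the tiny image size are assumptions made for the example; the actual field names and layout used by the recording manager are not given here. Only the 64D state and 40D action sizes come from the text.

```python
import io

import numpy as np

STATE_DIM = 64   # canonical state size stated in the text
ACTION_DIM = 40  # intermediate whole-body action size stated in the text


def save_episode(file, states, actions, ego_frames):
    """Validate shapes, then write one episode as a compressed NPZ archive."""
    states = np.asarray(states, dtype=np.float32)
    actions = np.asarray(actions, dtype=np.float32)
    ego_frames = np.asarray(ego_frames, dtype=np.uint8)
    if states.ndim != 2 or states.shape[1] != STATE_DIM:
        raise ValueError(f"state must have shape (T, {STATE_DIM})")
    if actions.shape != (states.shape[0], ACTION_DIM):
        raise ValueError(f"action must have shape (T, {ACTION_DIM})")
    if ego_frames.shape[0] != states.shape[0]:
        raise ValueError("one egocentric frame is expected per time step")
    # Key names are hypothetical; they are not the benchmark's schema.
    np.savez_compressed(file, state=states, action=actions, ego_rgb=ego_frames)


def load_episode(file):
    """Read every array of an episode archive into a plain dict."""
    with np.load(file) as data:
        return {key: data[key] for key in data.files}


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
save_episode(buffer, np.zeros((5, 64)), np.zeros((5, 40)), np.zeros((5, 8, 8, 3)))
buffer.seek(0)
episode = load_episode(buffer)
print({key: value.shape for key, value in episode.items()})
```

Checking dimensions at write time keeps malformed episodes out of the dataset before conversion.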

Left wrist

Main ego

Right wrist

4. Conversion, training, evaluation

Raw recordings are converted into LeRobot-compatible datasets, then used to train high-level policies that are evaluated under in-GMT, cross-GMT, visual, semantic, and execution protocols.
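The conversion step can be thought of as flattening each episode into per-frame rows. The sketch below does not call the LeRobot library; it only builds plain dictionaries whose column names (`observation.state`, `action`, `episode_index`, `frame_index`, `timestamp`, `index`) follow the naming commonly seen in LeRobot-style datasets, which is an assumption about the target schema. The function name `episode_to_rows` and the `fps` argument are hypothetical.

```python
from typing import Dict, List, Sequence


def episode_to_rows(
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    episode_index: int,
    fps: float,
    start_index: int = 0,
) -> List[Dict[str, object]]:
    """Flatten one episode into per-frame rows with bookkeeping columns."""
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    if fps <= 0:
        raise ValueError("fps must be positive")
    rows = []
    for frame_index, (state, action) in enumerate(zip(states, actions)):
        rows.append(
            {
                "observation.state": list(state),
                "action": list(action),
                "episode_index": episode_index,
                "frame_index": frame_index,            # position inside the episode
                "timestamp": frame_index / fps,        # seconds since episode start
                "index": start_index + frame_index,    # position in the whole dataset
            }
        )
    return rows
```

Passing the running `start_index` from one episode to the next keeps the global `index` column contiguous across the dataset.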

SONIC inference additionally provides an experimental latent-output mode for VLA policies trained on SONIC encoder representations. In this mode, the policy outputs 64D SONIC latent actions instead of the default 40D semantic whole-body commands. The raw dataset also preserves SONIC latent64 representations for each demonstration episode.
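A small guard can make the difference between the two output modes explicit at inference time. This is a minimal sketch under stated assumptions: `ActionMode` and `check_action` are hypothetical names, and the backend strings are illustrative. Only the 40D and 64D sizes, and the fact that the latent mode belongs to SONIC inference, come from the text.

```python
from enum import Enum
from typing import Sequence


class ActionMode(Enum):
    WHOLE_BODY = 40    # default semantic whole-body command
    SONIC_LATENT = 64  # experimental SONIC latent action


def check_action(action: Sequence[float], mode: ActionMode, backend: str) -> None:
    """Raise if the policy output does not match the selected output mode."""
    if mode is ActionMode.SONIC_LATENT and backend != "sonic":
        raise ValueError("latent-output mode is only described for the SONIC backend")
    if len(action) != mode.value:
        raise ValueError(
            f"{mode.name} expects {mode.value} values, got {len(action)}"
        )
```

Running this check before handing the action to the tracker turns a silent dimension mismatch into an immediate error.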

Capture

PICO egocentric stream and human motion capture feed the shared teleop loop.

Retarget

GMR produces a robot-space reference signal with shared policy-facing semantics.

Execute

TWIST2 or SONIC interprets and tracks the action through its own backend logic.

Record

Isaac Lab stores NPZ episodes with synchronized observations, state, action, and video.

Convert

Conversion tools normalize recordings into LeRobot-compatible datasets.

Benchmark

Policies are trained and stress-tested across perturbation and GMT-conditioned evaluation.

Evaluation protocol

Four tests plus cross-GMT.

Base

Reference evaluation without additional perturbation, used as the default task performance baseline.

Semantic

Measures sensitivity to higher-level task or environment semantic variation.

Vision

Targets robustness to visual observation shift and perception-side difficulty.

Execution

Focuses on control-side degradation and policy-to-tracker execution mismatch.

Cross-GMT deployment

Evaluates how policies transfer across different GMT backends. Plots and summary tables for these results are not yet available on this page.
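The protocol above amounts to grouping rollout outcomes by training backend, evaluation backend, and test condition. The following sketch shows one way to aggregate success rates along those axes; the function names and the tuple layout of an episode record are assumptions, and no numbers here are benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CONDITIONS = ("base", "semantic", "vision", "execution")

# (training GMT, evaluation GMT, condition, success flag)
Episode = Tuple[str, str, str, bool]


def success_rates(episodes: Iterable[Episode]) -> Dict[Tuple[str, str, str], float]:
    """Return the success rate for each (train GMT, eval GMT, condition) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, total]
    for train_gmt, eval_gmt, condition, success in episodes:
        if condition not in CONDITIONS:
            raise ValueError(f"unknown condition: {condition!r}")
        cell = counts[(train_gmt, eval_gmt, condition)]
        cell[0] += int(success)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def is_cross_gmt(train_gmt: str, eval_gmt: str) -> bool:
    """A cell is cross-GMT when the policy is deployed on a different tracker."""
    return train_gmt != eval_gmt
```

Cells where `is_cross_gmt` is false give the in-GMT reference, and the remaining cells measure transfer to the other tracker.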

Example task

P&PBox makes the perturbation axes concrete.

Execution Adaptive

Base

Execution

Compared with the base setting: The shelf placement range along the x-axis is expanded, introducing larger spatial variations.

Semantic Adaptive

Base

Semantic

Compared with the base setting: A visually similar distractor object is added beside the target box, evaluating target understanding beyond appearance matching.

Vision Adaptive

Base

Vision case 1

Vision case 2

Compared with the base setting: The lighting direction is randomly changed while keeping the task layout unchanged, evaluating robustness to visual appearance variations.
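The three P&PBox perturbations can be expressed as small changes to a base scene configuration. This is a sketch under stated assumptions: `PnPBoxConfig`, `perturb`, `sample_shelf_x`, the numeric shelf range, and the widening factor are all hypothetical placeholders, not the benchmark's actual values.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PnPBoxConfig:
    shelf_x_range: Tuple[float, float] = (-0.05, 0.05)  # metres, placeholder values
    add_distractor: bool = False                         # visually similar second box
    randomize_light_direction: bool = False


def perturb(axis: str, base: PnPBoxConfig = PnPBoxConfig(), widen: float = 2.0) -> PnPBoxConfig:
    """Return the configuration for one evaluation axis, derived from the base."""
    if axis == "base":
        return base
    if axis == "execution":
        # Expand the shelf placement range about its centre.
        low, high = base.shelf_x_range
        centre = (low + high) / 2
        half_width = (high - low) / 2 * widen
        return replace(base, shelf_x_range=(centre - half_width, centre + half_width))
    if axis == "semantic":
        return replace(base, add_distractor=True)
    if axis == "vision":
        # Layout is untouched; only the lighting direction varies.
        return replace(base, randomize_light_direction=True)
    raise ValueError(f"unknown axis: {axis!r}")


def sample_shelf_x(config: PnPBoxConfig, rng: random.Random) -> float:
    """Draw a shelf x-position uniformly from the configured range."""
    return rng.uniform(*config.shelf_x_range)
```

Deriving every variant from one frozen base keeps each axis isolated: a vision run changes lighting only, and an execution run changes the shelf range only.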

Timeout

* Videos shown here are sampled at 1 FPS to reduce storage and loading cost. Inference uses the native video stream at the same frame rate as training.

Resources

Paper, code, and more.

Paper

arXiv preprint, 2026.

arXiv

Code

Codebase, training pipeline, and evaluation scripts.

GitHub

Dataset

LeRobot-compatible egocentric demonstration dataset.

HuggingFace ModelScope

Models

Released policy checkpoints for HumanoidArena evaluation.

HuggingFace ModelScope

Assets

Simulation assets required by the Isaac Lab environments.

Google Drive

Raw Data

Complete raw demonstration data for all tasks.

Complete P&PBox

BibTeX

@article{wang2026humanoidarena,
  title={HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning},
  author={Wang, Taowen and Xie, Zikang and Yang, Bin and others},
  journal={arXiv preprint arXiv:2606.17833},
  year={2026}
}

Contact

Add either WeChat account (xiezhikang2003 or Chanw15) to join the community group.

Seven tasks across both GMTs.

Football

DoubleDesk

P&PBox

OpenDoor

SitSofa

Boxing

VisNavi

Success, failure, recovery.

P&PBox inference

Football inference

OpenDoor inference
