HumanoidArena

Simulation-first benchmark for humanoid control

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

Taowen Wang¹, Zikang Xie¹, Bin Yang¹, Yunheng Wang¹, Zizhao Yuan¹, Yuetong Fang¹, Yixiao Feng¹, Yichi Wang², Xingyu Chen¹, Haodong Chen³, Qiwei Wu¹, Weisheng Xu¹, Lihan Chen⁴, Lusong Li⁵, Zecui Zeng⁵, Renjing Xu¹

¹ The Hong Kong University of Science and Technology (Guangzhou); ² Beijing University of Technology; ³ Harbin Institute of Technology, Shenzhen; ⁴ Shenzhen MSU-BIT University; ⁵ JD Explore Academy

HumanoidArena is a simulation-first benchmark for egocentric hierarchical whole-body learning. It studies seven leg-critical HOI/HSI tasks where success depends on coordinated perception, foot placement, balance, posture adjustment, and whole-body motion, while exposing both perturbation-conditioned generalization and GMT-conditioned transfer.

Task suite
7 leg-critical HOI/HSI tasks
Evaluation axes
Visual, Semantic, Execution, Cross-GMT
Core stack
TWIST2, Isaac Lab, LeRobot

Overview figure

[Figure: HumanoidArena overview]

Abstract

Paper abstract.

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision-making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary for task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. We benchmark representative imitation-learning and VLA-style policies under this shared interface. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.

Task gallery

Seven tasks across both GMTs.

HOI

Football

Leg-object interaction with ball approach, kick timing, and balance-aware whole-body control.

TWIST2
SONIC

HOI

DoubleDesk

Cross-surface object transfer with stepping, reach planning, and whole-body reorientation.

TWIST2
SONIC

HOI

P&PBox

Leg-assisted high placement requiring crouch, posture change, and shelf-height interaction.

TWIST2
SONIC

HSI

OpenDoor

Handle manipulation, body turning, and doorway traversal under egocentric perception.

TWIST2
SONIC

HSI

SitSofa

Obstacle-aware navigation and stable sitting transition with lower-body alignment constraints.

TWIST2
SONIC

HSI

Boxing

Height-adaptive striking that requires crouch control and coordinated whole-body adjustment.

TWIST2
SONIC

HSI

VisNavi

Obstacle-aware visual navigation in constrained scenes using an egocentric camera stream.

TWIST2
SONIC

Pipeline

From teleop to benchmark.

1. Shared capture and retargeting

The operator receives the humanoid's egocentric video stream through a PICO headset. Human motion is captured and retargeted through GMR into a shared 35D robot-space reference signal for real-time consumption.

Third-person
Sim ego view

2. Backend-specific action interpretation

Inside Isaac Lab, backend-specific action providers interpret that shared signal into executable G1 targets using either TWIST2 or SONIC, exposing different low-level tracking dynamics under one common upstream interface.
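The idea of one upstream signal with several backend-specific interpreters can be sketched as a small interface. This is a minimal illustration, not the benchmark's code: the ActionProvider protocol, the PassthroughProvider class, the gain parameter, and the make_provider registry are all hypothetical names, and the toy scaling stands in for tracker logic that is not described on this page. Only the 35D reference size comes from the text above.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

REFERENCE_DIM = 35  # shared robot-space reference size stated on this page


class ActionProvider(Protocol):
    """Hypothetical interface: turns the shared reference into backend targets."""

    name: str

    def interpret(self, reference: Sequence[float]) -> dict: ...


@dataclass
class PassthroughProvider:
    """Toy stand-in for a backend; real TWIST2/SONIC logic is not shown here."""

    name: str
    gain: float = 1.0  # illustrative scalar, not a real tracker parameter

    def interpret(self, reference: Sequence[float]) -> dict:
        if len(reference) != REFERENCE_DIM:
            raise ValueError(f"expected {REFERENCE_DIM} values, got {len(reference)}")
        return {"backend": self.name, "targets": [self.gain * x for x in reference]}


def make_provider(backend: str) -> ActionProvider:
    # Both entries consume the same upstream signal but may react differently.
    registry = {
        "twist2": PassthroughProvider("twist2", gain=1.0),
        "sonic": PassthroughProvider("sonic", gain=0.5),
    }
    return registry[backend]
```

Because callers only depend on interpret, swapping the backend string is the only change needed upstream, which is the property the cross-GMT protocol relies on.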

3. Recording and canonicalization

A recording manager serializes episodes into NPZ archives with egocentric observations, state, action, and replayable trajectory data. Demonstrations are normalized into a 64D canonical state and 40D intermediate whole-body action.

Left wrist
Main ego
Right wrist
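As a rough illustration of the recording step, the sketch below writes one episode to an NPZ archive with NumPy and reads it back. The key names (state, action, ego_rgb) and the helper functions are assumptions made for this example; the benchmark's actual archive layout is not specified on this page. Only the 64D state and 40D action sizes are taken from the text.

```python
import io

import numpy as np

STATE_DIM = 64   # canonical state size stated on this page
ACTION_DIM = 40  # intermediate whole-body action size stated on this page


def save_episode(buffer, states, actions, ego_frames):
    """Write one episode to an NPZ archive; key names are illustrative."""
    states = np.asarray(states, dtype=np.float32)
    actions = np.asarray(actions, dtype=np.float32)
    frames = np.asarray(ego_frames, dtype=np.uint8)
    if states.shape[1:] != (STATE_DIM,) or actions.shape[1:] != (ACTION_DIM,):
        raise ValueError("unexpected state/action dimensionality")
    if not (len(states) == len(actions) == len(frames)):
        raise ValueError("streams must be time-aligned")
    np.savez_compressed(buffer, state=states, action=actions, ego_rgb=frames)


def load_episode(buffer):
    """Read every array in the archive into a plain dict."""
    with np.load(buffer) as data:
        return {key: data[key] for key in data.files}


# Usage with an in-memory buffer; a file path works the same way.
buf = io.BytesIO()
save_episode(buf, np.zeros((5, 64)), np.ones((5, 40)), np.zeros((5, 4, 4, 3)))
buf.seek(0)
episode = load_episode(buf)
```

Checking shapes and stream lengths at write time keeps misaligned demonstrations out of the dataset before any conversion happens.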

4. Conversion, training, evaluation

Raw recordings are converted into LeRobot-compatible datasets, then used to train high-level policies that are evaluated under in-GMT, cross-GMT, visual, semantic, and execution protocols.
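Conversion from episode archives to a frame-level dataset can be pictured as a flattening step. The sketch below does not use the LeRobot API; it only shows the general shape of such a conversion in plain Python. The function name, the input dict layout, and the column names are assumptions chosen for illustration, and fps is a caller-supplied parameter because the recording rate is not given on this page.

```python
def episodes_to_rows(episodes, fps):
    """Flatten a list of episodes into per-frame rows.

    Each episode is a dict with 'state' and 'action' (equal-length lists)
    and a 'task' string. Column names are assumptions for this sketch.
    """
    rows = []
    for episode_index, episode in enumerate(episodes):
        states, actions = episode["state"], episode["action"]
        if len(states) != len(actions):
            raise ValueError(f"episode {episode_index}: state/action length mismatch")
        for frame_index, (state, action) in enumerate(zip(states, actions)):
            rows.append({
                "index": len(rows),              # global frame counter
                "episode_index": episode_index,
                "frame_index": frame_index,      # position within the episode
                "timestamp": frame_index / fps,  # seconds since episode start
                "task": episode["task"],
                "observation.state": state,
                "action": action,
            })
    return rows
```

Keeping both a global index and a per-episode frame index makes it straightforward to sample individual frames for training while still reconstructing whole episodes for replay.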

01

Capture

PICO egocentric stream and human motion capture feed the shared teleop loop.

02

Retarget

GMR produces a robot-space reference signal with shared policy-facing semantics.

03

Execute

TWIST2 or SONIC interprets and tracks the action through its own backend logic.

04

Record

Isaac Lab stores NPZ episodes with synchronized observations, state, action, and video.

05

Convert

Conversion tools normalize recordings into LeRobot-compatible datasets.

06

Benchmark

Policies are trained and stress-tested across perturbation and GMT-conditioned evaluation.

Evaluation protocol

Four tests plus cross-GMT.

Base

Reference evaluation without additional perturbation, used as the default task performance baseline.

Semantic

Measures sensitivity to higher-level task or environment semantic variation.

Vision

Targets robustness to visual observation shift and perception-side difficulty.

Execution

Focuses on control-side degradation and policy-to-tracker execution mismatch.

Cross-GMT deployment

Results on how policies transfer across different GMT backends will be presented here; plots and summary tables are not yet available.
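One way to organize such an evaluation is a grid over training backend, execution backend, and protocol, with a success rate per cell. The sketch below is illustrative only: run_episode is a hypothetical user-supplied callable that rolls out one episode and returns a boolean, and nothing here reflects the benchmark's actual evaluation scripts or results.

```python
from itertools import product


def success_rate(outcomes):
    """Fraction of successful episodes; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


def cross_gmt_table(run_episode, backends, protocols, episodes_per_cell):
    """Evaluate policies trained on one backend while executing on another.

    run_episode(train_gmt, eval_gmt, protocol, seed) -> bool is a hypothetical
    callable; it should roll out a single episode and report success.
    """
    table = {}
    for train_gmt, eval_gmt, protocol in product(backends, backends, protocols):
        outcomes = [
            run_episode(train_gmt, eval_gmt, protocol, seed)
            for seed in range(episodes_per_cell)
        ]
        table[(train_gmt, eval_gmt, protocol)] = success_rate(outcomes)
    return table
```

Cells where train_gmt equals eval_gmt give the in-GMT reference, and the off-diagonal cells measure transfer with the high-level policy held fixed.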

Example task

Football makes the perturbation axes concrete.

Base

Default task setup with the original ball placement range, scene assets, and nominal lighting.

Vision

Lighting direction changes while the task and target identity remain fixed, stressing appearance robustness.

Semantic

Semantics-preserving asset changes, such as the football goal appearance, probe whether the policy grounds the right target.

Execution

The ball initialization range is expanded, forcing different approach geometry, foot placement, and kick timing.

Cross-GMT

Keep the trained high-level policy fixed but swap the low-level execution backend, measuring whether the same intermediate whole-body actions stay valid under TWIST2 and SONIC.
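The Football axes above can be read as a configuration in which each axis changes exactly one factor relative to Base. The sketch below encodes that idea under stated assumptions: the FootballScene fields, the numeric ranges, the asset names, and the lighting angle are placeholders invented for illustration, not values from the benchmark.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FootballScene:
    ball_x_range: tuple = (0.8, 1.2)    # metres; placeholder values
    ball_y_range: tuple = (-0.2, 0.2)   # metres; placeholder values
    light_azimuth_deg: float = 0.0      # placeholder nominal lighting
    goal_asset: str = "goal_default"    # placeholder asset name


def widen(value_range, margin):
    """Expand a (low, high) range symmetrically by margin."""
    return (value_range[0] - margin, value_range[1] + margin)


def make_variant(base, axis):
    """Return the scene for one evaluation axis; each axis changes one factor."""
    if axis == "base":
        return base
    if axis == "vision":
        return replace(base, light_azimuth_deg=90.0)
    if axis == "semantic":
        return replace(base, goal_asset="goal_alternate")
    if axis == "execution":
        return replace(
            base,
            ball_x_range=widen(base.ball_x_range, 0.3),
            ball_y_range=widen(base.ball_y_range, 0.3),
        )
    raise ValueError(f"unknown axis: {axis}")


def sample_ball(scene, rng):
    """Draw an initial ball position from the scene's ranges."""
    return (rng.uniform(*scene.ball_x_range), rng.uniform(*scene.ball_y_range))
```

Cross-GMT is deliberately absent from make_variant: it leaves the scene untouched and changes only the low-level execution backend.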

Results

Success, failure, recovery.

Scenario 1

P&PBox inference

One successful rollout and one failure case, compared on posture control, shelf approach, and task completion.

Success
Timeout

Scenario 2

Football inference

A successful kick sequence contrasted with a failure sequence, highlighting contact timing and execution sensitivity.

Success
Failure

Scenario 3

OpenDoor inference

A successful door traversal compared with an incomplete attempt that breaks down at handle use or body alignment.

Success
Timeout

* Videos shown here are downsampled to 1 FPS to reduce storage and loading cost. Inference uses the native video stream at the same frame rate as training.

Resources

Paper, code, and more.

Code

Codebase, training pipeline, and evaluation scripts.

Coming soon

Dataset

LeRobot-compatible egocentric demonstration dataset.

Coming soon

BibTeX

Citation snippet and author metadata for the paper.

Coming soon