Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision-making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary for task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. We benchmark representative imitation-learning and VLA-style policies under this shared interface. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.
Simulation-first benchmark for humanoid control
HumanoidArena:
Benchmarking Egocentric Hierarchical Whole-body Learning.
HumanoidArena is a simulation-first benchmark for egocentric hierarchical whole-body learning. It studies seven leg-critical HOI/HSI tasks where success depends on coordinated perception, foot placement, balance, posture adjustment, and whole-body motion, while exposing both perturbation-conditioned generalization and GMT-conditioned transfer.
- Task suite
- 7 leg-critical HOI/HSI tasks
- Evaluation axes
- Visual, Semantic, Execution, Cross-GMT
- Core stack
- TWIST2, Isaac Lab, LeRobot
Overview figure
Abstract
Paper abstract.
Task gallery
Seven tasks across both GMTs.
HOI
Football
Leg-object interaction with ball approach, kick timing, and balance-aware whole-body control.
HOI
DoubleDesk
Cross-surface object transfer with stepping, reach planning, and whole-body reorientation.
HOI
P&PBox
Leg-assisted high placement requiring crouch, posture change, and shelf-height interaction.
HSI
OpenDoor
Handle manipulation, body turning, and doorway traversal under egocentric perception.
HSI
SitSofa
Obstacle-aware navigation and stable sitting transition with lower-body alignment constraints.
HSI
Boxing
Height-adaptive striking that requires crouch control and coordinated whole-body adjustment.
HSI
VisNavi
Obstacle-aware visual navigation in constrained scenes using an egocentric camera stream.
Pipeline
From teleop to benchmark.
1. Shared capture and retargeting
The operator receives the humanoid's egocentric video stream through a PICO headset. Human motion is captured and retargeted through GMR into a shared 35D robot-space reference signal for real-time consumption.
2. Backend-specific action interpretation
Inside Isaac Lab, backend-specific action providers interpret that shared signal into executable G1 targets using either TWIST2 or SONIC, exposing different low-level tracking dynamics under one common upstream interface.
3. Recording and canonicalization
A recording manager serializes episodes into NPZ archives with egocentric observations, state, action, and replayable trajectory data. Demonstrations are normalized into a 64D canonical state and 40D intermediate whole-body action.
4. Conversion, training, evaluation
Raw recordings are converted into LeRobot-compatible datasets, then used to train high-level policies that are evaluated under in-GMT, cross-GMT, visual, semantic, and execution protocols.
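Steps 1 and 2 above describe one shared 35D reference signal that is interpreted differently by each tracking backend. The sketch below illustrates that interface shape only. The names `ActionProvider`, `PassThroughProvider`, `SmoothingProvider`, and `make_provider` are hypothetical, and the two toy backends do not reproduce how TWIST2 or SONIC actually process the signal.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol, Sequence

REFERENCE_DIM = 35  # size of the shared robot-space reference signal


def check_reference(reference: Sequence[float]) -> List[float]:
    """Validate the shared reference and return it as a list of floats."""
    if len(reference) != REFERENCE_DIM:
        raise ValueError(f"expected {REFERENCE_DIM} values, got {len(reference)}")
    return [float(v) for v in reference]


class ActionProvider(Protocol):
    """Hypothetical common interface: same input, backend-specific output."""
    name: str

    def interpret(self, reference: Sequence[float]) -> List[float]: ...


@dataclass
class PassThroughProvider:
    """Toy backend that forwards the reference unchanged."""
    name: str = "backend_a"

    def interpret(self, reference: Sequence[float]) -> List[float]:
        return check_reference(reference)


@dataclass
class SmoothingProvider:
    """Toy backend that low-pass filters the reference over time."""
    name: str = "backend_b"
    alpha: float = 0.5
    _previous: Optional[List[float]] = field(default=None, repr=False)

    def interpret(self, reference: Sequence[float]) -> List[float]:
        current = check_reference(reference)
        if self._previous is None:
            self._previous = current
        else:
            # Exponential moving average between the new and previous target.
            self._previous = [
                self.alpha * c + (1.0 - self.alpha) * p
                for c, p in zip(current, self._previous)
            ]
        return list(self._previous)


def make_provider(backend: str) -> ActionProvider:
    """Select a provider by name; the upstream caller stays backend-agnostic."""
    providers = {"backend_a": PassThroughProvider, "backend_b": SmoothingProvider}
    return providers[backend]()
```

Because both providers accept the same 35D input, the teleoperation loop can switch backends by name. The different outputs for identical input are what the benchmark refers to as different low-level tracking dynamics.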
Capture
PICO egocentric stream and human motion capture feed the shared teleop loop.
Retarget
GMR produces a robot-space reference signal with shared policy-facing semantics.
Execute
TWIST2 or SONIC interprets and tracks the action through its own backend logic.
Record
Isaac Lab stores NPZ episodes with synchronized observations, state, action, and video.
Convert
Conversion tools normalize recordings into LeRobot-compatible datasets.
Benchmark
Policies are trained and stress-tested across perturbation and GMT-conditioned evaluation.
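The Record stage can be sketched as a shape-checked NPZ round trip. This is a minimal illustration using NumPy's `savez_compressed` and `load`. The array keys `ego_rgb`, `state`, and `action` and the helper names are assumptions, not the benchmark's actual schema. Only the 64D state and 40D action sizes come from the description above.

```python
import io

import numpy as np

STATE_DIM = 64   # canonical state size
ACTION_DIM = 40  # intermediate whole-body action size


def save_episode(target, ego_rgb, state, action):
    """Write one episode to an NPZ archive after checking array shapes.

    `target` may be a file path or a binary file-like object.
    Key names are hypothetical placeholders.
    """
    steps = state.shape[0]
    if state.ndim != 2 or state.shape[1] != STATE_DIM:
        raise ValueError(f"state must be (T, {STATE_DIM}), got {state.shape}")
    if action.shape != (steps, ACTION_DIM):
        raise ValueError(f"action must be ({steps}, {ACTION_DIM}), got {action.shape}")
    if ego_rgb.shape[0] != steps:
        raise ValueError("ego_rgb must have one frame per step")
    np.savez_compressed(target, ego_rgb=ego_rgb, state=state, action=action)


def load_episode(source):
    """Read an episode back into a plain dict of arrays."""
    with np.load(source) as archive:
        return {key: archive[key] for key in archive.files}


if __name__ == "__main__":
    buffer = io.BytesIO()
    steps = 5
    save_episode(
        buffer,
        ego_rgb=np.zeros((steps, 8, 8, 3), dtype=np.uint8),  # tiny dummy frames
        state=np.zeros((steps, STATE_DIM), dtype=np.float32),
        action=np.zeros((steps, ACTION_DIM), dtype=np.float32),
    )
    buffer.seek(0)
    episode = load_episode(buffer)
    print({key: value.shape for key, value in episode.items()})
```

Checking dimensions at write time keeps malformed episodes out of the later conversion step, where a shape mismatch would be harder to trace back to a single recording.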
Evaluation protocol
Four tests plus cross-GMT.
Base
Reference evaluation without additional perturbation, used as the default task performance baseline.
Semantic
Measures sensitivity to higher-level task or environment semantic variation.
Vision
Targets robustness to visual observation shift and perception-side difficulty.
Execution
Focuses on control-side degradation and policy-to-tracker execution mismatch.
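One way to compare the four tests is to aggregate rollouts per task and protocol, then report each perturbed protocol as a drop relative to Base. The sketch below assumes a binary per-rollout success flag as the metric, which the text above does not specify. The function names and the rollout tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PROTOCOLS = ("base", "semantic", "vision", "execution")

Rollout = Tuple[str, str, bool]  # (task, protocol, success), an assumed layout


def success_rates(rollouts: Iterable[Rollout]) -> Dict[Tuple[str, str], float]:
    """Return the success rate for every (task, protocol) pair seen."""
    counts = defaultdict(lambda: [0, 0])  # [successes, total]
    for task, protocol, success in rollouts:
        if protocol not in PROTOCOLS:
            raise ValueError(f"unknown protocol: {protocol}")
        counts[(task, protocol)][0] += int(success)
        counts[(task, protocol)][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def drop_from_base(rates: Dict[Tuple[str, str], float], task: str) -> Dict[str, float]:
    """Success-rate drop of each perturbed protocol relative to Base."""
    base = rates[(task, "base")]
    return {
        protocol: base - rates[(task, protocol)]
        for protocol in PROTOCOLS[1:]
        if (task, protocol) in rates
    }
```

Reporting the drop rather than the raw rate separates how hard a task is from how sensitive the policy is to each perturbation axis.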
Cross-GMT deployment
This section will report how policies transfer across different GMT backends. Plots and summary tables are not yet available.
Example task
Football makes the perturbation axes concrete.
Base
Default task setup with the original ball placement range, scene assets, and nominal lighting.
Vision
Lighting direction changes while the task and target identity remain fixed, stressing appearance robustness.
Semantic
Semantics-preserving asset changes, such as the football goal appearance, probe whether the policy grounds the right target.
Execution
The ball initialization range is expanded, forcing different approach geometry, foot placement, and kick timing.
Cross-GMT
Keep the trained high-level policy fixed but swap the low-level execution backend, measuring whether the same intermediate whole-body actions stay valid under TWIST2 and SONIC.
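The Football variants above can be expressed as small changes to one base configuration, with each axis touching a single field. The sketch below is illustrative only. Every numeric range, the asset names, the field names, and the choice of TWIST2 as the base tracker are placeholder assumptions, not the benchmark's actual settings.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple

Range = Tuple[float, float]


@dataclass(frozen=True)
class FootballSetup:
    ball_x: Range             # ball forward-distance range (placeholder values)
    ball_y: Range             # ball lateral-offset range (placeholder values)
    light_azimuth_deg: float  # lighting direction (placeholder value)
    goal_asset: str           # goal appearance asset (placeholder name)
    tracker: str              # low-level GMT backend used for execution


def scale_range(bounds: Range, factor: float) -> Range:
    """Widen a range about its centre by the given factor."""
    centre = (bounds[0] + bounds[1]) / 2.0
    half = (bounds[1] - bounds[0]) / 2.0 * factor
    return (centre - half, centre + half)


BASE = FootballSetup(
    ball_x=(0.8, 1.0),
    ball_y=(-0.1, 0.1),
    light_azimuth_deg=0.0,
    goal_asset="goal_default",
    tracker="TWIST2",
)

# Each variant changes exactly one aspect of the base setup.
VARIANTS = {
    "base": BASE,
    "vision": replace(BASE, light_azimuth_deg=90.0),
    "semantic": replace(BASE, goal_asset="goal_alternate"),
    "execution": replace(
        BASE,
        ball_x=scale_range(BASE.ball_x, 2.0),
        ball_y=scale_range(BASE.ball_y, 2.0),
    ),
    "cross_gmt": replace(BASE, tracker="SONIC"),
}


def sample_ball(setup: FootballSetup, rng: random.Random) -> Tuple[float, float]:
    """Draw one ball position uniformly from the setup's ranges."""
    return (rng.uniform(*setup.ball_x), rng.uniform(*setup.ball_y))
```

Deriving every variant from `BASE` with `dataclasses.replace` makes it explicit that only one factor differs per axis, so a change in success can be attributed to that factor.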
Results
Success, failure, recovery.
Scenario 1
P&PBox inference
One successful rollout and one failure case, compared on posture control, shelf approach, and task completion.
Scenario 2
Football inference
A successful kick sequence contrasted with a failure sequence, highlighting contact timing and execution sensitivity.
Scenario 3
OpenDoor inference
A successful door traversal compared with a failed or incomplete attempt that breaks down at handle use or body alignment.
Resources
Paper, code, and more.
Code
Codebase, training pipeline, and evaluation scripts.
Coming soon
Dataset
LeRobot-compatible egocentric demonstration dataset.
Coming soon
BibTeX
Citation snippet and author metadata for the paper.
Coming soon