Getting humanoid robots truly out of the laboratory has always been one of the hardest challenges in the field.
Robots in simulation often move smoothly and execute accurately, but once they enter the real world, many seemingly powerful methods quickly fail due to environmental differences. A slight change in ground friction, an added payload, higher sensor noise, or even a gentle push from a person can make the robot stiffen, wobble, or simply collapse. In recent years, researchers have grown increasingly concerned with whether robots can remain stable, natural, and reliable in real-world scenarios without relying on hand-crafted rules or expensive data.
Recently, a joint team from CMU and Meta released a paper, with Yitang Li of Tsinghua's Yao Class as first author, that has attracted widespread attention. The project tries to train robots in a more unified and simpler way: the model accumulates experience through unsupervised interaction in large-scale simulation, and different forms of task prompts, such as rewards, poses, and motion sequences, are compressed into the same latent space.
With this design, robots do not need to be retrained for each task. As long as an appropriate latent vector is generated, they can execute tasks zero-shot in the real environment and quickly recover stable performance in the face of disturbances or changed conditions.
The highlight of this work is not any single technique, but that it is the first to show such natural coherence in a robot's real-world behavior. The robot can respond to pushes and shoves like a human, roll back up from a fall, follow even noisy motion sequences, and, when load or friction suddenly changes, recover stable movement purely through a search in latent space. Compared with traditional methods that require extensive rules, scripts, and task-specific training, this approach is more direct and versatile.
BFM-Zero: making humanoid robots no longer rely on high-quality motion-capture data
Paper address: https://arxiv.org/pdf/2511.04131
Cross-domain capability from simulation to reality
The paper's experiments fall into three parts: zero-shot testing in simulation, zero-shot deployment on a real robot, and rapid adaptation with very little data in special situations. Together, these experiments demonstrate the generalization, robustness, and scalability of BFM-Zero.
In the simulation phase, the researchers mainly use two physics simulators, Isaac and MuJoCo, to test the model comprehensively. The physical characteristics of the two environments differ greatly, which makes them an effective check of whether the policy depends on a specific physical setting.
The experimental tasks fall into three categories: motion tracking, goal-pose reaching, and reward-driven behavior generation. For motion tracking, after heavy physics randomization is added in the Isaac environment, the model is less accurate than in the ideal case, but the error increases only slightly, an acceptably small change.
When the model is dropped directly into MuJoCo, whose physics differ significantly, performance remains stable, with degradation kept within 7%. This suggests the model has learned not the "tricks" of one environment but general laws of motion.
In the reward-optimization tasks, the researchers have the model automatically infer, without task-specific training, the behavior implied by different reward definitions. The difficulty is that such rewards are often sparse and the goals diverse.
For example, some rewards ask the robot to move in a given direction at a specified speed, but physics randomization makes the state distribution complex; some tasks show significant fluctuation and even perform poorly in certain cases.
This is not degradation of the model itself: reward inference relies on random samples from the replay buffer, and physical perturbations make that data more dispersed. The phenomenon actually confirms that the model is facing complex, changing conditions rather than taking shortcuts in a "clean environment".
For the goal-pose reaching task, the model is more robust. Whether or not a target pose appeared in the training data, the robot approaches it smoothly, without abnormal behaviors such as violent shaking or jumping. More importantly, even poses taken from entirely different motion libraries such as AMASS can be reached successfully, indicating that the latent space not only covers the training data but extends beyond it.
The researchers even took motion clips directly from AMASS for the model to follow. Their styles can be quite different from the LAFAN1 data used in training, yet the model still executes them, suggesting the latent space maps these motions into the same "controllable behavior region" and that style differences are no longer a barrier.
When the model is deployed on a real Unitree G1 humanoid, its zero-shot capability becomes even more tangible. In motion-tracking tasks, the robot can not only walk and turn but also perform complex dance moves, athletic movements, and even combat poses.
More importantly, when it becomes unstable it does not stiffen or collapse like traditional robots; it makes natural, human-like adjustments, shifting its center of gravity, bracing against the ground, or rolling to absorb impact, and then stands back up to continue the task.
These natural recovery motions come entirely from the policy's structured latent space and style constraints, not from separately trained skills such as "fall recovery". Even when the motions to be tracked are estimated from low-quality monocular video, the robot can still follow them smoothly, showing strong tolerance to input quality.
In the goal-pose reaching task, the researchers randomly sampled a large number of target poses and had the robot reach them one by one. The robot switches between poses very smoothly, with no hand-crafted interpolation or transition motions, indicating that its latent space has natural continuity. If a pose cannot be achieved exactly in reality (for example, a joint angle beyond its limit), the robot automatically finds the closest natural, safe pose instead of forcing an imitation that would cause falls or jerky motion.
In reward-optimization tasks, the researchers use various reward signals to automatically generate corresponding behaviors. Ask the robot to lower its pelvis height and it sits or squats; reward hand height and it raises a hand; reward velocity and it moves or turns. These rewards can also be combined, for instance making it walk backward while raising a hand.
This composability means that in the future, by describing requirements in language and parsing them into rewards, robots could automatically "understand" what to do. More interestingly, under the same reward, latent vectors inferred from different replay-buffer subsamples differ slightly, producing actions in different styles. This shows the behavior space is multimodal, with multiple feasible solutions rather than one rigid optimal action.
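Why reward prompts compose at all has a simple explanation: in forward-backward-style methods, a reward prompt is typically mapped to a latent by a reward-weighted average of backward embeddings over replay-buffer states, which is linear in the reward. A minimal NumPy sketch, with a random linear map standing in for the learned backward embedding (all names and the estimator are illustrative, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, z_dim = 6, 12
W = rng.normal(size=(z_dim, obs_dim))  # stand-in for the backward embedding B(s)

def infer_latent(states, rewards):
    """Reward-to-latent inference: average of B(s) weighted by r(s)
    over states sampled from the replay buffer."""
    return (rewards[:, None] * (states @ W.T)).mean(axis=0)

buffer = rng.normal(size=(256, obs_dim))
r_hand = buffer[:, 0]    # toy "raise the hand" reward
r_back = -buffer[:, 1]   # toy "move backward" reward

z_hand = infer_latent(buffer, r_hand)
z_back = infer_latent(buffer, r_back)
z_both = infer_latent(buffer, r_hand + r_back)

# Linearity: the latent of a summed reward is the sum of the latents,
# which is what makes reward prompts composable.
assert np.allclose(z_both, z_hand + z_back)
```

The multimodality observed in the experiments then follows from the buffer subsampling: different subsets of states yield slightly different averages, hence slightly different latents and styles.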
BFM-Zero, Make humanoid robots no longer rely on high-quality motion capture data
Facing large external disturbances in the real environment, the robot shows remarkable flexibility and stability. When pushed, kicked, or pulled, it does not simply resist stiffly; it absorbs the impact gently, stepping back to cushion its center of gravity or adjusting arm posture to keep balance.
Even if it falls completely to the ground, it can climb back up with natural, smooth movements and return to its original task, such as resuming standing or the target pose. These recovery actions are not hard-coded but are behaviors the policy naturally expresses in the latent space, making the robot appear more "human-like".
Finally, the researchers demonstrate fast adaptation. During adaptation the network weights are not touched; only the latent vector is optimized for the new situation. The first case adds a four-kilogram load to the robot's torso. The zero-shot latent was not enough to support single-leg standing, but after twenty iterations of cross-entropy optimization a new latent vector is found that lets the robot stand stably under load for more than fifteen seconds, and the optimized latent transfers directly to the real robot.
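The cross-entropy search over the latent space can be sketched generically: maintain a Gaussian over latents, sample candidates, score each by rolling out the frozen policy, and refit the Gaussian to the elites. The `evaluate` function, dimensions, and hyperparameters below are hypothetical stand-ins for the paper's actual rollout-and-score procedure:

```python
import numpy as np

def cem_latent_search(evaluate, dim, iters=20, pop=64, elite_frac=0.1, seed=0):
    """Cross-entropy method over a latent vector. `evaluate(z)` is assumed
    to roll out the frozen policy conditioned on z (e.g., in simulation
    with the 4 kg load) and return a scalar score, such as seconds of
    stable single-leg standing."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([evaluate(z) for z in candidates])
        elites = candidates[np.argsort(scores)[-n_elite:]]  # top scorers
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu

# Toy check: a score function that peaks at a known target latent.
target = np.full(8, 0.5)
best = cem_latent_search(lambda z: -np.sum((z - target) ** 2), dim=8)
```

With a real robot in the loop, `evaluate` would be a simulated rollout, so twenty iterations of this loop matches the scale of compute the paper reports.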
The second case is an unstable jumping trajectory caused by changed friction. The researchers optimized the latent-vector sequence with a sampling-based dual-annealing method, ultimately reducing trajectory error by nearly 30% and making the whole motion more stable. The process does not rely on retraining the model at all, only on the flexibility of the latent space.
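SciPy's `dual_annealing` can illustrate this sequence-level search. Here the policy rollout is faked with a quadratic error surface around a known reference sequence; in the paper the objective would be the measured tracking error of the physical rollout, and all shapes and bounds below are illustrative:

```python
import numpy as np
from scipy.optimize import dual_annealing

HORIZON, Z_DIM = 4, 4  # short latent sequence, small latent dim (toy sizes)

def tracking_error(z_flat):
    """Stand-in for rolling out the frozen policy with a latent-vector
    sequence and measuring error against the reference jump trajectory.
    Here: a quadratic bowl around a known reference sequence."""
    z_seq = z_flat.reshape(HORIZON, Z_DIM)
    ref = np.linspace(0.2, 0.8, HORIZON)[:, None] * np.ones(Z_DIM)
    return float(np.sum((z_seq - ref) ** 2))

# Search the flattened sequence within box bounds.
bounds = [(-2.0, 2.0)] * (HORIZON * Z_DIM)
result = dual_annealing(tracking_error, bounds, maxiter=100, seed=0)
best_sequence = result.x.reshape(HORIZON, Z_DIM)
```

Because the search only perturbs latent inputs to a frozen policy, it needs no gradients through the simulator, which is why it suits on-robot adaptation.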
A three-step framework toward a universal behavior model
Overall, the study's pipeline can be divided into three stages: unsupervised pre-training, zero-shot inference, and few-shot adaptation.
The researchers want robots facing different types of tasks to understand the task, generate motions, and stay stable under changing conditions without a separate training recipe for each. This design not only unifies the training phase but also makes later real-world deployment more flexible.
In the unsupervised pre-training phase, the model accumulates experience by interacting with a large number of simulation environments, without explicit task rewards. To let the robot handle all kinds of tasks, the researchers construct a unified latent space that maps rewards, target poses, motion sequences, and other information to the same latent representation.
This latent space is built with the forward-backward (FB) method, which lets the robot infer the corresponding latent vector from its own trajectories or from task prompts. To give the model a sufficiently broad base of experience, 1,024 parallel Isaac physics environments were used during training. They run at high frequency, simulating whole-body joint dynamics, ground-contact friction, and changes in gravity. Over the course of training, the model accumulated more than five million interaction samples, forming a comprehensive bank of behavioral experience.
Beyond broad environmental experience, training also introduces rich physics randomization. The researchers randomly vary the mass distribution of the robot's parts, adjust the ground friction coefficient, apply random external forces, change the initial body pose, and add sensor noise during simulation.
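The per-episode randomization just described can be sketched as a simple sampler; the parameter names and ranges below are illustrative guesses, not the paper's actual values:

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodePhysics:
    mass_scale: float     # multiplier on each body part's mass
    friction: float       # ground friction coefficient
    push_force: float     # magnitude of a random external push, in newtons
    obs_noise_std: float  # std of additive sensor noise

def sample_physics(rng: random.Random) -> EpisodePhysics:
    """Draw a fresh physics configuration at the start of each episode,
    mirroring the kinds of randomization described in the text."""
    return EpisodePhysics(
        mass_scale=rng.uniform(0.8, 1.2),
        friction=rng.uniform(0.3, 1.2),
        push_force=rng.uniform(0.0, 50.0),
        obs_noise_std=rng.uniform(0.0, 0.05),
    )

cfg = sample_physics(random.Random(0))
```

Resampling this configuration every episode forces the policy to succeed across the whole distribution rather than overfit one physical setting.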
These randomized settings approximate real-world uncertainty, ensuring the trained policy does not collapse on deployment over small differences between reality and simulation. Meanwhile, to make the robot's motions more human-like, the researchers also introduce motion datasets as style references and use a style discriminator to preserve the structure of natural movement: arm swings and shifts of the body's center of gravity, for example, look more human because of the style constraint.
To keep the policy from learning dangerous actions, hardware-related safety constraints are also added during training, such as limiting joint-angle ranges, preventing awkward collisions with the ground, and capping excessive body displacement. These auxiliary rewards keep the model away from action patterns that are effective but unsafe, and protect the robot hardware in later real-world experiments.
In the zero-shot inference stage, the model can already interpret different task prompts, so its network no longer needs training. Given a new task, it only needs to generate a corresponding latent vector z according to the task type; this vector expresses the task requirement, and the policy network generates the corresponding actions from it.
If the task is reward-based, the latent vector is inferred from replay-buffer experience via the relationship between the reward signal and the backward embedding. If the task is goal-pose reaching, the researchers feed the target state directly into the backward embedding to produce the latent vector. For motion tracking, the model embeds the target motions of the next few time steps into the latent space, producing a continuous sequence of latent vectors that is executed step by step.
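The three prompt-to-latent routes can be sketched side by side. The linear `backward_embed` below is a toy stand-in for the learned backward network, and the reward estimator follows a common forward-backward recipe rather than the paper's exact formula:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, Z_DIM = 8, 16
W = rng.normal(size=(Z_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)

def backward_embed(state):
    """Toy backward embedding B(s): here, a fixed linear map."""
    return W @ state

# 1) Reward prompt: reward-weighted average of B(s) over buffer states.
buffer = rng.normal(size=(512, OBS_DIM))
reward = buffer[:, 0]                       # toy reward on one feature
z_reward = (reward[:, None] * (buffer @ W.T)).mean(axis=0)

# 2) Goal-pose prompt: embed the target state directly.
goal_state = rng.normal(size=OBS_DIM)
z_goal = backward_embed(goal_state)

# 3) Tracking prompt: embed each upcoming reference frame into a sequence.
reference_motion = rng.normal(size=(10, OBS_DIM))
z_sequence = np.stack([backward_embed(s) for s in reference_motion])
```

Whichever route produces it, the latent has the same shape, so the same frozen policy network consumes all three prompt types.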
In practice, this means the robot need not be retrained for each task: as long as an appropriate latent vector can be generated, it can directly perform motions, move to target poses, or adjust its behavior according to rewards.
In the few-shot adaptation stage, the model faces conditions not seen in training, such as a sudden increase in load or a hard-to-model change in ground friction. To restore performance quickly in the real world, the researchers do not modify the network; they search the latent space for a vector better suited to the new conditions.
Because the latent space is highly expressive, finding a suitable vector is enough for the robot to regain stable performance. In the single-pose task, the researchers use the cross-entropy method, repeatedly sampling candidate latent vectors and evaluating them to converge on a good solution.
In the dynamic-trajectory task, a sampling-based dual-annealing strategy searches for a latent-vector sequence through repeated perturbation and convergence, re-stabilizing the robot's motion trajectory. Because this adaptation needs little data, costs little, and converges quickly, it suits the rapid-adjustment demands of real-world scenarios.
Together, the three stages form the complete path from training to deployment: learn general motion structure in diverse environments, execute practical tasks directly without further training, and fine-tune with a small amount of data when special situations arise, so the robot shows strong generalization and adaptability in complex environments.
A key step toward generality
The significance of this research spans several aspects and will help drive the future development of humanoid robots.
First, it shows that unsupervised reinforcement learning can deliver results on real humanoid robots. Past successes in getting humanoids to perform complex motions mostly relied on large amounts of imitation data or carefully designed task rewards. This work proves that even without explicit rewards or finely labeled motion trajectories, robots can form generalizable behavioral abilities through exploration and style learning in large-scale simulation. Humanoid robots, in other words, do not necessarily need expensive data to learn stable, rich motor skills.
Second, the motions this method generates are markedly more natural and flexible. Traditional humanoids often react stiffly to external forces, managing only rigid bracing motions; a slight change in force direction can destabilize them. The policy trained here responds to disturbances with more coherent, fluid adjustments, such as shifting the center of gravity, changing its step, and naturally steadying the body.
Even under hard pushes and shoves, the robot handles the disturbance in a gentle, unhurried way, closer to the stabilization mechanisms of human movement. This suggests the motion patterns learned in the latent space have inherent coordination rather than simple mechanical correction.
Furthermore, the method lays the groundwork for future humanoids that are prompt-controllable and understand generalized task intent. Since all behaviors are mapped uniformly into the latent space, the robot can combine latent vectors and adjust its behavior accordingly.
In the future, a robot would need only a high-level task description, a target pose, an overall intention, or a reward preference, to organize the corresponding actions automatically, with no task-specific policy retraining. This design takes a step toward a "behavior-level foundation model", making robots easier to scale, easier to control, and closer to the goal of general intelligence.
The method is also highly adaptable to reality. The heavy randomization added in training keeps the policy stable under different dynamics. In real environments, when the load changes, ground friction varies, or the required motion suddenly shifts, the robot does not need retraining; slight adjustment in the latent space quickly restores reliable performance. This greatly improves usability in the real world, letting the robot cope with complex, changing physical conditions.
Finally, the study frees itself from dependence on high-quality motion-capture data. Previously, making a robot's movements look natural required professional equipment to collect large amounts of high-precision human motion data, at great expense. Here, unlabeled motion sequences suffice for the model to learn the overall style of human movement, lowering the cost of data collection and making training more flexible.
In short, this work not only provides a training method highly consistent between simulation and reality, but also builds a latent behavior space that is generalizable, natural, stable, and adaptive, laying a foundation for more intelligent, more general humanoid robots.
GAIR 2025: letting technology step out of the papers
On December 12-13, 2025, the 8th GAIR Global Artificial Intelligence and Robotics Conference will be held at the Sheraton Shenzhen Nanshan · Bolin Tianrui Hotel.
The world model is the "cognitive core" through which embodied intelligence understands and transforms the world. At the GAIR conference's World Model sub-forum, renowned scholars from top universities and research institutions at home and abroad will deliver themed talks on the exploration and breakthroughs of world models and spatial intelligence in embodied robotics, jointly examining the field's latest real-world progress.
In the forum's roundtable session, scholars will discuss key questions such as how world models can bridge the gap between simulation and reality. Leading industry R&D teams will also share their experience applying cutting-edge world-model theory to physical robots and solving complex scenario tasks.
We look forward to witnessing with you how the world model injects a true "soul" into embodied intelligence, opening a new chapter of autonomous decision-making and action for robots.