No-data Imitation Learning

Teaching robots motor skills using only AI-generated videos; no motion capture, no expert demonstrations, no curated data. Just a text prompt and a physics simulator.


Mert Albaba  ·  Chenhao Li  ·  Markos Diomataris  ·  Omid Taheri  ·  Andreas Krause  ·  Michael J. Black

ETH Zürich  ·  Max Planck Institute for Intelligent Systems

CVPR 2026
NIL Overview

Watch the Explainer

A short walkthrough of how NIL works, the results, and why it matters.

Teaching robots to move
shouldn't require mountains of data

Current approaches to robot learning face fundamental bottlenecks. NIL eliminates them all.

🏋

Reinforcement Learning

Requires painstaking manual reward engineering for each new task and robot. Poorly specified rewards lead to unintended behavior.

📹

Imitation Learning

Needs expensive, high-quality 3D motion-capture data, which is challenging to obtain for non-humanoid robots and animals.

NIL: Our Approach

Uses AI-generated videos as the sole source of demonstration. No curated data, just a text prompt.

Key Insight: Video diffusion models generate realistic-looking motion for any morphology. While these videos aren't physically accurate, a physics simulator can enforce plausibility. NIL combines both: the video provides visual guidance, while the simulator enforces physical constraints.

Two stages. Zero data.

NIL generates a reference video from a text prompt, then trains a control policy to physically replicate it in simulation.

1

Generate Reference Video

Render the robot's initial frame. Feed it, together with a task description (e.g., "Robot is walking"), into a pretrained video diffusion model to generate a realistic 2D video of the desired motion.

2

Learn by Imitating the Video

Train a reinforcement learning policy in a physics simulator to match the generated video. The reward compares the simulation rendering against the reference video using video encoders and segmentation masks.

NIL method overview — full pipeline

How does the robot know it's
imitating correctly?

NIL computes a discriminator-free imitation reward by comparing the rendered simulation against the generated video through three complementary signals.

🎬

Video Similarity

A video vision transformer (TimeSformer) embeds both videos. Cosine similarity between their embeddings provides temporal and semantic guidance.

🎭

Mask IoU

Segment the robot's body in both videos using SAM2. Compute frame-by-frame IoU between masks for precise spatial alignment.

⚖️

Regularization

Penalize excessive joint torques, angular velocities, and action deltas to ensure smooth, stable, physically plausible motion.

R_t = ζ · S_video + β · S_mask + η · P_reg
The combined reward at each timestep guides the policy toward natural, accurate motion.
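The three reward signals above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embeddings would come from TimeSformer, the masks from SAM2, and the weights `zeta`, `beta`, `eta` and the exact penalty form are assumptions for the sketch.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two video embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def mask_iou(mask_sim, mask_ref):
    """Intersection-over-union of two boolean robot masks for one frame."""
    inter = np.logical_and(mask_sim, mask_ref).sum()
    union = np.logical_or(mask_sim, mask_ref).sum()
    return float(inter) / float(union) if union else 0.0

def step_reward(emb_sim, emb_ref, mask_sim, mask_ref,
                torques, ang_vel, action_delta,
                zeta=1.0, beta=1.0, eta=1.0):
    """Combined reward R_t = zeta*S_video + beta*S_mask + eta*P_reg.
    Weights and the quadratic penalty form are illustrative choices."""
    s_video = cosine_similarity(emb_sim, emb_ref)
    s_mask = mask_iou(mask_sim, mask_ref)
    # Regularization: penalize large torques, angular velocities, action deltas
    p_reg = -(np.sum(np.square(torques))
              + np.sum(np.square(ang_vel))
              + np.sum(np.square(action_delta)))
    return zeta * s_video + beta * s_mask + eta * p_reg
```

With a perfect match (identical embeddings, identical masks, zero torques and velocities), the sketch returns 1 + 1 + 0 = 2, and it decreases as the rendering drifts from the reference or the motion becomes aggressive.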

Walking across diverse robots

NIL learns locomotion for four different robots from a single generated video each. Left: AI-generated reference. Right: physically plausible learned policy.

Generated  Unitree H1

Kling AI — "The H1 robot is walking"

NIL  Unitree H1

Learned natural bipedal gait

Generated  Talos

Kling AI — heavy-duty humanoid

NIL  Talos

Learned walking for complex morphology

Generated  Unitree G1

Kling AI — compact humanoid

NIL  Unitree G1

Learned compact humanoid locomotion

Generated  Unitree A1

Pika — quadruped robot

NIL  Unitree A1

Same method, four-legged gait

NIL vs. AMP (Motion-Capture Baseline)

NIL matches AMP's performance without any motion-capture data; AMP requires 25 curated trajectories per robot.

NIL  H1 — No curated data

AMP  H1 — 25 MoCap trajectories

NIL  Talos — No curated data

AMP  Talos — 25 MoCap trajectories

NIL  G1 — No curated data

AMP  G1 — 25 MoCap trajectories

NIL  A1 — No curated data

AMP  A1 — 25 MoCap trajectories

Beyond walking: complex skills
from a single video

NIL tackles whole-body manipulation — sitting, hanging, balancing — matching RL baselines that use hand-designed reward functions.

Generated  Sit

Generated  Hang

Generated  Balance

NIL  Sit

NIL  Hang

NIL  Balance

Understanding what matters

We systematically analyze each component of NIL to understand its contribution.

Each reward component contributes to final quality. Video similarity provides the strongest standalone signal, but all together yield the best result.

All Components

Best result

No Regularization

Jittery motion

No Mask IoU

Distorted behavior

No Video Sim.

Slow, jittery

Only Reg.

Fails to walk straight

Only IoU

Cannot walk

Only Video Sim.

Walks but stops

Reference Video

Generated by Kling

Reward Ablation Table
Quantitative ablation of reward components on Unitree H1.

NIL works across multiple video diffusion models. Better visual quality directly translates to better policies.

Kling AI

Best quality → Best NIL

Pika

Runway Gen-3

OpenAI Sora

Stable Video Diffusion

NIL (Kling)

Resulting policy

LPIPS correlation
Better video quality (lower LPIPS) correlates with better NIL performance.

As video diffusion models improve, NIL directly benefits. Kling v1.0 vs v1.6 shows how better quality yields more natural gaits.

Kling v1.0 Generated

Unbalanced, asymmetric

NIL from Kling v1.0

Learns but with artifacts

Kling v1.6 Generated

Natural, symmetric gait

NIL from Kling v1.6

Significantly more natural

We generate 3 videos per prompt and select the best via optical flow variance and pixel MSE. Training on rejected (sub-optimal) videos shows only mild degradation, indicating NIL is robust to imperfect generations.
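A simple scoring scheme along these lines can rank candidate generations. This sketch uses frame differences as a crude stand-in for true optical flow (the paper computes optical-flow variance; the way the two scores are combined here is also an assumption).

```python
import numpy as np

def frame_mse(video: np.ndarray) -> float:
    """Mean squared difference between consecutive frames of a (T, H, W) clip."""
    diffs = np.diff(video.astype(np.float64), axis=0)
    return float(np.mean(np.square(diffs)))

def motion_variance(video: np.ndarray) -> float:
    """Variance of per-frame motion magnitude; a frame-difference proxy
    for the optical-flow variance used in the paper."""
    diffs = np.abs(np.diff(video.astype(np.float64), axis=0))
    per_frame = diffs.mean(axis=(1, 2))
    return float(np.var(per_frame))

def select_best(videos) -> int:
    """Pick the candidate with the steadiest motion: low motion variance
    and small inter-frame change (an illustrative combination)."""
    scores = [motion_variance(v) + frame_mse(v) for v in videos]
    return int(np.argmin(scores))
```

Given three generations per prompt, `select_best` returns the index of the clip with the most temporally consistent motion.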

Video Selection Robustness
Table 1 (Supp.) — Robustness to video selection. NIL trained on rejected sub-optimal videos.

NIL is reproducible without proprietary video models. Training with open-source diffusion models (WAN, LTX) yields strong performance, confirming the approach is not dependent on any specific commercial API.

Open-Source Model Results
Table 2 (Supp.) — NIL with open-source video diffusion models (WAN, LTX).

NIL remains strong under different camera settings: static views, 45° azimuth rotation, field-of-view jitter (5%/15%), and multi-view setups. Multi-view even improves performance on some robots.

Camera Sensitivity
Table 3 (Supp.) — Camera sensitivity analysis across different viewpoint settings.

Generated videos run at 24 FPS while simulations render at 100 FPS. NIL upsamples via frame interpolation. Testing without upsampling or with alternative rates shows the method is robust to temporal resolution choices.
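The 24→100 FPS upsampling can be sketched with linear blending between neighboring frames. This is an illustrative stand-in; NIL's actual interpolation method may differ, and the function assumes a grayscale (T, H, W) clip with at least two frames.

```python
import numpy as np

def upsample_video(frames: np.ndarray, src_fps: int = 24, dst_fps: int = 100) -> np.ndarray:
    """Linearly interpolate a (T, H, W) frame sequence to a higher frame
    rate. Resampled timestamps stop just short of the final source frame."""
    t_src = np.arange(len(frames)) / src_fps          # source timestamps
    t_dst = np.arange(0.0, t_src[-1], 1.0 / dst_fps)  # target timestamps
    # Index of the source frame at or before each target timestamp
    idx = np.searchsorted(t_src, t_dst, side="right") - 1
    idx = np.clip(idx, 0, len(frames) - 2)
    # Blend weight toward the next source frame, broadcast over H and W
    w = ((t_dst - t_src[idx]) * src_fps)[:, None, None]
    f = frames.astype(np.float64)
    return (1 - w) * f[idx] + w * f[idx + 1]
```

For example, two frames at 24 FPS span ~41.7 ms, so resampling at 100 FPS yields five blended frames in that interval.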

Frame Interpolation Robustness
Table 4 (Supp.) — Robustness to frame interpolation settings.

BibTeX

@inproceedings{albaba2025nil,
  title={NIL: No-data Imitation Learning},
  author={Albaba, Mert and Li, Chenhao and Diomataris, Markos and Taheri, Omid and Krause, Andreas and Black, Michael J.},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}