NIL: No-data Imitation Learning

The Challenge

Teaching robots to move shouldn't require mountains of data

The two dominant approaches to robot learning each face a fundamental bottleneck. NIL is designed to avoid both.

🏋

Reinforcement Learning

Requires painstaking manual reward engineering for each new task and robot. Poorly specified rewards lead to unintended behavior.

📹

Imitation Learning

Needs expensive, high-quality 3D motion-capture data. Challenging to obtain for non-humanoid robots and animals.

✨

NIL: Our Approach

Uses AI-generated videos as the sole source of demonstration. No curated data, just a text prompt.

Key Insight: Video diffusion models generate realistic-looking motion for any morphology. These videos aren't physically accurate, but a physics simulator can enforce plausibility. NIL combines the two: the video provides visual guidance, and the simulator enforces physical constraints.

How It Works

Two stages. Zero data.

NIL generates a reference video from a text prompt, then trains a control policy to physically replicate it in simulation.

Generate Reference Video

Render the robot's initial frame. Feed it, together with a task description (e.g., "Robot is walking"), into a pretrained video diffusion model to generate a realistic 2D video of the desired motion.

Learn by Imitating the Video

Train a reinforcement learning policy in a physics simulator to match the generated video. The reward compares the simulation rendering against the reference video using video encoders and segmentation masks.

The Reward Signal

How does the robot know it's imitating correctly?

NIL computes a discriminator-free imitation reward by comparing the rendered simulation against the generated video through three complementary signals.

🎬

Video Similarity

A video vision transformer (TimeSformer) embeds both the rendered and the generated video. Cosine similarity between the two embeddings provides temporal and semantic guidance.
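As a rough illustration of the similarity step only, the sketch below computes cosine similarity between two embedding vectors with NumPy. The vectors are stand-ins: in NIL they would come from the video encoder, which is not reproduced here, and the function name and the small `eps` term are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # eps guards against division by zero for all-zero embeddings (assumption).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Stand-in embeddings; real ones would be produced by the video encoder.
ref_embedding = np.array([1.0, 0.0, 2.0])  # generated reference video
sim_embedding = np.array([2.0, 0.0, 4.0])  # rendered simulation video
s_video = cosine_similarity(ref_embedding, sim_embedding)
```

Because cosine similarity ignores vector magnitude, two embeddings that point in the same direction score close to 1 regardless of scale.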

🎭

Mask IoU

Segment the robot's body in both videos using SAM2. Compute frame-by-frame IoU between masks for precise spatial alignment.
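A minimal sketch of the frame-by-frame IoU computation, assuming the segmentation masks have already been extracted as boolean arrays of shape (T, H, W). The function name and the handling of empty masks are assumptions for illustration; the segmentation model itself is not called here.

```python
import numpy as np

def mask_iou_per_frame(ref_masks, sim_masks):
    """Per-frame IoU between two stacks of boolean masks of shape (T, H, W)."""
    ref = np.asarray(ref_masks, dtype=bool)
    sim = np.asarray(sim_masks, dtype=bool)
    inter = np.logical_and(ref, sim).sum(axis=(1, 2))
    union = np.logical_or(ref, sim).sum(axis=(1, 2))
    # Assumption: if both masks are empty in a frame, score that frame as 0.
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

# Two toy 2x2 frames: the first overlaps partially, the second exactly.
ref = np.array([[[1, 1], [0, 0]], [[0, 1], [0, 1]]], dtype=bool)
sim = np.array([[[1, 0], [0, 0]], [[0, 1], [0, 1]]], dtype=bool)
ious = mask_iou_per_frame(ref, sim)
```

In the first toy frame the masks share one pixel out of two covered pixels, so the IoU is 0.5; in the second they coincide, so the IoU is 1.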

⚖️

Regularization

Penalize excessive joint torques, angular velocities, and action deltas to ensure smooth, stable, physically plausible motion.
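One common way to express such a penalty is a weighted sum of squared magnitudes. The sketch below follows that pattern; the squared form, the weights, and the sign convention (a non-positive value that is added to the reward) are all assumptions for illustration, not values taken from NIL.

```python
import numpy as np

def regularization_penalty(torques, joint_vel, action, prev_action,
                           w_torque=1e-4, w_vel=1e-3, w_delta=1e-2):
    """Non-positive penalty on torques, joint velocities, and action changes.

    All weights are hypothetical placeholders.
    """
    torque_term = np.sum(np.square(torques))
    vel_term = np.sum(np.square(joint_vel))
    delta_term = np.sum(np.square(np.asarray(action) - np.asarray(prev_action)))
    return -float(w_torque * torque_term + w_vel * vel_term + w_delta * delta_term)

p_reg = regularization_penalty(
    torques=[1.0, 2.0], joint_vel=[1.0], action=[0.5], prev_action=[0.0]
)
```

With this convention the penalty is 0 for a motionless, torque-free step and grows more negative as the motion becomes more aggressive or jittery.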

R_t = ζ · S_video + β · S_mask + η · P_reg

The combined reward at each timestep guides the policy toward natural, accurate motion.
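To make the weighted sum concrete, here is a small sketch of the per-timestep reward. The weight values are hypothetical, and the example assumes that P_reg is supplied as a non-positive penalty so that adding it lowers the reward.

```python
def combined_reward(s_video, s_mask, p_reg, zeta=1.0, beta=1.0, eta=1.0):
    """R_t = zeta * S_video + beta * S_mask + eta * P_reg.

    zeta, beta, eta are placeholder weights, not values from NIL.
    p_reg is assumed to be non-positive (a penalty).
    """
    return zeta * s_video + beta * s_mask + eta * p_reg

# Toy values for a single timestep.
r_t = combined_reward(s_video=0.8, s_mask=0.5, p_reg=-0.1,
                      zeta=1.0, beta=2.0, eta=0.5)
```

For these toy inputs the three terms are 0.8, 1.0, and -0.05, so the reward is 1.75.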

Results — Locomotion

Walking across diverse robots

NIL learns locomotion for four different robots from a single generated video each. Left: AI-generated reference. Right: physically plausible learned policy.

Generated Unitree H1

Kling AI — "The H1 robot is walking"

NIL Unitree H1

Learned natural bipedal gait

Generated Talos

Kling AI — heavy-duty humanoid

NIL Talos

Learned walking for complex morphology

Generated Unitree G1

Kling AI — compact humanoid

NIL Unitree G1

Learned compact humanoid locomotion

Generated Unitree A1

Pika — quadruped robot

NIL Unitree A1

Same method, four-legged gait

NIL vs. AMP (Motion-Capture Baseline)

NIL matches AMP's performance despite using zero motion-capture data, while AMP requires 25 curated trajectories per robot.

NIL H1 — No curated data

AMP H1 — 25 MoCap trajectories

NIL Talos — No curated data

AMP Talos — 25 MoCap trajectories

NIL G1 — No curated data

AMP G1 — 25 MoCap trajectories

NIL A1 — No curated data

AMP A1 — 25 MoCap trajectories

Results — Whole-Body Manipulation

Beyond walking: complex skills from a single video

NIL tackles whole-body manipulation — sitting, hanging, balancing — matching RL baselines that use hand-designed reward functions.

Generated Sit

Generated Hang

Generated Balance

NIL Sit

NIL Hang

NIL Balance

Ablations

Understanding what matters

We systematically analyze each component of NIL to understand its contribution.

Each reward component contributes to final quality. Video similarity provides the strongest standalone signal, but combining all three components yields the best result.

All Components

Best result

No Regularization

Jittery motion

No Mask IoU

Distorted behavior

No Video Sim.

Slow, jittery

Only Reg.

Fails to walk straight

Only IoU

Cannot walk

Only Video Sim.

Walks but stops

Reference Video

Generated by Kling

Quantitative ablation of reward components on Unitree H1.

NIL works across multiple video diffusion models. Higher visual quality in the generated video tends to yield better policies.

Kling AI

Best quality → Best NIL

Pika

Runway Gen-3

OpenAI Sora

Stable Video Diffusion

NIL (Kling)

Resulting policy

Better video quality (lower LPIPS) correlates with better NIL performance.

As video diffusion models improve, NIL benefits directly. Comparing Kling v1.0 with v1.6 shows how higher-quality video yields more natural gaits.

Kling v1.0 Generated

Unbalanced, asymmetric

NIL from Kling v1.0

Learns but with artifacts

Kling v1.6 Generated

Natural, symmetric gait

NIL from Kling v1.6

Significantly more natural

We generate 3 videos per prompt and select the best via optical flow variance and pixel MSE. Training on rejected (sub-optimal) videos shows only mild degradation, indicating NIL is robust to imperfect generations.

Table 1 (Supp.) — Robustness to video selection. NIL trained on rejected sub-optimal videos.

NIL is reproducible without proprietary video models. Training with open-source diffusion models (WAN, LTX) yields strong performance, confirming the approach is not dependent on any specific commercial API.

Table 2 (Supp.) — NIL with open-source video diffusion models (WAN, LTX).

NIL remains strong under different camera settings: static views, 45° azimuth rotation, field-of-view jitter (5%/15%), and multi-view setups. Multi-view even improves performance on some robots.

Table 3 (Supp.) — Camera sensitivity analysis across different viewpoint settings.

Generated videos run at 24 FPS while simulations render at 100 FPS. NIL upsamples via frame interpolation. Testing without upsampling or with alternative rates shows the method is robust to temporal resolution choices.
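The source does not specify which interpolation method is used, so the following is only a sketch of the simplest option: linearly blending neighbouring frames to resample a 24 FPS clip onto a 100 FPS timeline. The function name and the linear blend are assumptions; a learned frame interpolator could be substituted.

```python
import numpy as np

def resample_video(frames, src_fps=24.0, dst_fps=100.0):
    """Linearly interpolate frames of shape (T, ...) from src_fps to dst_fps."""
    frames = np.asarray(frames, dtype=np.float64)
    n_src = frames.shape[0]
    duration = (n_src - 1) / src_fps  # seconds spanned by the clip
    n_dst = int(np.floor(duration * dst_fps + 1e-9)) + 1
    # Position of each output frame, measured in source-frame units.
    t = np.arange(n_dst) / dst_fps * src_fps
    lo = np.clip(np.floor(t).astype(int), 0, n_src - 1)
    hi = np.clip(lo + 1, 0, n_src - 1)
    # Blend weight, reshaped so it broadcasts over the image dimensions.
    w = (t - lo).reshape(-1, *([1] * (frames.ndim - 1)))
    return (1.0 - w) * frames[lo] + w * frames[hi]

# One second of 1x1 "video" whose pixel value equals the source frame index.
clip = np.arange(25, dtype=np.float64).reshape(25, 1, 1)
upsampled = resample_video(clip)
```

A one-second clip of 25 source frames becomes 101 output frames, and each output pixel advances by 0.24 source frames per step.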

Table 4 (Supp.) — Robustness to frame interpolation settings.
