arxiv:2604.01043

ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Published on Apr 1 · Submitted by Fengyuan Yang on Apr 7
Abstract

AI-generated summary: ONE-SHOT enables compositional human-environment video generation through disentangled signals, dynamic positional embeddings, and hybrid context integration for improved control and diversity.

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility, and their heavy 3D pre-processing limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at: https://martayang.github.io/ONE-SHOT/.
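The abstract describes a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention, but gives no implementation details. The sketch below is only a minimal illustration of that general idea, assuming separate cross-attention pathways for motion and environment conditioning; the module name, token shapes, and layer layout are invented for illustration and are not the authors' architecture.

```python
# Illustrative sketch (not the paper's code): inject canonical-space motion tokens
# and environment tokens into a video latent stream through separate cross-attention
# pathways, so the two conditioning signals never share a key/value space.
import torch
import torch.nn as nn

class DecoupledMotionInjection(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.motion_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.env_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, motion_tokens, env_tokens):
        # video_tokens:  (B, N_video, dim) latent tokens from the video backbone
        # motion_tokens: (B, N_motion, dim) canonical-space human dynamics
        # env_tokens:    (B, N_env, dim) environment / scene conditioning
        h, _ = self.motion_attn(video_tokens, motion_tokens, motion_tokens)
        x = video_tokens + h
        h, _ = self.env_attn(x, env_tokens, env_tokens)
        return self.norm(x + h)

if __name__ == "__main__":
    block = DecoupledMotionInjection(dim=512)
    out = block(torch.randn(2, 1024, 512), torch.randn(2, 64, 512), torch.randn(2, 256, 512))
    print(out.shape)  # torch.Size([2, 1024, 512])
```

The separation of the two attention pathways is the only point being illustrated here; how ONE-SHOT actually parameterizes the canonical space or merges the two signals is not specified in the abstract.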

Community

Paper author · Paper submitter

We present ONE-SHOT, a compositional human-environment video synthesis framework that enables flexible recombination of scene, identity, motion, and camera trajectory.
The key ideas are spatially decoupled motion injection and Dynamic-Grounded-RoPE for cross-scene grounding, avoiding explicit 3D alignment heuristics (an illustrative positional-embedding sketch follows this note).
It supports controllable long-horizon generation with stronger human-scene consistency and revisit stability.
Project page and code are linked below.
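Neither the abstract nor this note specifies how Dynamic-Grounded-RoPE establishes correspondences across spatial domains. As a point of reference only, the snippet below shows a generic rotary positional embedding driven by continuous, per-token positions rather than a fixed index grid; the function name, frequency schedule, and the idea of feeding "grounded" coordinates as positions are assumptions, not the paper's formulation.

```python
# Generic rotary embedding over continuous per-token positions (illustrative only).
import torch

def rotary_embed(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs of x by phases derived from continuous positions.

    x:         (B, N, D) token features, D must be even
    positions: (B, N) continuous per-token positions (e.g. grounded coordinates)
    """
    B, N, D = x.shape
    half = D // 2
    # Standard RoPE-style frequency schedule.
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = positions.unsqueeze(-1) * freqs          # (B, N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example with hypothetical "grounded" x-coordinates used as positions.
tokens = torch.randn(1, 16, 64)
grounded_x = torch.linspace(0.0, 3.0, 16).unsqueeze(0)
print(rotary_embed(tokens, grounded_x).shape)  # torch.Size([1, 16, 64])
```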


Get this paper in your agent:

hf papers read 2604.01043
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
