How to use from the Diffusers library

pip install -U diffusers transformers accelerate

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the weights in bfloat16, then move the pipeline to the GPU (use "mps" on Apple devices)
pipe = DiffusionPipeline.from_pretrained("Skywork/SkyReels-V1-Hunyuan-I2V", dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")
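
The model table on this card lists a resolution of 544 by 960 pixels for this checkpoint. Input images with a different aspect ratio may need cropping first; the card does not say how the pipeline handles them, so the following is only a minimal sketch. It assumes a landscape target of 960 wide by 544 high, and `center_crop_box` is a hypothetical helper, not part of Diffusers.

```python
def center_crop_box(width, height, target_w=960, target_h=544):
    """Return (left, top, right, bottom) of the largest centered box
    whose aspect ratio matches target_w:target_h.

    The 960x544 default is an assumption taken from the model table.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # Compare aspect ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Image is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * target_w // target_h
    else:
        # Image is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * target_h // target_w

    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


box_hd = center_crop_box(1920, 1080)
box_square = center_crop_box(1000, 1000)
box_exact = center_crop_box(960, 544)
```

Since `load_image` returns a Pillow image, something like `image.crop(center_crop_box(*image.size)).resize((960, 544))` could be applied before calling the pipeline.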

Skyreels V1: Human-Centric Video Foundation Model

🌐 GitHub · 👋 Playground · 💬 Discord


This repo contains Diffusers-format model weights for the SkyReels V1 Image-to-Video models. You can find the inference code in our GitHub repository, SkyReels-V1.

Introduction

SkyReels V1 is the first and most advanced open-source human-centric video foundation model. By fine-tuning HunyuanVideo on the order of 10 million (O(10M)) high-quality film and television clips, SkyReels V1 offers three key advantages:

  1. Open-Source Leadership: Our Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
  2. Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
  3. Cinematic Lighting and Aesthetics: Because the model is trained on high-quality, Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

🔑 Key Features

1. Self-Developed Data Cleaning and Annotation Pipeline

Our model is built on a self-developed data cleaning and annotation pipeline, which we use to produce a large dataset of high-quality film, television, and documentary content.

  • Expression Classification: Categorizes human facial expressions into 33 distinct types.
  • Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
  • Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
  • Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.

2. Multi-Stage Image-to-Video Pretraining

Our multi-stage pretraining pipeline, inspired by the HunyuanVideo design, consists of the following stages:

  • Stage 1: Model Domain Transfer Pretraining: We use a large dataset of film and television content (O(10M) clips) to adapt the text-to-video model to the human-centric video domain.
  • Stage 2: Image-to-Video Model Pretraining: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. This new model is then pretrained on the same dataset used in Stage 1.
  • Stage 3: High-Quality Fine-Tuning: We fine-tune the image-to-video model on a high-quality subset of the original dataset to further improve output quality.
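
Stage 2 is described only as "adjusting the conv-in parameters". One common way to do this, assumed here and not confirmed by this card, is to widen the input layer so it accepts extra conditioning channels and to zero-initialize the new weights, so the widened layer initially reproduces the text-to-video layer's output. The toy sketch below uses a per-position channel projection (equivalent to a 1×1 convolution) in plain Python. `widen_input_projection` and `project` are hypothetical names, and the channel counts and values are illustrative only.

```python
def widen_input_projection(weight, extra_in_channels):
    """Append zero-initialized input channels to a projection weight.

    `weight` has one row per output channel and one column per input channel.
    """
    return [list(row) + [0.0] * extra_in_channels for row in weight]


def project(weight, channels):
    """Apply the projection to the channel vector at one spatial position."""
    if any(len(row) != len(channels) for row in weight):
        raise ValueError("channel count does not match the weight")
    return [sum(w * x for w, x in zip(row, channels)) for row in weight]


# Toy text-to-video input layer: 2 input channels -> 3 output channels.
t2v_weight = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.5]]
video_latent = [4.0, 2.0]
image_latent = [9.0, -7.0]  # hypothetical image-conditioning channels

# Widen the layer so it accepts video and image channels together.
i2v_weight = widen_input_projection(t2v_weight, extra_in_channels=len(image_latent))

before = project(t2v_weight, video_latent)
after = project(i2v_weight, video_latent + image_latent)
```

Because the added columns are zero, `after` equals `before` at initialization; under this assumption, continued pretraining is what teaches the layer to use the new channels.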

Model Introduction

| Model Name | Resolution | Video Length (frames) | FPS | Download Link |
|---|---|---|---|---|
| SkyReels-V1-Hunyuan-I2V (Current) | 544 × 960 px | 97 | 24 | 🤗 Download |
| SkyReels-V1-Hunyuan-T2V | 544 × 960 px | 97 | 24 | 🤗 Download |
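
As a quick check on the table, and assuming the video length column counts frames, a clip's duration follows from the frame count and the FPS. The sketch also computes a latent frame count under the assumption of 4× temporal compression with the first frame encoded on its own, which is how HunyuanVideo-style VAEs are usually described; this card does not state that factor, so treat it as an assumption.

```python
NUM_FRAMES = 97  # "Video Length" from the table, assumed to be a frame count
FPS = 24         # from the table


def clip_duration_seconds(num_frames, fps):
    """Duration of a clip with num_frames frames played at fps."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


def latent_frame_count(num_frames, temporal_compression=4):
    """Latent frames, assuming the first frame is encoded alone and the
    rest are grouped in blocks of `temporal_compression` (an assumption)."""
    if (num_frames - 1) % temporal_compression != 0:
        raise ValueError("num_frames must have the form k * compression + 1")
    return (num_frames - 1) // temporal_compression + 1


duration = clip_duration_seconds(NUM_FRAMES, FPS)  # about 4.04 seconds
latent_frames = latent_frame_count(NUM_FRAMES)     # 25 under the assumption above
```

Under these assumptions, both listed models produce clips of about four seconds.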

Usage

See the Guide for details.

Citation

@misc{SkyReelsV1,
  author = {SkyReels-AI},
  title = {Skyreels V1: Human-Centric Video Foundation Model},
  year = {2025},
  publisher = {Huggingface},
  journal = {Huggingface repository},
  howpublished = {\url{https://huggingface.co/Skywork/Skyreels-V1-Hunyuan-I2V}}
}