---
license: apache-2.0
language:
- en
base_model:
- tencent/HunyuanVideo
pipeline_tag: text-to-video
tags:
- avatar
- talking
- audio
---

# MoCha Demo Implementation

[MoCha](https://congwei1230.github.io/MoCha/) is a pioneering model for **Dialogue-driven Movie Shot Generation**.

| [**🌐Project Page**](https://congwei1230.github.io/MoCha/) | [**📖Paper**](https://arxiv.org/pdf/2503.23307) | [**🔗Github**](https://github.com/congwei1230/MoCha-Demo) | [**🤗Demo**](https://huggingface.co/datasets/CongWei1230/MoCha-Generation-on-MoChaBench-Visualizer) |

This repository provides a demo implementation of MoCha ("Towards Movie-Grade Talking Character Synthesis"), built on top of HunyuanVideo. We fine-tune HunyuanVideo on the Hallo3 dataset. Due to differences in training data, model scale, and training strategy, this demo does not fully reproduce the performance of the original MoCha model, but it reflects the core design and serves as a baseline for further research and study.

This implementation supports two generation modes:

- **st2v**: speech + text → video
- **sti2v**: image + speech + text → video

Check out the [**🔗Github**](https://github.com/congwei1230/MoCha-Demo) for usage.