`class diffusers.Kandinsky5T2VPipeline`

`( transformer: Kandinsky5Transformer3DModel, vae: AutoencoderKLHunyuanVideo, text_encoder: Qwen2_5_VLForConditionalGeneration, tokenizer: Qwen2VLProcessor, text_encoder_2: CLIPTextModel, tokenizer_2: CLIPTokenizer, scheduler: FlowMatchEulerDiscreteScheduler )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L107

Pipeline for text-to-video generation using Kandinsky 5.0.

This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

Parameters:

- **transformer** (`Kandinsky5Transformer3DModel`) -- Conditional Transformer to denoise the encoded video latents.
- **vae** ([AutoencoderKLHunyuanVideo](/docs/diffusers/main/en/api/models/autoencoder_kl_hunyuan_video#diffusers.AutoencoderKLHunyuanVideo)) -- Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- **text_encoder** (`Qwen2_5_VLForConditionalGeneration`) -- Frozen text encoder (Qwen2.5-VL).
- **tokenizer** (`Qwen2VLProcessor`) -- Tokenizer for Qwen2.5-VL.
- **text_encoder_2** (`CLIPTextModel`) -- Frozen CLIP text encoder.
- **tokenizer_2** (`CLIPTokenizer`) -- Tokenizer for CLIP.
- **scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) -- A scheduler to be used in combination with `transformer` to denoise the encoded video latents.

`__call__`

`( prompt: typing.Union[str, typing.List[str]] = None, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, height: int = 512, width: int = 768, num_frames: int = 121, num_inference_steps: int = 50, guidance_scale: float = 5.0, num_videos_per_prompt: typing.Optional[int] = 1, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.Tensor] = None, prompt_embeds_qwen: typing.Optional[torch.Tensor] = None, prompt_embeds_clip: typing.Optional[torch.Tensor] = None, negative_prompt_embeds_qwen: typing.Optional[torch.Tensor] = None, negative_prompt_embeds_clip: typing.Optional[torch.Tensor] = None, prompt_cu_seqlens: typing.Optional[torch.Tensor] = None, negative_prompt_cu_seqlens: typing.Optional[torch.Tensor] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None, callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'], max_sequence_length: int = 512, **kwargs )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L615

The call function to the pipeline for generation.

Parameters:

- **prompt** (`str` or `List[str]`, *optional*) -- The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds_qwen` and `prompt_embeds_clip` instead.
- **negative_prompt** (`str` or `List[str]`, *optional*) -- The prompt or prompts to avoid during video generation. If not defined, pass `negative_prompt_embeds_qwen` and `negative_prompt_embeds_clip` instead. Ignored when not using guidance (`guidance_scale` < `1`).
- **height** (`int`, defaults to `512`) -- The height in pixels of the generated video.
- **width** (`int`, defaults to `768`) -- The width in pixels of the generated video.
- **num_frames** (`int`, defaults to `121`) -- The number of frames in the generated video.
- **num_inference_steps** (`int`, defaults to `50`) -- The number of denoising steps.
- **guidance_scale** (`float`, defaults to `5.0`) -- Guidance scale as defined in classifier-free guidance.
- **num_videos_per_prompt** (`int`, *optional*, defaults to `1`) -- The number of videos to generate per prompt.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- A torch generator to make generation deterministic.
- **latents** (`torch.Tensor`, *optional*) -- Pre-generated noisy latents.
- **prompt_embeds_qwen** (`torch.Tensor`, *optional*) -- Pre-computed Qwen prompt embeddings.
- **prompt_embeds_clip** (`torch.Tensor`, *optional*) -- Pre-computed CLIP prompt embeddings.
- **negative_prompt_embeds_qwen** (`torch.Tensor`, *optional*) -- Pre-computed Qwen negative prompt embeddings.
- **negative_prompt_embeds_clip** (`torch.Tensor`, *optional*) -- Pre-computed CLIP negative prompt embeddings.
- **prompt_cu_seqlens** (`torch.Tensor`, *optional*) -- Pre-computed cumulative sequence lengths for the Qwen positive prompt.
- **negative_prompt_cu_seqlens** (`torch.Tensor`, *optional*) -- Pre-computed cumulative sequence lengths for the Qwen negative prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) -- The output format of the generated video.
- **return_dict** (`bool`, *optional*, defaults to `True`) -- Whether or not to return a `KandinskyPipelineOutput`.
- **callback_on_step_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*) -- A function that is called at the end of each denoising step.
- **callback_on_step_end_tensor_inputs** (`List`, *optional*, defaults to `["latents"]`) -- The list of tensor inputs for the `callback_on_step_end` function.
- **max_sequence_length** (`int`, defaults to `512`) -- The maximum sequence length for text encoding.

Returns:

`KandinskyPipelineOutput` or `tuple` -- If `return_dict` is `True`, a `KandinskyPipelineOutput` is returned; otherwise a `tuple` is returned whose first element is a list with the generated video frames.

Examples:

```python
>>> import torch
>>> from diffusers import Kandinsky5T2VPipeline
>>> from diffusers.utils import export_to_video

>>> # Available models:
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers
>>> # ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers

>>> model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
>>> pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe = pipe.to("cuda")

>>> prompt = "A cat and a dog baking a cake together in a kitchen."
>>> negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

>>> output = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=512,
...     width=768,
...     num_frames=121,
...     num_inference_steps=50,
...     guidance_scale=5.0,
... ).frames[0]

>>> export_to_video(output, "output.mp4", fps=24, quality=9)
```
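The `callback_on_step_end` hook can be used to observe the denoising loop. The sketch below assumes this pipeline follows the convention used by other diffusers pipelines, where the callback is invoked as `callback(pipe, step, timestep, callback_kwargs)` and is expected to return the `callback_kwargs` dictionary. `make_step_logger`, `on_step_end`, and `history` are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple


def make_step_logger() -> Tuple[Callable[..., Dict[str, Any]], List[Dict[str, Any]]]:
    """Build a callback that records which tensors were exposed at each step."""
    history: List[Dict[str, Any]] = []

    def on_step_end(
        pipe: Any, step: int, timestep: Any, callback_kwargs: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Record the step index, the timestep, and the names of the tensors
        # requested through `callback_on_step_end_tensor_inputs`.
        history.append(
            {"step": step, "timestep": timestep, "tensors": sorted(callback_kwargs)}
        )
        # Return the dictionary unchanged so the pipeline keeps its own latents.
        return callback_kwargs

    return on_step_end, history


on_step_end, history = make_step_logger()

# Assumed usage with the pipeline from the example above (not executed here):
# pipe(
#     prompt=prompt,
#     callback_on_step_end=on_step_end,
#     callback_on_step_end_tensor_inputs=["latents"],
# )
```

Returning `callback_kwargs` untouched keeps the hook read-only; a callback that wants to alter generation would modify the entries it returns instead.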

`check_inputs`

`( prompt, negative_prompt, height, width, prompt_embeds_qwen = None, prompt_embeds_clip = None, negative_prompt_embeds_qwen = None, negative_prompt_embeds_clip = None, prompt_cu_seqlens = None, negative_prompt_cu_seqlens = None, callback_on_step_end_tensor_inputs = None )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L446

Validate input parameters for the pipeline.

Parameters:

- **prompt** -- Input prompt.
- **negative_prompt** -- Negative prompt for guidance.
- **height** -- Video height.
- **width** -- Video width.
- **prompt_embeds_qwen** -- Pre-computed Qwen prompt embeddings.
- **prompt_embeds_clip** -- Pre-computed CLIP prompt embeddings.
- **negative_prompt_embeds_qwen** -- Pre-computed Qwen negative prompt embeddings.
- **negative_prompt_embeds_clip** -- Pre-computed CLIP negative prompt embeddings.
- **prompt_cu_seqlens** -- Pre-computed cumulative sequence lengths for the Qwen positive prompt.
- **negative_prompt_cu_seqlens** -- Pre-computed cumulative sequence lengths for the Qwen negative prompt.
- **callback_on_step_end_tensor_inputs** -- Callback tensor inputs.

Raises:

- `ValueError` -- If inputs are invalid.

`encode_prompt`

`( prompt: typing.Union[str, typing.List[str]], num_videos_per_prompt: int = 1, max_sequence_length: int = 512, device: typing.Optional[torch.device] = None, dtype: typing.Optional[torch.dtype] = None )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L353

Encodes a single prompt (positive or negative) into text encoder hidden states.

This method combines embeddings from both the Qwen2.5-VL and CLIP text encoders to create comprehensive text representations for video generation.

Parameters:

- **prompt** (`str` or `List[str]`) -- Prompt to be encoded.
- **num_videos_per_prompt** (`int`, *optional*, defaults to `1`) -- Number of videos to generate per prompt.
- **max_sequence_length** (`int`, *optional*, defaults to `512`) -- Maximum sequence length for text encoding.
- **device** (`torch.device`, *optional*) -- Torch device.
- **dtype** (`torch.dtype`, *optional*) -- Torch dtype.

Returns:

`Tuple[torch.Tensor, torch.Tensor, torch.Tensor]`

- Qwen text embeddings of shape `(batch_size * num_videos_per_prompt, sequence_length, embedding_dim)`
- CLIP pooled embeddings of shape `(batch_size * num_videos_per_prompt, clip_embedding_dim)`
- Cumulative sequence lengths (`cu_seqlens`) for the Qwen embeddings, of shape `(batch_size * num_videos_per_prompt + 1,)`
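The `+ 1` in the `cu_seqlens` shape is easiest to see with a small example. The sketch below assumes `cu_seqlens` follows the common convention of a zero-prefixed running sum of per-prompt token counts. It uses plain Python lists rather than tensors, and `cumulative_seqlens`, `split_packed`, and the token counts are hypothetical, not taken from the pipeline.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_seqlens(token_counts: Sequence[int]) -> List[int]:
    """Return zero-prefixed running totals of per-prompt token counts."""
    return [0, *accumulate(token_counts)]


def split_packed(tokens: Sequence[str], cu_seqlens: Sequence[int]) -> List[List[str]]:
    """Recover per-prompt slices from a packed token sequence."""
    # Consecutive pairs (start, end) delimit one prompt each.
    return [list(tokens[start:end]) for start, end in zip(cu_seqlens, cu_seqlens[1:])]


counts = [3, 5, 2]  # hypothetical token counts for a batch of three prompts
cu = cumulative_seqlens(counts)
print(cu)  # [0, 3, 8, 10]
```

A batch of three prompts yields four boundaries, which is why the returned tensor has one more entry than there are encoded prompts.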

`fast_sta_nabla`

`( T: int, H: int, W: int, wT: int = 3, wH: int = 3, wW: int = 3, device = 'cuda' )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L182

Create a sparse temporal attention (STA) mask for efficient video generation.

This method generates a mask that limits attention to nearby frames and spatial positions, reducing computational complexity for video generation.

Parameters:

- **T** (`int`) -- Number of temporal frames.
- **H** (`int`) -- Height in latent space.
- **W** (`int`) -- Width in latent space.
- **wT** (`int`, defaults to `3`) -- Temporal attention window size.
- **wH** (`int`, defaults to `3`) -- Height attention window size.
- **wW** (`int`, defaults to `3`) -- Width attention window size.
- **device** (`str`, defaults to `"cuda"`) -- Device to create the tensor on.

Returns:

`torch.Tensor` -- Sparse attention mask of shape `(T*H*W, T*H*W)`.
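To make "limits attention to nearby frames and spatial positions" concrete, here is a dependency-free sketch of a windowed mask. It assumes a position may attend to another when their distance along each axis is at most half the window size, and that positions are flattened in `(t, h, w)` order. It is an illustration of the idea, not the library's implementation, and `windowed_mask` is a hypothetical name.

```python
from itertools import product
from typing import List


def windowed_mask(
    T: int, H: int, W: int, wT: int = 3, wH: int = 3, wW: int = 3
) -> List[List[bool]]:
    """Boolean mask of shape (T*H*W, T*H*W); True means the query row may attend to the key column."""
    # product() iterates t slowest and w fastest, so index = (t * H + h) * W + w.
    positions = list(product(range(T), range(H), range(W)))
    half_t, half_h, half_w = wT // 2, wH // 2, wW // 2
    return [
        [
            abs(qt - kt) <= half_t and abs(qh - kh) <= half_h and abs(qw - kw) <= half_w
            for (kt, kh, kw) in positions
        ]
        for (qt, qh, qw) in positions
    ]


mask = windowed_mask(T=4, H=1, W=1)
# With a temporal window of 3, frame 0 can see frames 0 and 1 only.
print(mask[0])  # [True, True, False, False]
```

The number of allowed keys per query is bounded by the window volume rather than by `T*H*W`, which is where the savings over dense attention would come from.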

`get_sparse_params`

`( sample, device )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L217

Generate sparse attention parameters for the transformer based on sample dimensions.

This method computes the sparse attention configuration needed for efficient video processing in the transformer model.

Parameters:

- **sample** (`torch.Tensor`) -- Input sample tensor.
- **device** (`torch.device`) -- Device to place tensors on.

Returns:

`Dict` -- Dictionary containing sparse attention parameters.

`prepare_latents`

`( batch_size: int, num_channels_latents: int = 16, height: int = 480, width: int = 832, num_frames: int = 81, dtype: typing.Optional[torch.dtype] = None, device: typing.Optional[torch.device] = None, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.Tensor] = None )`

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py#L527

Prepare initial latent variables for video generation.

This method creates random noise latents, or uses the provided latents, as the starting point for the denoising process.

Parameters:

- **batch_size** (`int`) -- Number of videos to generate.
- **num_channels_latents** (`int`, defaults to `16`) -- Number of channels in latent space.
- **height** (`int`, defaults to `480`) -- Height of the generated video.
- **width** (`int`, defaults to `832`) -- Width of the generated video.
- **num_frames** (`int`, defaults to `81`) -- Number of frames in the video.
- **dtype** (`torch.dtype`, *optional*) -- Data type for the latents.
- **device** (`torch.device`, *optional*) -- Device to create the latents on.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- Random number generator.
- **latents** (`torch.Tensor`, *optional*) -- Pre-existing latents to use.

Returns:

`torch.Tensor` -- Prepared latent tensor.
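The relationship between pixel-space arguments and latent size can be sketched with simple arithmetic. The snippet below is an assumption-laden illustration: the temporal compression factor of 4, the spatial factor of 8, and the axis order of the returned tuple are not stated in this reference and should be checked against the loaded VAE and the pipeline source. `latent_shape` is a hypothetical helper.

```python
from typing import Tuple


def latent_shape(
    batch_size: int,
    num_frames: int,
    height: int,
    width: int,
    num_channels_latents: int = 16,
    temporal_factor: int = 4,  # assumed VAE temporal compression
    spatial_factor: int = 8,   # assumed VAE spatial compression
) -> Tuple[int, int, int, int, int]:
    """Return (batch, latent_frames, latent_height, latent_width, channels)."""
    # The first frame is kept as-is; later frames are grouped by the temporal factor.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (
        batch_size,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
        num_channels_latents,
    )


print(latent_shape(1, num_frames=121, height=512, width=768))  # (1, 31, 64, 96, 16)
```

Under these assumptions, frame counts of the form `4k + 1` (such as 81 or 121) map onto whole latent frames with no remainder.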