Edit Anything — Experimental LTX-2 Video Editing LoRAs

Heads up. These LoRAs are research experiments. They are far from production-ready and will fail on many inputs. They are released for the community to play with and break, not as a finished tool.

This repository hosts two unrelated training tracks built on top of LTX-2.3 (22B) for video editing:

Edit Anything v1.1 — motion transfer LoRA (two ranks).
Reference video-to-video (Ref V2V) — experimental IC-LoRA + sidecar modules (two builds).

Inference is meant to run through the BFSnodes ComfyUI custom nodes. The Ref V2V builds in particular need them to load the sidecar modules and install the custom branches into the transformer.

1. Edit Anything v1.1 (motion transfer)

Files:

edit_anything_30k_v1.1_motion_transfer_r128.safetensors
edit_anything_30k_v1.1_motion_transfer_r256.safetensors

What it is

v1.1 is not a direct continuation of v1.0. It was trained from scratch in two stages:

Stage 1 — image-only pretraining. ~30 000 image edit pairs. Training a video model on still images is admittedly not ideal, but it was a way to push the editing vocabulary beyond what a small video-only dataset can teach.
Stage 2 — video fine-tune with first_frame_conditioning > 0. This restored the temporal prior and unlocked the motion-transfer behaviour described below.

In theory v1.1 can do the same edits as v1.0, but its temporal consistency may be weaker because so much of stage 1 was trained on still images. Test against v1.0 case by case before assuming v1.1 wins on your task.

Motion transfer

Because stage 2 included first-frame conditioning, you can drive the LoRA into a motion-transfer mode:

Take a guide video.
Replace its first frame with an edited still (insert a new subject, swap an object, etc.). Use a strong image-editing model — Flux Kontext / "Klein" or similar — to prepare it; the quality of this single frame propagates through the whole clip.
Feed the edited frame as the first frame of the input, and the original guide video as the motion source.

The model uses the new first frame as the appearance anchor and copies the motion from the rest of the guide.
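The input assembly described above can be sketched in a few lines. This is only an illustration of the data flow, not the BFSnodes implementation: the function name, the dictionary keys and the assumption that the edited still must match the guide resolution are all hypothetical.

```python
import numpy as np


def build_motion_transfer_inputs(guide: np.ndarray, edited_first: np.ndarray) -> dict:
    """Pair an edited still with the original guide video.

    guide:        (T, H, W, C) frames of the guide video.
    edited_first: (H, W, C) edited version of the guide's first frame.
    The returned keys are hypothetical names, not real node inputs.
    """
    if guide.ndim != 4 or guide.shape[0] < 2:
        raise ValueError("guide must be (T, H, W, C) with at least 2 frames")
    # Assumption: the edited still is expected at the guide's resolution.
    if edited_first.shape != guide.shape[1:]:
        raise ValueError("edited frame must match the guide resolution")
    return {
        "first_frame": edited_first,  # appearance anchor
        "motion_source": guide,       # original guide, left unmodified
    }


# Tiny synthetic example: 8 black frames and a white edited still.
guide = np.zeros((8, 4, 4, 3), dtype=np.uint8)
edited = np.full((4, 4, 3), 255, dtype=np.uint8)
inputs = build_motion_transfer_inputs(guide, edited)
```

Keeping the guide untouched and passing the edited still separately mirrors the description above: appearance comes from one frame, motion from the whole clip.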

Limitations:

Fast or chaotic motion → fails.
Poor blending / artefacts in the first frame propagate everywhere.
Works best when the inserted subject roughly occupies the same region as whatever it replaces.

Prompting

The prompt is just as critical as in v1.0. Describe both the object being replaced and the new one in detail. Example: "Replace the bronze statue on the left with a tall man wearing a navy raincoat and brown boots." Vague prompts produce bad edits.

Which rank to use

Both files come from the same training. v1.1 is the merge of the two stage LoRAs (one per training stage), re-extracted at two ranks with a truncated SVD, which gives the Frobenius-optimal approximation at each rank:

File	Rank	Size	Frobenius retention
`edit_anything_30k_v1.1_motion_transfer_r128.safetensors`	128	1.31 GB	~99.4%
`edit_anything_30k_v1.1_motion_transfer_r256.safetensors`	256	2.62 GB	~99.9%

r256 is closer to the merged source. r128 is normally indistinguishable in practice. Pick whichever fits your workflow.
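For readers unfamiliar with rank re-extraction, here is a minimal NumPy sketch on a toy matrix. It assumes the merge is a plain sum of the two per-stage weight deltas and that "Frobenius retention" means the ratio of Frobenius norms between the truncated and the merged delta. Neither definition is confirmed by this repository, and the function name is hypothetical.

```python
import numpy as np


def truncate_lora_delta(delta: np.ndarray, rank: int):
    """Best rank-`rank` approximation of `delta` in Frobenius norm (Eckart-Young).

    Returns LoRA-style factors (up, down) and the retained fraction of the
    Frobenius norm (assumed definition of "retention").
    """
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    root = np.sqrt(s[:rank])            # split singular values across both factors
    up = u[:, :rank] * root             # (out_features, rank)
    down = root[:, None] * vt[:rank]    # (rank, in_features)
    retention = float(np.sqrt(np.sum(s[:rank] ** 2) / np.sum(s ** 2)))
    return up, down, retention


rng = np.random.default_rng(0)
# Stand-ins for the two per-stage deltas, each of rank 8.
stage1 = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
stage2 = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
merged = stage1 + stage2                # assumed merge rule: plain sum, rank <= 16

up, down, kept = truncate_lora_delta(merged, rank=8)
print(f"rank 8 keeps {kept:.1%} of the Frobenius norm")
```

The file sizes in the table are consistent with how LoRA factors scale: parameter count grows linearly with rank, and 2.62 GB / 1.31 GB = 2.0 matches 256 / 128.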

2. Reference video-to-video (Ref V2V) — experimental

Files (two builds of the same LoRA family — each ships as a (.standard, .module) pair):

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors

What it is

The goal is Add / Replace editing guided by a reference image. Same vibe as Edit Anything v1.0, but with an explicit image as the appearance source instead of relying only on the prompt.

Trained on ~1600 Add / Replace video pairs. Reference-paired video datasets are basically nonexistent, so the dataset had to be built from scratch — that is why the sample count is small. It often fails. This is fully experimental; thousands of training runs went into landing on this LoRA layout, and it is still unclear how much it actually helps.

Architecture — why this LoRA has "modules"

Trained as a conventional IC-LoRA, plus extra projection branches that try to make the reference signal survive across layers:

ref_visual_proj — projects the reference VAE latent into 32 visual memory tokens.
ref_attn — a dedicated cross-attention branch inside each transformer block, reading those tokens.
ref_adaln_proj — a global AdaLN bias derived from the reference (palette / overall look).
role_embedding — an experimental token bias inspired by some of Kijai's tests; whether it actually helps is still unclear.

These extra weights are saved alongside the LoRA in a .module.safetensors sidecar because they are not standard LoRA adapters — the regular ComfyUI LoRA loader can't consume them, so they need a dedicated node.
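To make the idea of "memory tokens read by a cross-attention branch" concrete, here is a toy single-head NumPy sketch. It is not the trained architecture: the linear projector, the weight shapes, the residual add and the `scale` argument are all assumptions chosen for illustration. Only the count of 32 memory tokens comes from the description above.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ref_cross_attention(video_tokens, memory_tokens, wq, wk, wv, scale=1.0):
    """Video tokens (queries) read from reference memory tokens (keys/values)."""
    q = video_tokens @ wq                             # (N, d)
    k = memory_tokens @ wk                            # (M, d)
    v = memory_tokens @ wv                            # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M), rows sum to 1
    return video_tokens + scale * (attn @ v)          # residual add (assumption)


rng = np.random.default_rng(0)
d, n_video, n_memory = 16, 10, 32

# Hypothetical projector: flatten a toy "reference latent" and map it
# linearly to 32 memory tokens of width d.
ref_latent = rng.normal(size=(8, 6, 6))
w_proj = rng.normal(size=(ref_latent.size, n_memory * d)) / np.sqrt(ref_latent.size)
memory = (ref_latent.reshape(-1) @ w_proj).reshape(n_memory, d)

video = rng.normal(size=(n_video, d))
wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = ref_cross_attention(video, memory, wq, wk, wv, scale=1.0)
```

With `scale=0.0` the branch is a no-op and the video tokens pass through unchanged, which is one way to picture why such branches live outside a standard LoRA file: they add new modules rather than low-rank updates to existing weights.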

How to load

File	What it is	Where it goes
`*.standard.safetensors`	LoRA on `attn1` / `attn2` / `ff` only	Standard ComfyUI LoRA loader
`*.module.safetensors`	`role_embedding`, `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` sidecar modules	`LTXVEditAnythingModuleLoader` (BFSnodes)

Both files of a pair must be loaded together — the LoRA was trained against the sidecar adapters and they only make sense as a unit. Do not mix .standard from one build with .module from another.
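A small helper can catch a mismatched pair before anything is loaded. The sketch below relies only on the file naming shown in the file list above; the function names are hypothetical and nothing here touches ComfyUI.

```python
from pathlib import Path

STANDARD_SUFFIX = ".standard.safetensors"
MODULE_SUFFIX = ".module.safetensors"


def split_build_name(filename: str):
    """Return (build_name, kind) where kind is 'standard' or 'module'."""
    name = Path(filename).name
    for kind, suffix in (("standard", STANDARD_SUFFIX), ("module", MODULE_SUFFIX)):
        if name.endswith(suffix):
            return name[: -len(suffix)], kind
    raise ValueError(f"not a Ref V2V file: {name}")


def check_pair(standard_file: str, module_file: str) -> str:
    """Return the shared build name, or raise if the files do not belong together."""
    s_build, s_kind = split_build_name(standard_file)
    m_build, m_kind = split_build_name(module_file)
    if (s_kind, m_kind) != ("standard", "module"):
        raise ValueError("expected a .standard file followed by a .module file")
    if s_build != m_build:
        raise ValueError(f"mixed builds: {s_build!r} vs {m_build!r}")
    return s_build


build = check_pair(
    "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors",
    "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors",
)
```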

Once loaded, the module file is consumed by the 🅛🅣🅧 LTXV Edit Anything Looping Sampler node, which was written specifically to:

Install the ref_attn cross-attention branch on every transformer block.
Inject the AdaLN / role / visual cross-attention conditioning at the correct points in the model.
Sample long videos in overlapping chunks with the conditioning re-applied per chunk.
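The overlapping-chunk idea in the last item can be illustrated with a simple window planner. The parameter names and the fixed-stride scheme are assumptions, not the node's actual settings, and the sketch says nothing about how overlapping frames are blended.

```python
def chunk_windows(total_frames: int, chunk_size: int, overlap: int):
    """Return [start, end) frame windows that cover the video with overlap.

    All three parameters are hypothetical; the real sampler may chunk differently.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, total_frames)
        windows.append((start, end))
        if end == total_frames:
            break
        start += stride
    return windows


# Conditioning would be re-applied once per window.
for start, end in chunk_windows(total_frames=100, chunk_size=40, overlap=8):
    print(f"sample frames {start}..{end - 1}")
```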

Which build to use

ref_adaln_proj-role_embedding — the original training. Only ships the two side-channel modules.
ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj — the continuation. Adds the visual cross-attention branch and its projector on top.

It is genuinely not clear yet whether the extra branches help over the plain LoRA. Both builds are honest experiments. Try both, decide for your own use case, and please share findings.

Reading the layers

For anyone who wants to understand what each layer in the Ref V2V checkpoint does:

lora_layers_reference.md — full tensor inventory of both builds.
lora_layers_impact.md — what each branch contributes at inference and which inference knob (adaln_scale, ref_context_scale, ref_token_scale, ref_start_block, ref_end_block, etc.) maps back to which training default.
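To cross-check those documents against a downloaded file, the tensor names can be read from the safetensors header with the standard library alone: the format starts with an 8-byte little-endian header length followed by a JSON header. The grouping below assumes that branch names appear as substrings of the tensor keys. That is a guess, and the demo tensor names are invented.

```python
import io
import json
import struct
from collections import Counter

BRANCHES = ("ref_visual_proj", "ref_attn", "ref_adaln_proj", "role_embedding")


def read_safetensors_header(fileobj) -> dict:
    """Read tensor entries from a binary file object positioned at byte 0."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))
    header = json.loads(fileobj.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def count_by_branch(tensor_names) -> dict:
    """Count tensors per branch using substring matching (assumed key layout)."""
    counts = Counter()
    for name in tensor_names:
        branch = next((b for b in BRANCHES if b in name), "other")
        counts[branch] += 1
    return dict(counts)


# In-memory demo file with invented tensor names.
demo_header = {
    "__metadata__": {"format": "pt"},
    "blocks.0.ref_attn.to_q.weight": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]},
    "ref_adaln_proj.weight": {"dtype": "F32", "shape": [1], "data_offsets": [4, 8]},
}
payload = json.dumps(demo_header).encode("utf-8")
blob = struct.pack("<Q", len(payload)) + payload + bytes(8)

names = read_safetensors_header(io.BytesIO(blob))
counts = count_by_branch(names)
```

For a real file, pass `open(path, "rb")` instead of the in-memory buffer; only the header is read, so this stays fast even for multi-gigabyte checkpoints.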

Prompt examples

The two LoRAs were trained on very different caption styles. Match the style of whichever LoRA you're using — straying outside the training distribution is the fastest way to get garbage out.

Edit Anything v1.1 — standard editing

The stage-1 dataset uses short imperative captions describing one or two edits. Use the same shape at inference. Examples drawn from the training distribution:

"Replace the stone statue of a man on the left with a young woman in a green dress."
"Add a black labrador retriever sitting beside the woman on the bench."
"Remove the teacher from the classroom."
"Alter the cap's colour from modern black to deep maroon."
"Replace the fresh citrus-green background with a wooden desk."
"Add faint tire tracks across the snow behind the car."
"Add a black statue, a blue camera, a cyan towel, a red guitar and a pink backpack to the lakeside pier."

Tips:

Imperative verbs: Add / Replace / Remove / Alter / Change.
When replacing, describe both the original and the new subject so the model can localise the edit.
Keep captions short and concrete. Long flowery prose hurts.
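These tips are easy to turn into a quick pre-flight check. The sketch below is a hypothetical helper, not part of BFSnodes, and it only encodes two of the tips: the leading imperative verb, and a "with ..." clause for Replace prompts.

```python
EDIT_VERBS = ("Add", "Replace", "Remove", "Alter", "Change")


def check_edit_prompt(prompt: str) -> list:
    """Return a list of human-readable problems; an empty list means no complaints."""
    problems = []
    words = prompt.split()
    if not words or words[0] not in EDIT_VERBS:
        problems.append("start with one of: " + ", ".join(EDIT_VERBS))
    # A Replace prompt should name both the original and the new subject.
    if words and words[0] == "Replace" and " with " not in prompt:
        problems.append("Replace prompts should say what the subject is replaced with")
    return problems


good = check_edit_prompt(
    "Replace the stone statue of a man on the left with a young woman in a green dress."
)
vague = check_edit_prompt("Make it nicer")
half = check_edit_prompt("Replace the statue.")
```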

Edit Anything v1.1 — motion transfer

Workflow:

Pick a guide video.
Edit only the first frame externally (Flux Kontext / "Klein", InstructPix2Pix, etc.) to introduce the new subject in the desired pose and position.
Feed the edited frame as the first frame of the input and the original guide as motion source.
The prompt should describe the inserted subject and the action being preserved.

Examples:

"Replace the standing man holding the umbrella with a woman in a red coat holding the same umbrella, walking across the puddles."
"Add a tabby cat curled up in the armchair while the man in the background keeps reading."
"Replace the runner in the blue jersey with a man wearing a white shirt and grey shorts running along the same path."

Limits: fast or chaotic motion will fail; the inserted subject should occupy roughly the same region/scale as what it replaces.

Reference V2V (Ref V2V) — Add and Replace

These captions are real samples from the ~1600-pair training set. They describe the target scene after the edit in detail. The reference image carries the appearance of the inserted subject; the caption carries position, pose, action, and surrounding context.

Add task (the reference image holds the new subject):

"Add a middle-aged man with curly grey hair, a beard and glasses, wearing a blue quarter-zip sweater, on the right side of the frame, standing in front of a raw cut of meat on a tray."
"Add a light-coloured small boat with dark seats and an outboard motor floating in the water."
"Add an open book filled with colourful pencils in the woman's hands."
"Add a silver metallic bucket on the table in front of the blonde character, with her hands stirring a mixture inside."
"Add two miniature dolls, one blonde and one brunette, dressed in patterned clothing, sitting at a small table with teacups and small white vases on the countertop."

Replace task (the reference image holds the new subject; the caption also describes what is being replaced):

"Replace the standing kangaroo holding the bicycle handlebars with a man wearing a white t-shirt, light brown shorts and a yellow cap, holding the bicycle handlebars."
"Replace the stone statue of a man on the left side with a young woman in a green dress."
"Replace the wooden barrel near the entrance with a large brown leather suitcase."

Tips for Ref V2V:

Describe the inserted subject in full, even though the reference image is the source of truth — the text path drives placement and pose.
For Replace, also describe what is being replaced so the model can match the spatial region.
Keep the inserted subject roughly in the same scale and region as what it replaces.
The captions in the training set typically run ~25–40 words, so aim for that range. Very short captions like "Add a man" are far too sparse and will fail.
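A rough word-count check can flag captions that fall outside that range before a run. The helper name and the three status labels are made up for illustration, and whitespace splitting is only an approximation of how the training captions were counted.

```python
def caption_length_status(caption: str, low: int = 25, high: int = 40):
    """Return (word_count, status), with status in {'sparse', 'ok', 'long'}."""
    count = len(caption.split())  # whitespace split; punctuation stays attached
    if count < low:
        return count, "sparse"
    if count > high:
        return count, "long"
    return count, "ok"


detailed = caption_length_status(
    "Add a middle-aged man with curly grey hair, a beard and glasses, "
    "wearing a blue quarter-zip sweater, on the right side of the frame, "
    "standing in front of a raw cut of meat on a tray."
)
sparse = caption_length_status("Add a man")
```

Treat the result as a hint rather than a gate: some of the training samples quoted above are shorter than the range and were still valid captions.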

ComfyUI nodes

All recommended inference paths run through the BFSnodes custom node set. For now BFSnodes is the only place these nodes live; once they stabilise they may move elsewhere.

Specific nodes used by these LoRAs:

🅛🅣🅧 LTXV Edit Anything Looping Sampler — sampler that injects role / AdaLN / visual cross-attention and handles long videos in chunks.
LTXVEditAnythingModuleLoader — load the *.module.safetensors sidecar.

Status

Released as experimental research artefacts. Expect failures, do not deploy, and please report what works and what doesn't.

Credits

If you use these models — in a project, a demo, a paper, a video, a tweet, a workflow, anything — please credit my work. These checkpoints are the result of weeks of research, dataset building, and training runs, and that effort is what makes any of it usable. Crediting the source is the bare minimum that keeps open research like this sustainable.

Author: Alisson Pereira dos Anjos (@Alissonerdx)

Suggested attribution:

Edit Anything LoRAs by Alisson Pereira dos Anjos (huggingface.co/Alissonerdx/EditAnything).

Links back to this repository are appreciated wherever you publish results.
