why vision tower layers not included in comfyui version?

#5
by shivshankar - opened

why vision tower layers not included?

shivshankar changed discussion title from why vision tower layers not included? to why vision tower layers not included in comfyui version?


Simply because LTX-2 does not use the vision capabilities at all in its pipeline. The ComfyUI version strips out the vision components to save VRAM and disk space, since they're never used. LTX-2 only uses the text-encoding capabilities to generate the embeddings.

Umm, it does. LTX can do image-to-video, and vision can describe an image and be used for image-to-video generation. That's why the Gemma text encoder provided by ComfyUI includes them. Two recent nodes require them, otherwise they throw mismatch errors.

The vision tower layers are not used because LTX-2's architecture doesn't process images through Gemma at all. Image-to-video works by encoding the input image through the VAE and replacing the latent at the first frame. Gemma only ever sees text.

The source code describes the text encoder as "Gemma text encoder implementation with tokenizers, feature extractors, and separate encoders for audio-video and video-only generation." No vision components: https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-core/README.md

The architecture is: Gemma 3 Backbone processes text tokens into embeddings, then a Multi-Layer Feature Extractor aggregates from the decoder layers, then a Text Connector feeds into the DiT. The input is text tokens through decoder layers only. The vision tower is a separate component (SigLIP-based) that is never referenced in this pipeline.
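To make the text-only path concrete, here's a minimal NumPy sketch of that flow: decoder hidden states per layer, a weighted multi-layer aggregation, then a connector projection. The shapes, layer weights, and function names are illustrative assumptions, not the actual LTX-2 implementation; note nothing in this path ever touches image tokens or a vision tower.

```python
import numpy as np

# Hypothetical sizes for illustration; the real model uses Gemma 3's
# actual hidden size and layer count.
num_layers, seq_len, hidden = 4, 8, 16
rng = np.random.default_rng(0)

# Stand-in for the hidden states Gemma's decoder produces for the text
# tokens, one array per decoder layer (no image tokens anywhere).
layer_states = [rng.normal(size=(seq_len, hidden)) for _ in range(num_layers)]

def extract_features(states, layer_weights):
    """Aggregate per-layer hidden states with (here fixed, in reality
    learned) layer weights -- the 'Multi-Layer Feature Extractor' step."""
    stacked = np.stack(states)                  # (layers, seq, hidden)
    w = np.asarray(layer_weights)[:, None, None]
    return (w * stacked).sum(axis=0)            # (seq, hidden)

def text_connector(features, proj):
    """Project aggregated features into the DiT's conditioning space."""
    return features @ proj                       # (seq, dit_dim)

layer_weights = [0.1, 0.2, 0.3, 0.4]            # hypothetical weights
proj = rng.normal(size=(hidden, 32))             # hypothetical projection
embeddings = text_connector(extract_features(layer_states, layer_weights), proj)
print(embeddings.shape)  # (8, 32): text embeddings that condition the DiT
```

The point of the sketch is that the only input is text-token hidden states; a vision tower simply has no place to plug in here.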

Image conditioning in the pipeline uses "Replacing Latents." It encodes the image via the VAE and replaces the latent at a specific frame. Gemma is never involved: https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/README.md
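A toy version of that "replacing latents" step, with made-up shapes and a dummy stand-in for the VAE encoder (the real VAE is a learned network, of course):

```python
import numpy as np

# Hypothetical latent shapes for the sketch: (frames, channels, h, w).
frames, c, h, w = 5, 4, 8, 8
video_latents = np.zeros((frames, c, h, w))

def vae_encode(image):
    """Stand-in for the VAE encoder: maps an image to a latent. Here we
    just average 2x2 pixel blocks for illustration."""
    return image.reshape(c, h, 2, w, 2).mean(axis=(2, 4))

def condition_on_image(latents, image_latent, frame=0):
    """'Replacing latents' conditioning: overwrite the latent at the
    chosen frame with the encoded input image. Gemma never sees the image."""
    out = latents.copy()
    out[frame] = image_latent
    return out

image = np.ones((c, h * 2, w * 2))   # dummy input image
conditioned = condition_on_image(video_latents, vae_encode(image))
print(conditioned[0].mean(), conditioned[1].sum())  # 1.0 0.0
```

The first frame's latent is now the encoded image and the rest are left for the DiT to fill in, which is why image-to-video never needs Gemma's vision tower.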

Regarding the mismatch errors you're seeing, this is a known ComfyUI regression, not an LTX-2 issue. A recent ComfyUI update added multimodal/vision support to their Gemma 3 loader for other models, which broke LTX-2 workflows. The new loader reserves/shifts tokens for vision tasks that LTX-2 doesn't use, corrupting spatial alignment and causing shape mismatches. The fix is to roll back the ComfyUI commit or use the LTXAVTextEncoderLoader node instead: https://github.com/Comfy-Org/ComfyUI/issues/11920

The ComfyUI version strips the vision weights intentionally to save VRAM and disk space. The mismatch errors are caused by ComfyUI's loader trying to account for vision layers that LTX-2 never uses.

If you're still hitting issues, can you share which nodes you're using and your workflow so I can try to replicate it? So far I haven't been able to on my system here, although I haven't used LTX2.3 yet.

[image]
These built-in nodes take an image as input, which requires vision.

I was talking about the image-to-text output from the TextGenerate node, and then using that output as the positive text. It has nothing to do with what LTX does.

Ahh ok thanks for that update, it makes sense now. The TextGenerateLTX2Prompt node uses Gemma for prompt enhancement and can optionally take an image input for image analysis/captioning. That part would use Gemma's vision capabilities. This seems to be new in ComfyUI core nodes and was introduced February 20th this year. It wasn't available when I originally made these text encoders. So that explains the confusion I had regarding the issue.

This is separate from LTX-2's actual generation pipeline. LTX-2 itself only uses Gemma as a text encoder to generate embeddings that condition the DiT. The Gemma model packaged for LTX2 in ComfyUI had the vision weights intentionally removed, so it cannot be used for vision capabilities. If you're connecting an image to that node with the stripped-down Gemma, that would explain the mismatch errors. Maybe the newer LTX2 workflows use a Gemma with vision+text capabilities.

If you need vision-based prompt enhancement, you'd need to load the full Gemma 3 12B with vision weights, not the LTX2-specific text-only version.
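One quick way to tell which variant you have is to inspect the tensor names in the checkpoint: the stripped LTX2 text encoder has no vision-tower keys. The key prefixes below follow the usual transformers naming for Gemma 3, but treat them as assumptions for your specific file; the helper itself is hypothetical.

```python
# Hypothetical helper: decide whether a Gemma checkpoint still carries
# the SigLIP vision tower by inspecting its tensor names.
def has_vision_tower(tensor_names):
    return any(name.startswith(("vision_tower.", "multi_modal_projector."))
               for name in tensor_names)

# With the safetensors library you could feed in real names, e.g.:
#   from safetensors import safe_open
#   with safe_open("gemma.safetensors", framework="pt") as f:
#       names = list(f.keys())
# Dummy key lists standing in for the two checkpoint variants:
text_only = ["language_model.model.layers.0.self_attn.q_proj.weight"]
full = text_only + ["vision_tower.vision_model.embeddings.patch_embedding.weight"]
print(has_vision_tower(text_only), has_vision_tower(full))  # False True
```

If this returns False for your checkpoint, connecting an image to TextGenerateLTX2Prompt with it is expected to fail.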

Since this must be new based on the recent comments, I can look at remaking a new model with vision capabilities. Although keep in mind Gemma itself is rather censored, and with its training data it didn't learn many taboo subjects. So even without the refusals it still won't know a lot of things. I wrote more about this here: https://huggingface.co/DreamFast/gemma-3-12b-it-heretic/discussions/3

Thanks for the update, as it helped make clear what the issue is. I should make a quick note in the README.md about this issue and this node.

Also, NVFP4 is supported now in ComfyUI, created with the comfy kitchen script. Please upload them. I am using gemma_3_12B_it_nvfp4_uncalibrated.safetensors; it saves a lot of memory.

https://huggingface.co/DreamFast/gemma-3-12b-it-heretic-v2 — check out version 2 with vision support and NVFP4. I'll leave this thread open so others can find it more easily.

Tested okay for me here. Let me know how it goes!
