
Conversation

@stduhpf (Contributor) commented Nov 3, 2025

https://github.com/madebyollin/taehv

Model weights:

.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p '一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 “一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。”' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3 --tae ..\ComfyUI\models\vae_approx\taew2_1.pth --vae-conv-direct

output

.\bin\Release\sd-cli.exe -M vid_gen --diffusion-model '..\..\ComfyUI\models\unet\Wan2.2-TI2V-5B-Q8_0.gguf' --t5xxl ..\..\ComfyUI\models\clip\t5\umt5-xxl-encoder-Q8_0.gguf --tae ..\..\ComfyUI\models\vae_approx\taew2_2.pth -p "The woman drops the marker, and then she starts laughing a bit" -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" --cfg-scale 5.0 --sampling-method euler -v -W 768 -H 768 --color --video-frames 49 -i .\image.png --vae-conv-direct --scheduler smoothstep --steps 50 --fps 24 --diffusion-fa

output.mp4

The speedup and memory savings aren't that impressive yet; maybe they can be improved further?

@stduhpf (Contributor, Author) commented Nov 3, 2025

Sorry for the unrelated whitespace changes and the debug spam; I'll fix them later.

@stduhpf (Contributor, Author) commented Nov 3, 2025

Oh a new version of the taew2.1 weights just came out, coincidentally.

[side-by-side comparison images: old weights vs. new weights]

@stduhpf (Contributor, Author) commented Nov 3, 2025

Now TAE decoding of the outputs of Wan 2.1 models (and Wan 2.2 A14B) works in txt2img mode.

Video decoding runs as well, but the results are obviously incorrect (flashing-lights warning).

If someone can see what I'm doing wrong when decoding videos, let me know.

@madebyollin commented Dec 11, 2025

After fixing the three bugs mentioned in review, image results look correct (tested on GH200 with -DSD_CUDA=ON). I didn't check video.

diffs
diff --git a/tae.hpp b/tae.hpp
index ad0bd37..6a7951f 100644
--- a/tae.hpp
+++ b/tae.hpp
@@ -224,7 +224,7 @@ public:
         h      = conv1->forward(ctx, h);
         h      = ggml_relu_inplace(ctx->ggml_ctx, h);
         h      = conv2->forward(ctx, h);
-        h      = ggml_relu_inplace(ctx->ggml_ctx, h);
+        // h      = ggml_relu_inplace(ctx->ggml_ctx, h);
 
         auto skip = x;
         if (has_skip_conv) {
@@ -323,7 +323,7 @@ public:
         for (int i = 0; i < num_layers; i++) {
             for (int j = 0; j < num_blocks; j++) {
                 auto block = std::dynamic_pointer_cast<MemBlock>(blocks[std::to_string(index++)]);
-                auto mem   = ggml_pad(ctx->ggml_ctx, h, 0, 0, 0, 1);
+                auto mem   = ggml_pad_ext(ctx->ggml_ctx, h, 0, 0, 0, 0, 0, 0, 1, 0);
                 mem        = ggml_view_4d(ctx->ggml_ctx, mem, h->ne[0], h->ne[1], h->ne[2], h->ne[3], h->nb[1], h->nb[2], h->nb[3], 0);
                 h          = block->forward(ctx, h, mem);
             }
@@ -341,7 +341,7 @@ public:
         h              = last_conv->forward(ctx, h);
 
         // shape(W, H, 3, T+3) => shape(W, H, 3, T)
-        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 0);
+        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 3*h->nb[3]);
         return h;
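
To explain the last hunk: the causal decoder emits T+3 frames, and the original view kept the first T (so it dropped the trailing 3 real frames); the fix moves the view's byte offset to `3*h->nb[3]`, dropping the 3 leading warm-up frames instead. A plain-C++ sketch of the same indexing idea (a toy helper of mine, not code from tae.hpp):

```cpp
#include <cassert>
#include <vector>

// Trim the warm-up frames a causal video decoder emits before the real output:
// keep frames [warmup, T+warmup) instead of [0, T). This mirrors offsetting the
// ggml view start by warmup * nb[3] bytes rather than truncating the tail.
std::vector<int> trim_warmup_frames(const std::vector<int>& frames, int warmup = 3) {
    assert((int)frames.size() >= warmup);
    return std::vector<int>(frames.begin() + warmup, frames.end());
}
```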

@stduhpf (Contributor, Author) commented Dec 11, 2025

Video is still completely broken, but image decoding works very well now.

@stduhpf stduhpf marked this pull request as ready for review December 13, 2025 21:10
@stduhpf (Contributor, Author) commented Dec 13, 2025

Results for taew2.2 are quite interesting for now.

output.mp4

@madebyollin commented:

Wan 2.2 and Hunyuan 1.5 have 2x2 pixelshuffle on the input/output
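
For reference, 2x2 pixel-unshuffle (space-to-depth) folds each 2x2 spatial patch into 4 channels, so a (C, H, W) tensor becomes (4C, H/2, W/2); the decoder output applies the inverse. A minimal sketch under my own layout assumptions (row-major CHW, not the actual wan.hpp implementation):

```cpp
#include <cassert>
#include <vector>

// 2x2 pixel-unshuffle (space-to-depth): (C, H, W) -> (4C, H/2, W/2),
// row-major CHW layout. Each input channel spawns 4 output channel planes,
// one per position inside the 2x2 patch.
std::vector<float> pixel_unshuffle_2x2(const std::vector<float>& in, int C, int H, int W) {
    assert(H % 2 == 0 && W % 2 == 0 && (int)in.size() == C * H * W);
    int Ho = H / 2, Wo = W / 2;
    std::vector<float> out(in.size());
    for (int c = 0; c < C; c++)
        for (int dy = 0; dy < 2; dy++)
            for (int dx = 0; dx < 2; dx++)
                for (int y = 0; y < Ho; y++)
                    for (int x = 0; x < Wo; x++) {
                        int co = c * 4 + dy * 2 + dx;  // destination channel
                        out[(co * Ho + y) * Wo + x] =
                            in[(c * H + (2 * y + dy)) * W + (2 * x + dx)];
                    }
    return out;
}
```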

@stduhpf (Contributor, Author) commented Dec 13, 2025

@madebyollin Yes, I saw that when looking at the VAE code in wan.hpp; I'm on it.

@stduhpf (Contributor, Author) commented Dec 13, 2025

output.mp4

@stduhpf (Contributor, Author) commented Dec 13, 2025

output.mp4

@madebyollin commented:

The Wan 2.2 TI2V results still look broken. There's a scaling issue on ~L3600 where sd_ctx->sd->process_latent_out(init_latent); and sd_ctx->sd->process_latent_in(init_latent); are incorrectly called even when using TAEW2.2. After fixing that, initial frame results look correctly-scaled but the video deteriorates into gray mush:

output_with_disabled_process_latent.mp4
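
The underlying issue, as I understand it: the full Wan VAE works on latents normalized per channel (via process_latent_in / process_latent_out), while the TAE checkpoints already expect the raw latent range, so also running the normalization on the init latent effectively applies the shift/scale twice. A toy sketch of the double application (shift/scale values are illustrative, not the real Wan latent statistics):

```cpp
#include <cmath>

// Toy per-channel latent normalization, mimicking what process_latent_in
// does conceptually: z_norm = (z - shift) * scale. The shift/scale values
// are illustrative placeholders only.
inline float toy_latent_in(float z, float shift = 0.5f, float scale = 2.0f) {
    return (z - shift) * scale;
}
```

Applying `toy_latent_in` a second time to an already-normalized value produces a different (distorted) number, which is the kind of mis-scaled input the TAE would receive before the fix.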

This gray-mush issue happens with the default VAE on 8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d (before this PR) so fixing it is likely out of scope for this PR.

output_with_official_vae_on_8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d.mp4

@stduhpf (Contributor, Author) commented Dec 15, 2025

@madebyollin Yes, I figured it was probably something like that after noticing how much worse the img2vid results were compared to txt2vid. I get no "gray mush" on my end with this fix, though.

@leejet (Owner) commented Dec 15, 2025

@stduhpf I used taehv and got results very close to those of the Wan VAE. Maybe this PR can be merged now?

@stduhpf (Contributor, Author) commented Dec 15, 2025

I think so too. I haven't tested every possible use case though (for example VACE).
