Speed up ComfyUI Image and Video generation with TeaCache
One of the biggest problems with generating images or video is how slow the process can be. Fortunately, we now have a few good tricks to help speed up generation. In this post, we’ll go over our preferred solution when using ComfyUI: TeaCache combined with model compiling. During testing, we sped up generation by 3X with Flux and 2.8X with Wan2.1, with no loss in quality.
Without going into the details, TeaCache uses clever caching to take advantage of the fact that the output of many of the attention blocks inside diffusion models is very similar to their input, while model compiling speeds up inference by optimizing the model’s code. The great thing is that both techniques work out of the box with any ComfyUI workflow.
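To make the caching idea concrete, here is a minimal, self-contained sketch of the technique (not the actual ComfyUI-TeaCache implementation; the class and helper names are our own). When a block’s input has barely changed since the last refresh, measured by a relative L1 distance against a threshold, the cached output is reused instead of recomputing, for at most a few consecutive steps:

```python
def rel_l1_distance(a, b):
    """Relative L1 distance between two flat activation vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den

class TeaCacheSketch:
    """Toy cache: skip recomputing a block when its input barely changed."""

    def __init__(self, rel_l1_thresh=0.4, max_skip_steps=3):
        self.thresh = rel_l1_thresh      # 0 disables caching entirely
        self.max_skip = max_skip_steps   # cap on consecutive skipped steps
        self.prev_input = None
        self.cached_output = None
        self.skipped = 0

    def __call__(self, block, x):
        if (
            self.prev_input is not None
            and self.skipped < self.max_skip
            and rel_l1_distance(x, self.prev_input) < self.thresh
        ):
            self.skipped += 1            # input close enough: reuse cache
            return self.cached_output
        self.prev_input = list(x)        # otherwise refresh the cache
        self.cached_output = block(x)
        self.skipped = 0
        return self.cached_output
```

In the real node this logic sits inside the diffusion model’s forward pass; here `block` can be any function over a flat list of floats, which is enough to see the skip/refresh behavior.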
Performance improvement
TeaCache really feels like a free lunch and is perfect for speeding up API inference and ComfyUI workloads in general. Model compiling, however, has an important drawback that should be mentioned before we start: the first 2 to 3 generations of every session are much slower while the model compiles. This can make it hard to use effectively when running ComfyUI as an API, unless servers are running for extended periods of time.
We ran a few tests on Flux Dev and Wan2.1 text-to-video to quantify the performance gains:
Flux Dev:

GPU | With no cache | With TeaCache | With TeaCache and Model Compiling |
---|---|---|---|
L4 | 63.65 sec (0.5 it/s) | 33.17 sec (1.03 it/s) | 25.74 sec (1.34 it/s) |
A100-40GB | 13.67 sec (2.29 it/s) | 6.92 sec (4.66 it/s) | 4.89 sec (6.60 it/s) |
H100 | 6.74 sec (4.7 it/s) | 3.49 sec (9.34 it/s) | 2.37 sec (14.38 it/s) |
Wan2.1 text-to-video:

GPU | With no cache | With TeaCache | With TeaCache and Model Compiling |
---|---|---|---|
L4 | 133.33 sec (0.24 it/s) | 62.73 sec (0.54 it/s) | 49.56 sec (0.70 it/s) |
A100-40GB | 34.13 sec (0.94 it/s) | 16.65 sec (2.11 it/s) | 13.4 sec (2.74 it/s) |
H100 | 17.68 sec (1.90 it/s) | 9.11 sec (4.13 it/s) | 7.54 sec (5.35 it/s) |
As you can see, TeaCache on its own roughly halves the generation time with Flux and Wan2.1. And when combined with model compiling, you are looking at 2.5X to 3X speed improvements.
Usage
The TeaCache node from the ComfyUI-TeaCache node pack comes with two parameters.
The first one, rel_l1_thresh, controls how aggressively the cache is reused. At 0, caching is turned off, while at higher values, cached outputs are reused more often and refreshed less often. The other parameter, max_skip_steps, sets the maximum number of consecutive steps that the cache can skip.
The higher those values, the faster the generation will be. That said, we found during testing that at values higher than 0.4 / 3 for Flux and 0.2 / 3 for Wan2.1, generations started losing some detail.
Here are some examples using Flux Dev with different rel_l1_thresh values and max_skip_steps set to 3:
And again with Wan2.1:
As you can see, with Flux, results hardly change between 0 and 0.4 and start losing quality after that, while with Wan, results tend to be a lot more sensitive to the threshold. We found that 0.2 gave good results, as in the example above.
Installation
All you have to do to use these techniques is install this node pack and add the TeaCache and/or Compile Model nodes right after loading the diffusion model (if you are using LoRAs, the TeaCache node goes after the Load LoRA node).
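To illustrate the ordering, here is a hypothetical fragment of a ComfyUI API-format workflow with a LoRA in the chain. The node IDs, file names, and widget values are made up, the sampler’s other inputs are omitted for brevity, and the TeaCache node’s exact field names may differ from the node pack; the point is the wiring, with TeaCache sitting between the LoRA loader and the sampler’s model input:

```json
{
  "1": { "class_type": "UNETLoader",
         "inputs": { "unet_name": "flux1-dev.safetensors", "weight_dtype": "default" } },
  "2": { "class_type": "LoraLoaderModelOnly",
         "inputs": { "model": ["1", 0], "lora_name": "my_lora.safetensors", "strength_model": 1.0 } },
  "3": { "class_type": "TeaCache",
         "inputs": { "model": ["2", 0], "rel_l1_thresh": 0.4, "max_skip_steps": 3 } },
  "4": { "class_type": "KSampler",
         "inputs": { "model": ["3", 0] } }
}
```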
For a quick start, we’ve set up a template with everything you need to get started using TeaCache with Flux on ViewComfy. You can also use the Wan2.1 template and install ComfyUI-TeaCache to skip the model installation and run the model on Cloud GPUs. Both those templates work out of the box with ViewComfy’s serverless API and can easily be integrated into applications.