Comparing Image to Video models in ComfyUI: SkyReels V1 vs Wan 2.1

This week has seen a lot of action in the GenAI video space with the release of multiple state-of-the-art models. In this post, I compare two of the most promising open-source video models: SkyReels V1 and Wan 2.1. Both are trained for Text to Video and Image to Video.

All my results in the video above are for Image to Video and, with the exception of the Wan-Ballerina, are first- or second-shot generations. For some reason, Wan kept adding disco lights to that clip.

When making the infographic, I tried to be as scientific as possible and compared videos of the same length. This does not do full justice to Wan, which could easily produce longer clips. SkyReels, on the other hand, struggled to make coherent videos that were any longer.

From a quality standpoint, I felt the two models were comparable when it came to generating realistic people, but Wan felt better for other styles.

I ran all my tests on an H100 using basic workflows for both models. In terms of hardware requirements and speed, SkyReels performed better. To put things in numbers:

  • SkyReels: 2.73 seconds per frame with 31GB of VRAM

  • Wan: 2.96 seconds per frame with 37GB of VRAM
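To get a feel for what these per-frame numbers mean in practice, here is a rough back-of-envelope estimate of total generation time per clip. The clip length (97 frames, roughly 4 seconds at 24 fps) is an assumption for illustration, not a figure from the benchmark itself:

```python
# Estimate total generation time per clip from the measured
# seconds-per-frame numbers (H100, basic workflows).
SECONDS_PER_FRAME = {"SkyReels V1": 2.73, "Wan 2.1": 2.96}

# Hypothetical clip length: 97 frames ~= 4 seconds at 24 fps.
NUM_FRAMES = 97

for model, spf in SECONDS_PER_FRAME.items():
    total_seconds = spf * NUM_FRAMES
    print(f"{model}: {total_seconds / 60:.1f} minutes for {NUM_FRAMES} frames")
```

In other words, at these speeds a few seconds of video takes a few minutes to generate, with SkyReels coming out slightly ahead.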

If you want to try out these models inside ComfyUI, or access them via an API, we’ve set up some ready-to-use templates on ViewComfy. (Click on “deploy a workflow” at the top right when you get there. If the workflow is not already loaded when you open Comfy, you can drop in the ones linked below.)

The workflows I used for testing:

Original model repos:
