Creative Exploration - Ultra-fast 4 step SDXL animation | SDXL-Lightning & HotShot in ComfyUI

25 Feb 2024 · 77:21

TLDR: In this video, the host explores AI animation with HotShot and SDXL-Lightning in ComfyUI. With the Lightning LoRA, animations render in just four sampling steps, far fewer than with conventional settings. The video covers depth maps, "dreaming" from empty latent images, and how different prompts shape the creative outcome. The host also notes that some models, such as the Lightning LoRAs, may not be suitable for commercial use because of licensing restrictions. Throughout the stream, the host experiments with different inputs, such as videos of people dancing or hovercrafts, and examines how they affect the resulting animations. Technical topics include how much VRAM the workflow needs and the trade-offs between speed and quality. The host encourages viewers to join the community on Discord for further exploration and support.


  • 🎬 The video demonstrates how to create animations using HotShot with SDXL Lightning in ComfyUI, achieving fast results in just four steps.
  • 🚀 The workflow works best for video-to-video animation driven by input footage and depth maps, but it can also generate animations from empty latent images.
  • 🌟 The animations show impressive quality and consistency and could potentially be used commercially, although the licenses of the Lightning models should be checked first.
  • 📚 The tutorial credits a workflow posted on the Banodoco Discord server, 'vid2vid SDXL 4 steps Lightning LoRA' by Kilner, which is recommended for anyone who wants to follow along.
  • 🧩 The process combines several models, including the DynaVision XL checkpoint and the SDXL-Lightning LoRAs, notably the four-step LoRA.
  • πŸ” The video discusses the use of control nets, like depth and line art control nets, to impose composition and details onto the animations.
  • πŸ”„ The artist experimented with different input footages and settings to see how they affect the animation output, noting the challenges with certain types of motion and noise.
  • 🚧 The video highlights the limitations when using Apple's M1 Max chip due to the lack of CUDA cores, suggesting cloud solutions for better performance.
  • πŸ“‰ The creator shares tips on how to manage VRAM usage effectively and the potential future of VRAM needs in the context of AI and gaming.
  • 🌐 The video mentions using CC0-licensed footage from sites like Pexels and Pixabay as input and stresses checking licenses before use.
  • βš™οΈ The workflow and models used are part of an ongoing exploration in AI animation, with the potential for significant advancements in speed and quality of animations.

Q & A

  • What is the main focus of the video?

    -The video explores creating animations with the HotShot motion model for SDXL, using a four-step workflow called 'vid2vid SDXL 4 steps Lightning LoRA' by Kilner.

  • What is the significance of using depth maps in this workflow?

    -Depth maps are used in this workflow to enforce the structure of the input footage onto the generated animations, which helps in maintaining consistency and quality in the output.

  • Why is the HotShot workflow considered fast for animation creation?

    -It is fast because the SDXL-Lightning LoRA lets the sampler finish in four steps instead of the 25-30 steps other setups typically need, which means quicker iterations and faster animation production.

  • What is the role of DynaVision XL in this process?

    -DynaVision XL is the SDXL checkpoint used in the workflow, and it works well for the animation process shown in the video.

  • What are the licensing considerations for using the Lightning LoRAs?

    -According to the video, the Lightning LoRAs may be under a research license that rules out commercial use, so check their licenses before using them in any commercial work.

  • What is the recommended aspect ratio for the input footage when using this workflow?

    -The recommended resolution is 1024×576, a landscape (16:9) frame that is wider than it is tall, which suits the workflow better than portrait footage.

  • What is the impact of using empty latent images instead of input footage in the animation process?

    -Starting from empty latent images lets the model 'dream' animations without being constrained by input footage, which can lead to more creative and less artifacted results.

  • How does AnimateDiff differ from HotShot in terms of motion context?

    -AnimateDiff uses a 16-frame motion context window, whereas HotShot works with an 8-frame window, so AnimateDiff can produce smoother, more gradual transitions between motions.

  • What is the significance of ControlNet in the animation process?

    -ControlNet imposes a composition on the animation: separate ControlNet models, each trained to follow a type of conditioning image such as a depth map or line art, guide the generation.

  • What are some of the challenges of running this workflow on a MacBook?

    -The main challenge is that MacBooks have no NVIDIA RTX GPU and therefore no CUDA cores, which makes inference for animation painfully slow. Cloud services are suggested as an alternative for anyone without access to a powerful GPU.

  • What is the potential future of VRAM requirements for AI animation?

    -The hope is that future AI animation can be done with less VRAM as technology advances, allowing for more efficient inference and less power consumption. However, high-end applications may still require significant VRAM.
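The Q&A only gives the 1024×576 target, so footage in other shapes has to be scaled and cropped first. In ComfyUI this is typically done with a resize node. The helper below (`fit_and_crop` is a hypothetical name, not part of the shared workflow) is a minimal sketch of the scale-to-cover plus center-crop arithmetic. It assumes, as with SDXL's VAE, that the target width and height must be multiples of 8.

```python
def fit_and_crop(src_w: int, src_h: int, dst_w: int = 1024, dst_h: int = 576):
    """Hypothetical helper: scale a frame to cover dst_w x dst_h, then center-crop.

    Returns (scaled_w, scaled_h, crop_left, crop_top).
    """
    # SDXL's VAE downsamples by 8, so the target should divide evenly by 8
    if dst_w % 8 or dst_h % 8:
        raise ValueError("target width and height must be multiples of 8")
    # Use the larger scale factor so the frame covers the target on both axes
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = round(src_w * scale)
    scaled_h = round(src_h * scale)
    # Offsets that center the crop window inside the scaled frame
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, left, top


print(fit_and_crop(1920, 1080))  # (1024, 576, 0, 0): 16:9 footage only needs scaling
print(fit_and_crop(1080, 1920))  # (1024, 1820, 0, 622): portrait footage loses most of its height
```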



😀 Introduction to Fast Animations with HotShot

The speaker introduces fast animation with HotShot, a motion model that, combined with the SDXL-Lightning LoRA, generates animations in just four steps. They are excited about the quality and consistency of the results, which are noticeably better than earlier attempts. The workflow works best with input footage and depth maps, but the speaker also plans to experiment with empty latent images to see what the model generates on its own.


πŸ“š Setting Up the Workflow and Crediting Sources

The speaker walks through setting up the animation workflow, which uses the DynaVision XL checkpoint, the four-step Lightning LoRA, and the HotShot motion model run through AnimateDiff. They credit a recent workflow posted on the Banodoco server, 'vid2vid SDXL 4 steps Lightning LoRA' by Kilner, and encourage the audience to join the server for more resources. The speaker also discusses adding IPAdapters to the model chain for further customization.


🎭 Experimenting with Different Input Footage

The speaker experiments with different input footage to see how the model handles various scenarios, including dancing figures and non-human subjects like car races and hovercrafts. They discuss the interesting animations these inputs can produce and the limitations they run into, such as the model mishandling certain prompts and the need for a suitable depth ControlNet for non-human subjects.


πŸ€” Troubleshooting and Testing the Model

The speaker investigates why the model produces certain unexpected results, such as a character's shirt disappearing. They consider several factors, including the CFG scale and the balance between positive and negative prompts. The speaker also tries starting from empty latent images instead of input footage to see whether the model's 'dream' becomes more robust or interesting.


πŸš€ Speed and Efficiency of Hot Shot Animations

The speaker emphasizes the speed at which Hot Shot can produce animations, noting that it takes significantly fewer steps compared to other models like animate LCM and sd15. They discuss the importance of fast turnaround in AI animation and share their excitement about the potential of using Hot Shot for quick and efficient animation creation.


πŸ’» Hardware Requirements and Limitations

The speaker talks about the hardware requirements for running the animation workflow, mentioning the amount of VRAM needed and the limitations when using certain devices like the MacBook due to the lack of CUDA cores. They suggest using online cloud solutions for those with less powerful hardware and share their thoughts on the future of VRAM needs in the context of AI and gaming.


🌟 Adding More Control with Control Nets

The speaker adds more control to the animations with ControlNets, specifically depth and line art ControlNets. They discuss how these ControlNets impose a composition on the animations, which can produce more detailed and structured results.


🎨 Style Exploration and Influence of Prompts

The speaker explores the influence of prompts on the style of the animations, trying out different prompts to achieve a desired aesthetic, such as a dystopian wasteland or an abandoned 80s mall. They discuss the challenges of getting the model to understand and apply the style changes effectively.


πŸ” Analyzing Results and Next Steps

The speaker analyzes the results of the HotShot experiments and compares them with other methods such as AnimateDiff. They note HotShot's limits in control and variety and suggest that AnimateLCM with higher CFG values may suit more nuanced and varied animations. The speaker also teases topics for future streams, including text-to-video animation and other new developments in the field.


πŸ“£ Closing Remarks and Community Engagement

The speaker concludes the session by encouraging the audience to join their community on Discord and Patreon, and to make use of the resources provided. They express gratitude for the audience's time and engagement, and they look forward to future interactions and discussions within the community.




💡ComfyUI

ComfyUI is a node-based graphical interface for building Stable Diffusion image and video generation workflows. It is the platform on which the animations in the video are made, and it is central to the video's theme of creating animations with specific tools and workflows.


💡HotShot

HotShot is a motion model for SDXL used to create animations. It is the primary tool in the video for generating fast animations, and the video shows how, combined with the Lightning LoRA, it produces animations surprisingly quickly in just four steps.


💡SDXL-Lightning

SDXL-Lightning is a distilled version of SDXL, released by ByteDance, that generates images in very few sampling steps (1, 2, 4, or 8). The video uses its four-step LoRA, which is the key to the ultra-fast animation results.
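A rough way to see where the speed comes from is to count UNet forward passes. The sketch below is an idealized estimate that ignores VAE decoding, ControlNet, and model loading. It assumes that a CFG above 1 costs two passes per step (conditional and unconditional) while CFG 1 costs one, and it uses 25 steps at CFG 7 as a hypothetical baseline for a conventional setup.

```python
def denoising_passes(steps: int, cfg: float) -> int:
    """Idealized count of UNet forward passes for one sampling run."""
    # CFG > 1 needs a conditional and an unconditional prediction per step;
    # at CFG 1 the unconditional pass can be skipped.
    passes_per_step = 1 if cfg == 1.0 else 2
    return steps * passes_per_step


lightning = denoising_passes(steps=4, cfg=1.0)  # four-step Lightning LoRA at CFG 1
baseline = denoising_passes(steps=25, cfg=7.0)  # hypothetical conventional settings
print(lightning, baseline, baseline / lightning)  # 4 50 12.5
```

Real speedups will be smaller than this ratio, because the fixed costs left out of the count do not shrink with the step count.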

πŸ’‘VAE Decoder

VAE, or Variational Autoencoder, is a neural network that compresses images into a smaller latent representation and decodes latents back into images. The VAE decoder turns the sampler's output latents into visible frames. In the video, the sampler is also started from empty latent images instead of encoded footage, letting the system 'dream' animations from scratch.
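For reference, the shape of an empty latent batch follows from the SDXL VAE, which uses 4 latent channels and downsamples each side by 8. ComfyUI's Empty Latent Image node fills such a tensor with zeros, and the sampler adds noise on top, so at full denoise the model effectively starts from noise and 'dreams'. The helper below is hypothetical and only computes the shape; treating the frame count as the batch size is an assumption about how the animation batch is laid out.

```python
def empty_latent_shape(frames: int, width: int, height: int,
                       channels: int = 4, factor: int = 8) -> tuple[int, int, int, int]:
    """Hypothetical helper: (frames, channels, latent_height, latent_width)."""
    if width % factor or height % factor:
        raise ValueError(f"width and height must be multiples of {factor}")
    return (frames, channels, height // factor, width // factor)


shape = empty_latent_shape(frames=16, width=1024, height=576)
print(shape)  # (16, 4, 72, 128): the VAE decoder maps each 128x72 latent back to 1024x576
```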

πŸ’‘Depth Maps

Depth maps record how far each part of a frame is from the camera. They are crucial to the workflow described in the video, where depth maps extracted from the input footage guide the animation so that it keeps the footage's structure.


💡ControlNet

ControlNet is a way to impose a composition on generated images, using separate models trained to follow conditioning images such as depth maps, edge maps, or line art. In the video, depth and line art ControlNets provide the animation's structure and help shape its style.

πŸ’‘Animate Diff

AnimateDiff is a motion-module approach to animation that processes batches of frames with a 16-frame context window, longer than HotShot's 8 frames. The video mentions it as an alternative, particularly when more control and detail are required.
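To see what the window size means in practice, the sketch below splits a clip into overlapping windows, which is the general idea behind sliding-context sampling for animations longer than one window. It is a simplified uniform schedule, not the actual implementation used by the ComfyUI AnimateDiff nodes. The function name and the overlap values are hypothetical and were chosen only for illustration.

```python
def context_windows(num_frames: int, context_length: int, overlap: int) -> list[list[int]]:
    """Split frame indices into overlapping windows (simplified sliding schedule)."""
    if overlap >= context_length:
        raise ValueError("overlap must be smaller than context_length")
    if num_frames <= context_length:
        return [list(range(num_frames))]
    stride = context_length - overlap
    windows = []
    start = 0
    while True:
        end = start + context_length
        if end >= num_frames:
            # Shift the last window back so it still holds context_length frames
            windows.append(list(range(num_frames - context_length, num_frames)))
            break
        windows.append(list(range(start, end)))
        start += stride
    return windows


# 32 frames: an AnimateDiff-sized window (16) vs a HotShot-sized window (8)
print(len(context_windows(32, 16, 4)), len(context_windows(32, 8, 2)))  # 3 5
```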

πŸ’‘CFG Scale

CFG, or Classifier-Free Guidance scale, controls how strongly the generated content follows the prompt. The video discusses how the CFG scale affects the quality and variety of the animations produced.
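The standard classifier-free guidance formula explains why CFG 1 ignores the negative prompt, a problem the video runs into. In ComfyUI the "unconditional" prediction is made with the negative prompt, and the guided result is uncond + scale × (cond − uncond). At scale 1 this reduces to the positive-prompt prediction alone. The toy vectors below stand in for real noise predictions and are made up for illustration.

```python
def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [0.25, -0.5, 1.0]   # toy prediction with the positive prompt
uncond = [0.0, 0.25, 0.5]  # toy prediction with the negative prompt

print(cfg_combine(cond, uncond, 1.0))  # [0.25, -0.5, 1.0]: identical to cond
print(cfg_combine(cond, uncond, 7.0))  # [1.75, -5.0, 4.0]: pushed away from uncond
```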

πŸ’‘BananaDoo Server

BananaDoo Server is a Discord server mentioned as a resource for the ComfyUI community, where users can access a vast amount of workflows and resources. It is highlighted as a valuable tool for those looking to expand their knowledge and skills in video animation.


💡VRAM

VRAM, or video RAM, is the memory on the graphics processing unit (GPU), which holds model weights and intermediate data during rendering and inference. The video discusses it in terms of the hardware requirements for running the animation workflows and the limits it places on the process.
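A back-of-the-envelope sketch shows why VRAM matters even before any frames are generated: the weights alone take a fixed amount of memory. The figure below assumes about 2.6 billion parameters for the SDXL UNet, as reported in the SDXL paper, and counts only the UNet weights.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to store model weights (2 bytes/param for fp16, 4 for fp32)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: about 2.6 billion parameters for the SDXL UNet.
unet_fp16 = weight_memory_gib(2.6e9)
unet_fp32 = weight_memory_gib(2.6e9, bytes_per_param=4)
print(f"fp16: {unet_fp16:.1f} GiB, fp32: {unet_fp32:.1f} GiB")  # fp16: 4.8 GiB, fp32: 9.7 GiB
```

The text encoders, the VAE, the HotShot motion module, any ControlNets, and the activations for every frame in the batch all add to this. That is why longer or higher-resolution animations need more VRAM.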


💡Upscaling

Upscaling is the process of increasing the resolution of a video or image. In the video, upscaling is discussed as a technique to improve the quality of the animations, but it also presents challenges such as introducing noise or losing details.


Live demonstration of creating animations using HotShot in ComfyUI with the SDXL-Lightning model.

Introduction of a new workflow called 'vid2vid SDXL 4 steps Lightning LoRA' by Kilner.

Recommendation to join the Banodoco Discord server for more resources and workflows.

Explanation of using depth maps with input footage to enhance animation quality.

Experimentation with empty latent images to generate animations without input footage.

Discussion on the potential of SDXL and its fast animation capabilities.

Technical details on setting up the node structure for the animation process.

Challenges with running at CFG 1, where the negative prompt is ignored.

Demonstration of how to upscale video using specific settings to avoid noise.

Mention of the limitations of using Lightning models for commercial purposes due to licensing.

Creative exploration with non-human subjects like cars and hovercrafts in animation.

Testing different input footage and the impact on the dream-like animation outcome.

Use of the Canny preprocessor for line art, and of ControlNet with line art and depth maps.

Observations on the performance and VRAM usage during the animation process.

Comparison of HotShot, AnimateDiff, and SVD (Stable Video Diffusion) models.

Tips for achieving better results with upscaling and handling noise in animations.

The potential of using text-to-video without the need for input footage for animations.

Invitation to join weekly hangouts on Discord for community collaboration and support.