SD3 Medium Base Model in ComfyUI: Not as Good as Expected – Better to Wait for Fine-Tuned Versions

黎黎原上咩
13 Jun 2024 · 07:39

TLDR: Stability AI's SD3, initially met with high expectations, has faced setbacks including leadership changes and financial struggles. Despite these, it was released as scheduled, featuring photorealism and improved text generation capabilities. The SD3 Medium model, requiring additional CLIP downloads, shows promise but has flaws, particularly in generating human figures. The community awaits fine-tuned versions for better performance, with the future hinging on third-party model adoption.

Takeaways

  • 😀 Stability AI announced the release of SD3, a new major version expected to be widely used.
  • 😔 The company faced leadership changes and financial difficulties, leading to concerns about SD3's future.
  • 📅 SD3 was officially open-sourced and released on June 12th, as scheduled.
  • 🖼️ SD3 showcases excellent photorealistic effects and adherence to complex prompts.
  • 📝 Improvements in text generation are evident, with no artifacts or spelling errors in the examples provided.
  • 🔧 The new architecture used by SD3 is the Multimodal Diffusion Transformer (MMDiT), which contributes to its advantages.
  • 🔗 The official recommendation for using SD3 is through ComfyUI.
  • 📚 Three model files were released, with the smallest being SD3 Medium at 4.34 GB, requiring separate CLIP downloads for ComfyUI.
  • 💻 Users need to upgrade ComfyUI to the latest version for SD3 support (see the update sketch after this list).
  • 🐑 SD3 demonstrates good understanding of spatial relationships and text prompts, as shown in the sheep with a yellow hat example.
  • 😞 However, SD3 has flaws, particularly in generating human figures, which has been a point of complaint.
  • 🔮 The future of SD3 depends on the adoption of third-party models and tools like ControlNet, with the hope for fine-tuned versions soon.
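For the ComfyUI upgrade mentioned above, here is a minimal sketch assuming a git-cloned install; the path is a placeholder, and portable builds instead ship their own bundled update script.

```python
import subprocess

# Minimal sketch: pull the latest ComfyUI so the SD3 nodes and loaders are available.
# COMFYUI is a placeholder path; adjust it to your install.
COMFYUI = "/path/to/ComfyUI"
subprocess.run(["git", "-C", COMFYUI, "pull"], check=True)
```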

Q & A

  • What was the initial anticipation for SD3 based on previous versions?

    -The anticipation was that SD3 would become another widely used major version, like SD 1.5 and SDXL before it.

  • What challenges did Stability AI face leading up to the release of SD3?

    -Stability AI faced several challenges, including the resignation of founder and CEO Emad Mostaque, the departure of the core research team, and funding difficulties stemming from its free, open-source business model, which put the company's financial situation in jeopardy.

  • When was SD3 officially released by Stability AI?

    -SD3 was officially released by Stability AI on June 12th.

  • What are the notable capabilities showcased in the initial images of SD3?

    -The initial images of SD3 showcased its excellent photorealistic effect, adherence to complex prompts involving spatial relationships, compositional elements, actions, and styles, and an evident improvement in text generation without artifacts or spelling errors.

  • What is the Multimodal Diffusion Transformer (MMDiT) and why is it significant for SD3?

    -The Multimodal Diffusion Transformer (MMDiT) is the new architecture used by SD3. It is significant because it is responsible for advantages such as photorealism and prompt adherence.

  • How many model files were released for SD3 and what are their sizes?

    -Three model files were released for SD3: sd3_medium at 4.34 GB, sd3_medium_incl_clips at 5.97 GB, and the largest package, sd3_medium_incl_clips_t5xxlfp8, which bundles the CLIP and T5-XXL (fp8) text encoders.

  • What is the recommended software to use with the released SD3 model?

    -The official recommendation for using the released SD3 model is ComfyUI.

  • What are the hardware requirements for using SD3 in ComfyUI?

    -Users should ensure they have a graphics card with enough VRAM to hold the model and text encoders, as the maximum usage observed was around 15.2 GB, roughly the size of the model plus the three text encoders (see the VRAM check sketch after this Q&A list).

  • What was the outcome when testing SD3's text generation ability with a specific prompt?

    -When testing SD3's text generation ability with the prompt of a sheep with a yellow hat that says 'Mimi', SD3 correctly wrote the nickname on the hat, demonstrating its text generation capability.

  • What are some of the performance issues reported with SD3?

    -There have been complaints about SD3's poor performance in generating human figures, with the results being described as scary or broken, even with different seeds.

  • What is the future outlook for SD3 and what factors will influence it?

    -The future of SD3 depends on the adoption and speed of third-party models, fine-tuning, and developments in add-ons like LoRA and ControlNet.

  • Why can't the Pony series model author adapt to SD3?

    -The Pony series model author, AstraliteHeart, confirmed that they cannot adapt their models to SD3 due to license issues.
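The 15.2 GB peak mentioned in the hardware answer above is only the figure observed in the video, not an official requirement. As a minimal sketch of how one might check whether a card can hold the model plus all three text encoders before enabling T5, assuming PyTorch is installed and treating the thresholds below as rough estimates:

```python
import torch

# Rough, assumed thresholds: 15.2 GB is the peak the video reports with T5 enabled;
# the ~6 GB figure for running without T5 is an assumption, not from the video.
PEAK_WITH_T5_GB = 15.2
PEAK_WITHOUT_T5_GB = 6.0

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= PEAK_WITH_T5_GB:
        print(f"{total_gb:.1f} GB VRAM: enabling the T5-XXL text encoder should be fine.")
    elif total_gb >= PEAK_WITHOUT_T5_GB:
        print(f"{total_gb:.1f} GB VRAM: consider leaving T5 disabled, as the video suggests for low-VRAM cards.")
    else:
        print(f"{total_gb:.1f} GB VRAM: SD3 Medium will likely need offloading or a lighter setup.")
else:
    print("No CUDA device detected.")
```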

Outlines

00:00

🚀 Launch of Stability AI's SD3 Model

Stability AI faced significant setbacks with the resignation of its CEO and core research team, leading to funding issues and doubts about whether SD3 would ship. The company nevertheless overcame these challenges and officially launched SD3 on June 12th as promised. The new model offers enhanced photorealism, improved prompt adherence, and better text generation, built on the Multimodal Diffusion Transformer (MMDiT) architecture that is responsible for these advances. The video walks viewers through downloading and installing SD3 in ComfyUI, compares its image quality to Midjourney (MJ), and discusses the hardware requirements. It also covers the different model files available, the smallest being SD3 Medium at 4.34 GB, which requires separately downloaded CLIP text encoders for use in ComfyUI.
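As a rough sketch of the download step described above, the snippet below uses the huggingface_hub library to fetch the smallest checkpoint and the three text encoders into a ComfyUI install. The repository ID, file names, and folder layout are assumptions based on the standard SD3 Medium release (which is gated, so a Hugging Face token and license acceptance are needed); verify them against the official model card before running.

```python
from huggingface_hub import hf_hub_download

REPO = "stabilityai/stable-diffusion-3-medium"  # assumed repo ID
COMFYUI = "/path/to/ComfyUI"                    # adjust to your install

# The bare 4.34 GB checkpoint goes into models/checkpoints.
hf_hub_download(REPO, "sd3_medium.safetensors",
                local_dir=f"{COMFYUI}/models/checkpoints")

# The three text encoders (CLIP-L, CLIP-G, T5-XXL fp8) go into models/clip.
# Note: hf_hub_download keeps the repo's text_encoders/ subfolder under local_dir,
# so move the files up into models/clip afterwards if ComfyUI does not find them.
for name in ("clip_l.safetensors", "clip_g.safetensors", "t5xxl_fp8_e4m3fn.safetensors"):
    hf_hub_download(REPO, f"text_encoders/{name}",
                    local_dir=f"{COMFYUI}/models/clip")
```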

05:09

🐑 Testing SD3's Features and Limitations

The video continues with a hands-on demonstration of SD3's capabilities, including its text generation feature, which successfully produced an image of a sheep with a hat labeled 'Mimi'. It also highlights SD3's limitations, particularly in generating human figures, which yielded unsatisfactory images even after multiple attempts with different seeds. The script also touches on quirks of the workflow, such as how negative prompts are handled and how the CLIP text encoding ties into SD3's three separate prompt fields. Despite these flaws, the video expresses hope for future fine-tuned versions of SD3 and stresses the importance of third-party models, LoRA, and ControlNet support for its development. It concludes with a note of caution about licensing issues that prevent certain models, such as the Pony series, from being adapted to SD3.
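For readers who want to reproduce the sheep test outside of ComfyUI, here is a minimal sketch using the diffusers library's StableDiffusion3Pipeline. The model ID, step count, and guidance scale are assumptions taken from common SD3 Medium examples, not settings shown in the video, and a recent diffusers release plus a Hugging Face token are required.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Assumed diffusers-format repo ID for SD3 Medium (gated on Hugging Face).
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a sheep wearing a yellow hat, the hat says 'Mimi'",  # the video's text test
    negative_prompt="blurry, low quality",   # illustrative negative prompt, not the video's
    num_inference_steps=28,                  # assumed defaults
    guidance_scale=7.0,
).images[0]

image.save("sd3_sheep.png")
```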

Keywords

💡SD3

SD3 refers to a major version update of an AI model, which is expected to bring significant improvements over its predecessors. In the video, it is mentioned as being highly anticipated but has faced setbacks such as the resignation of key team members and funding difficulties. The script discusses the release and features of SD3, indicating its importance to the video's theme of AI advancements.

💡ComfyUI

ComfyUI is the user interface recommended for using the SD3 model. It is mentioned as the platform through which the video explores the features of SD3, highlighting its usability and the user experience when interacting with the new AI model.

💡Photorealistic effect

The term 'photorealistic effect' is used to describe the quality of images generated by SD3, emphasizing its ability to create realistic visuals. The script provides examples of real-world images that are 'completely photo level realistic,' showcasing this feature of SD3.

💡Prompt adherence

Prompt adherence refers to the AI model's ability to accurately interpret and generate images based on textual descriptions provided to it. The script mentions that SD3 can understand complex prompts involving spatial relationships, compositional elements, actions, and styles, which is crucial for its image generation capabilities.

💡Text generation

Text generation is a feature of SD3 that has seen improvement, as evidenced by the script's mention of text on a flag and graffiti rendered without errors. This ability is important for the model's versatility, letting it render legible text inside generated images rather than just pictorial content.

💡Multimodal Diffusion Transformer (MMDiT)

The Multimodal Diffusion Transformer (MMDiT) is the new architecture used by SD3 and is responsible for its enhanced capabilities. Although the technical details are not discussed in the script, it is mentioned as the underlying technology that enables SD3's advantages.

💡Checkpoints

Checkpoints in the context of the script refer to different versions or stages of the SD3 model. There are three mentioned in the script, each with varying sizes and features, which are essential for understanding the different options available to users when downloading and using SD3.

💡Hardware requirements

Hardware requirements pertain to the system specifications needed to run the SD3 model effectively. The script discusses memory usage and the recommendation to avoid certain features if the user's graphics card has low VRAM, underscoring the importance of adequate hardware for good performance.

💡Fine-tuned versions

Fine-tuned versions suggest customized or optimized iterations of the base SD3 model that may offer better performance or specific features. The script speculates that these versions may become available in the future, indicating a potential for improvement beyond the current release.

💡Third-party models

Third-party models refer to versions of SD3 that may be developed or adapted by other entities outside the original development team. The script suggests that the future success of SD3 could depend on the adoption and speed at which these models are created and integrated.

💡License issues

License issues are mentioned in the context of the inability of a specific model author to adapt their work to SD3 due to legal or regulatory restrictions. This highlights potential challenges in the widespread adoption and customization of AI models like SD3.

Highlights

Stability AI announced the release of SD3 in February, expected to be a major version like SD 1.5 and SDXL.

The company faced leadership and team changes, with the CEO stepping down and the core research team resigning.

Funding difficulties arose due to a free open-source business model, putting the company's financial situation in jeopardy.

SD3 was officially released on June 12th as scheduled, despite the challenges.

SD3 Medium Model showcases excellent photorealistic effects with completely photo-level realism.

The model demonstrates prompt adherence, understanding complex prompts involving spatial relationships and compositional elements.

Text generation in SD3 has improved, with no artifacts or spelling errors in generated text.

The new architecture used by SD3 is the Multimodal Diffusion Transformer (MMDiT), responsible for its advantages.

ComfyUI is the official recommendation for using the SD3 model.

Three checkpoints of the SD3 model were released, with the smallest being 4.34 GB and requiring separate CLIP downloads.

The largest model includes all necessary components, making it the 'Supreme full package.'

ComfyUI was updated to support SD3, and users are advised to upgrade for compatibility.

The SD3 model's VRAM usage peaks at around 15.2 GB, roughly the size of the model plus its three text encoders.

Users with low-VRAM graphics cards are advised not to enable the T5 text encoder, to keep performance acceptable.

SD3 has been criticized for its poor performance in generating human figures.

The model shows an understanding of spatial relationships and prompts, as demonstrated in image generation.

SD3's future depends on how quickly third-party models and tools such as LoRA and ControlNet are adapted to it.

Due to license issues, AstraliteHeart, the author of the Pony series models, has confirmed they cannot adapt them to SD3.