Probably the Best Model of 2023 So Far.

Sebastian Kamph
23 Oct 2023 14:16

TLDR: The speaker discusses their experience with a new AI model, Think Diffusion XL, which they find superior to their previous favorite, Juggernaut. They highlight the model's extensive training with over 10,000 hand-captioned images and its ability to generate realistic images. The speaker also shares their process of testing the model with various prompts and compares it to other models, noting its superior color and detail. They encourage users to experiment with the model and share their experiences.


  • 🌟 The speaker has discovered a new favorite AI model that excels in producing realistic images, surpassing their previous favorite, the Juggernaut variants.
  • 🔍 The new model, Think Diffusion XL, has been trained on over 10,000 hand-captioned images, which is significantly more than the average model's training data.
  • 💰 The speaker has been sponsored by the creators of Think Diffusion XL and has been testing the model extensively.
  • 🎨 The model is capable of generating images in various art styles and realism, with a 4K dataset that enhances the quality of the outputs.
  • 📸 The speaker emphasizes the importance of human-tagged training data to improve the accuracy and reduce potential errors in the model's learning process.
  • 🖌️ The speaker demonstrates the model's capabilities by generating images with different prompts, showcasing its versatility and quality.
  • 🕶️ In one example, the speaker successfully generates a detailed image of a woman with sunglasses in a cyberpunk scene, emphasizing the model's ability to handle close-ups and skin textures realistically.
  • 👽 When generating alien and fantasy warrior images, the speaker notes that certain styles, like 'cinematic', can influence the output, sometimes overriding other prompt aspects like color.
  • 👁️ Prompting for specific features, such as 'blue eyes', can lead to more accurate and realistic results compared to generic prompts.
  • 🎨 The speaker suggests using additional tools like 'Automatic 1111' for further refinement of the images to add details and enhance certain aspects of the characters.
  • 🔄 The speaker concludes by comparing Think Diffusion XL to other models, highlighting its less saturated and more realistic output, making it preferable for those seeking a more cinematic and true-to-life style.

Q & A

  • What is the speaker's new favorite model that they discuss in the video?

    -The speaker's new favorite model is Think Diffusion XL, which they mention has been trained further than the Juggernaut variants and has more input images.

  • How does the speaker evaluate the realism of AI-generated images?

    -The speaker evaluates the realism of AI-generated images by looking at close-up details, such as skin texture, and comparing it to human-generated art. They mention that they strive for the most realistic images possible and that Think Diffusion XL has produced images that they would not have guessed were AI-generated.

  • What is the significance of the hand-captioned training images mentioned in the script?

    -The hand-captioned training images are significant because they ensure that the model is trained on accurate and relevant data. Each image has been tagged by hand, which helps the model understand the keywords associated with the images and improves the quality of the generated content based on the prompts provided by users.

  • How does the speaker describe the training data set of Think Diffusion XL compared to the average model?

    -The speaker describes the training data set of Think Diffusion XL as being larger and more detailed than the average model. It includes over 10,000 images, compared to the 1,000 to 2,000 images used by the average model, and features a 4K data set, which is not common among most models.

  • What are some of the unique features of Think Diffusion XL that the speaker highlights?

    -The speaker highlights several unique features of Think Diffusion XL, including its large training data set, the hand-captioned images, the 4K resolution of that data set, and its training across all art styles as well as realism. Additionally, the model does not require a refiner and does not use censored images, which is a plus for creating professional-looking content.

  • How does the speaker demonstrate the versatility of Think Diffusion XL in creating different styles of images?

    -The speaker demonstrates the versatility of Think Diffusion XL by using various prompts to generate images in different styles, such as a woman's close-up portrait in a cyberpunk scene, an alien warrior, and a fantasy warrior in an epic battle. They also experiment with different prompt combinations and show how the model can adjust to produce a range of visual effects.

  • What is the speaker's strategy for improving the quality of AI-generated images?

    -The speaker suggests several strategies for improving the quality of AI-generated images. These include using specific prompts for desired features (like 'blue eyes'), adjusting the clip skip value for more variation, and using other tools like 'Automatic 1111' to refine and add details to the generated images.

  • What are the speaker's thoughts on the use of cinematic style in AI-generated images?

    -The speaker appreciates the use of cinematic style in AI-generated images as it provides a more desaturated and color-graded look, similar to high-production films. They mention that this style can make the images appear more realistic and visually appealing, especially when using prompts that are meant to create a cinematic vibe.

  • How does the speaker address the issue of prompts overriding certain style elements?

    -The speaker notes that sometimes specific prompts can override the style elements that they want to include in the image. For example, they mention that using the 'cinematic' style might override the vibrant colors they are trying to achieve. To fix this, they suggest adjusting the prompts or trying different combinations to get the desired result.

  • What is the speaker's final verdict on Think Diffusion XL compared to their previous favorite models?

    -The speaker concludes that Think Diffusion XL is a very good model and potentially their new favorite, as it provides realistic and high-quality images without an overly saturated or plastic feel that is prevalent in other models. They invite their audience to try it out and share their preferences or recommendations for other models.

  • How does the speaker's experience with Think Diffusion XL compare to the results from other models like Juggernaut and DreamShaper?

    -The speaker mentions that while they have used Juggernaut and other models for different purposes, Think Diffusion XL stands out due to its realistic output and versatility. They note that the comparisons on Think Diffusion's own page show that it is less saturated and glossy than models like SDXL, providing a more muted color palette for realism while still being able to produce vibrant images when the prompt calls for them.



🎨 Introduction to a New AI Model

The speaker introduces a new favorite AI model for generating images, emphasizing its superiority over previous models like the Juggernaut variants. This new model has been trained further, with more input images, and is praised for its realistic image generation capabilities. The speaker discusses the challenges of achieving realism in AI-generated art and highlights the importance of human-tagged training data for accurate model performance. The script mentions specific features of the model, such as its training on over 10,000 hand-captioned images and its 4K training data set. The speaker also discloses being sponsored by the model's creators but asserts that their positive opinion is genuine.


🌌 Exploring Cinematic and Alien Imagery

The speaker delves into the use of the AI model for creating cinematic and alien-themed images. They discuss the impact of different styles on the output, such as the cinematic style which tends to produce more desaturated and color-graded images. The speaker experiments with various prompts, including 'alien warrior close-up portraits' and 'fantasy warrior in epic battle,' to demonstrate the model's versatility. They also touch on the importance of refining prompts and adjusting settings, such as the clip skip value, to achieve better results. The speaker shares their satisfaction with the generated images, noting the realistic skin textures and detailed features like eyes.


🏹 Fine-Tuning Image Prompts and Styles

In this section, the speaker continues to experiment with the AI model, focusing on fine-tuning prompts and styles to achieve desired outcomes. They discuss the process of adding specific details to prompts, such as 'flowing magic light' and 'digital art style,' to enhance the imagery. The speaker also explores the effects of removing certain styles and adjusting settings like the cinematic film still and HDR vibrant color to refine the images. They share their personal preferences for certain styles and settings, and invite the audience to share their own experiences and preferences with the model. The speaker concludes by comparing the new model to others like Juggernaut and DreamShaper, and highlights the Think Diffusion model's ability to produce realistic images without an overly saturated or plastic feel.



💡AI-generated images

AI-generated images are visual content produced by artificial intelligence algorithms from a user's prompt rather than drawn or photographed by hand. In the context of the video, the speaker is discussing their experience with a new AI model that can produce highly realistic images, which they consider an advancement in the field of AI-generated art. The speaker is impressed by the model's ability to create images that closely resemble real-life textures and details, such as skin and hair, which they demonstrate through various examples.


💡Realism

Realism in art refers to the accurate and true-to-life representation of subjects. In the video, the speaker emphasizes their preference for AI models that can produce images with a high degree of realism, which they consider a challenging aspect of AI-generated art. The speaker evaluates the new AI model based on its ability to create realistic images, particularly focusing on the depiction of human features and scenes.

💡Juggernaut variants

Juggernaut variants refer to a set of AI models that the speaker previously favored for their ability to produce high-quality images. In the context of the video, the speaker is comparing these models to a new AI model, which they find to be even better in terms of realism and image quality. The mention of Juggernaut variants serves as a benchmark for the speaker's evaluation of the new AI model.

💡Think Diffusion XL

Think Diffusion XL is the name of the new AI model that the speaker is exploring in the video. It is characterized by its extensive training with over 10,000 hand-captioned images, which allows it to generate a wide range of art styles and realistic images. The speaker has been testing this model thoroughly and is impressed with its capabilities, particularly its ability to produce high-resolution images with a cinematic style.


💡Prompting

Prompting in the context of AI-generated art refers to the process of providing specific keywords or phrases to guide the AI in creating an image. The speaker discusses the importance of effective prompting to achieve desired outcomes, such as specifying certain styles or characteristics. The speaker shares their experience with different prompts and how they can influence the final image produced by the AI model.

💡Cinematic style

Cinematic style in the context of AI-generated images refers to a visual aesthetic that mimics the look and feel of film, often characterized by a more desaturated and color-graded appearance. The speaker appreciates this style for its ability to produce images that look more realistic and high-quality, similar to what one might see in a movie. The speaker also notes that prompting for 'cinematic' can result in a more desaturated look, which they find appealing.
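
One way to picture why a style such as 'cinematic' can override a color keyword: in several Stable Diffusion front ends, a style is simply a text template wrapped around the user's prompt, so the style's own words compete with the prompt's. The sketch below is a minimal illustration of that idea. The preset names, the template wording, and the `apply_style` helper are hypothetical and are not the presets used in the video.

```python
# Hypothetical style presets: the names and wording are illustrative
# assumptions, not the presets shown in the video.
STYLE_TEMPLATES = {
    "none": "{prompt}",
    "cinematic": (
        "cinematic film still, {prompt}, desaturated color grade, "
        "film grain, shallow depth of field"
    ),
    "hdr_vibrant": "{prompt}, HDR, vibrant saturated colors",
}


def apply_style(prompt: str, style: str = "none") -> str:
    """Wrap the user's prompt in the chosen style template."""
    template = STYLE_TEMPLATES[style]
    return template.format(prompt=prompt.strip())


if __name__ == "__main__":
    base = "alien warrior, close-up portrait, green skin"
    # Print the final text each style would send to the model.
    for name in STYLE_TEMPLATES:
        print(f"{name}: {apply_style(base, name)}")
```

With the hypothetical 'cinematic' preset, the final prompt contains both "green skin" and "desaturated color grade", which is one plausible reason a color request can come out muted.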

💡4K data set

A 4K data set refers to a collection of training images at 4K resolution, which has four times as many pixels as a standard 1080p image (twice the width and twice the height). In the video, the speaker mentions that the new AI model has been trained on a 4K data set, which contributes to the high level of detail and quality in the images it generates. This high-resolution training data allows the AI to produce more intricate and realistic visual content.
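
The size difference between 4K and 1080p can be checked with a little arithmetic. The sketch below assumes "4K" means consumer UHD (3840×2160); the video does not state the exact resolution of the training images.

```python
# Assumption: "4K" means UHD 3840x2160; "1080p" means Full HD 1920x1080.
UHD_4K = (3840, 2160)
FULL_HD = (1920, 1080)


def pixel_count(size):
    """Return the total number of pixels for a (width, height) pair."""
    width, height = size
    return width * height


ratio = pixel_count(UHD_4K) / pixel_count(FULL_HD)
print(pixel_count(UHD_4K), pixel_count(FULL_HD), ratio)
# prints: 8294400 2073600 4.0
```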

💡Human tagging

Human tagging involves individuals manually labeling or categorizing images with keywords or descriptions. This process is crucial for training AI models to recognize and generate specific visual elements based on the tags. In the video, the speaker highlights that the AI model's training images were hand-captioned and tagged by humans, which helps the model to understand and produce images that align with the tags, leading to more accurate and relevant outputs.

💡Automatic 1111

Automatic 1111 (usually written AUTOMATIC1111) is a widely used open-source web interface for running Stable Diffusion models. The transcript does not define it, but the speaker suggests it as an additional tool for refining the generated images, for example to add detail or enhance certain aspects of a character.

💡Face paintings

Face paintings in this context refer to the visual elements of the AI-generated images, specifically the depiction of facial features and any artistic enhancements such as paint or markings. The speaker is interested in the AI model's ability to accurately and creatively represent face paintings, which adds to the realism and artistic quality of the generated images. They experiment with prompts that include face paintings to evaluate the model's performance in this area.

💡Color grading

Color grading is the process of altering and enhancing the colors in an image or video to achieve a specific visual style or mood. In the video, the speaker discusses how the AI model can produce images with a cinematic style, which often involves color grading to create a more desaturated and film-like appearance. The speaker appreciates this feature, as it contributes to the realism and high production value of the AI-generated images.


The speaker has found a new favorite model that surpasses the Juggernaut variants in their opinion.

The new model has been trained further than Juggernaut and has more input images.

The speaker values realistic images and believes the new model gets them closer to achieving true realism.

The model uses over 10,000 hand-captioned and tagged images for training, which helps with accurate prompting and training data.

The model has been tested thoroughly by the speaker, who has had access to it for quite some time.

The speaker mentions that the model is paid for and sponsored but emphasizes that their positive opinion is genuine.

The model is trained for all art styles and realism with a 4K data set.

The average model uses 1,000 to 2,000 training images and 1.8 million training steps, whereas the new model has even more extensive training.

The speaker demonstrates the model's capabilities by generating various images with different prompts and styles.

The speaker notes that specific prompts, like 'sunglasses at night,' can produce interesting and creative results.

The speaker discusses the importance of human-tagged training data in reducing potential errors from computer tagging.

The new model does not require a refiner and is capable of producing high-quality images straight out of the box.

The speaker shares tips on how to improve prompts and achieve better results, such as specifying eye colors or adjusting the clip skip value.

The speaker compares the new model's output to other models like Juggernaut and DreamShaper, noting the differences in saturation and realism.

The speaker concludes that the new model, Think Diffusion, provides a very realistic experience without an overly saturated plastic feel.

The speaker invites feedback and suggestions from the audience, showing openness to exploring other models.

The speaker emphasizes the preference for a cinematic, more realistic style, and shares their satisfaction with the model's performance.