
Understanding Guidance Scale in Stable Diffusion: A Beginner's Guide

The Guidance Scale, also known as the Classifier-Free Guidance (CFG) scale, controls how closely Stable Diffusion adheres to the text prompt. Essentially, it determines how strongly the generated image mirrors the input text.

Shanmukha Karthik

15 Mar 2024 • 5 min read

Stable Diffusion generated art is a fascinating field where artificial intelligence is used to create stunning and unique pieces of art. One of the key parameters that influence the outcome of this process is the Guidance Scale.

The Guidance Scale, also known as the Classifier-Free Guidance (CFG) scale, controls how closely Stable Diffusion adheres to the provided text prompt during the image generation process. In other words, it determines the extent to which the generated image reflects the input text.
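To make the idea concrete, here is a minimal sketch of how classifier-free guidance is commonly computed. It assumes the widely used formulation in which the model makes two noise predictions at each step, one without the prompt and one with it, and the guidance scale decides how far to push the result toward the prompted one. The function name `apply_guidance` and the toy numbers are illustrative only and do not come from any particular library.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine two noise predictions in the classifier-free guidance style.

    uncond: prediction made with an empty prompt (toy list of floats here)
    cond:   prediction made with the text prompt
    """
    # Start from the unprompted prediction and move toward the prompted one.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy numbers standing in for real noise predictions (assumed for illustration).
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(uncond, cond, scale))
```

In this formulation, a scale of 1 returns the prompted prediction unchanged, and larger values exaggerate the difference between the prompted and unprompted predictions, which is why high values follow the prompt more forcefully.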

Impact of Guidance Scale on Image Quality

A higher Guidance Scale value means that the AI will follow the text prompt more strictly. This is useful when you want the generated art to closely match a specific description. However, a higher Guidance Scale also restricts the AI’s creative freedom, which can result in less diverse and potentially lower-quality images.

On the other hand, a lower Guidance Scale value gives the AI more freedom to interpret the text prompt creatively. This can lead to more diverse and unexpected results, which might be desirable in certain contexts.

How to Optimize the Guidance Scale?

It’s important to strike a balance. If the Guidance Scale is too low, the AI might produce images that bear little resemblance to the text prompt. If it’s too high, the images might lack creativity and appear too literal or constrained.

In most tools, the guidance scale ranges from 1 to 20. At the low extreme, a value of 1 effectively switches guidance off, so the prompt has only a weak influence on the image. At the high extreme, a value of 20 follows the prompt very strictly, but usually at the cost of image quality.

How to Choose the Right Guidance Scale?

With the guidance scale serving as the control mechanism, the most ‘creative’ and ‘artistic’ results are usually generated at a guidance scale of around 7 to 12. Values up to 15 still produce results with few or no artifacts.

The recommended guidance scale value is typically between 7 and 9. You can increase it when the generated image does not follow the prompt.

If you are trying to generate an image with many fine details specified in the prompt, you can start with a higher guidance scale, between 12 and 16.

Setting the right value for the Guidance Scale often involves some trial and error, and the optimal value varies with the complexity of the text prompt and the desired level of creativity in the output. It’s always best to experiment with different scales to see what works best for your specific use case.
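One way to structure that experimentation is to run the same prompt at several guidance scales while keeping the random seed fixed, so that the guidance scale is the only thing that changes between images. The sketch below assumes a hypothetical `generate` callable that you supply yourself; `sweep_guidance` and `fake_generate` are illustrative names, and the stand-in generator only returns a description string so the example runs without a model.

```python
def sweep_guidance(generate, prompt, scales, seed=42):
    """Run one prompt at several guidance scales with a fixed seed.

    `generate` is any callable taking (prompt, guidance_scale, seed) and
    returning an image. It is a placeholder for your own generation call.
    """
    results = {}
    for scale in scales:
        # Same prompt and seed each time; only the guidance scale varies.
        results[scale] = generate(prompt, guidance_scale=scale, seed=seed)
    return results


# Stand-in generator (hypothetical) so the sketch runs without a model.
def fake_generate(prompt, guidance_scale, seed):
    return f"image(prompt={prompt!r}, scale={guidance_scale}, seed={seed})"


images = sweep_guidance(fake_generate, "cherry blossoms, bonsai", [5, 9, 13])
for scale, image in images.items():
    print(scale, image)
```

If you use the Hugging Face diffusers library, its Stable Diffusion pipeline accepts a `guidance_scale` argument, so `generate` could be a thin wrapper around that call. Fixing the seed matters because otherwise differences between images may come from the random starting noise and not from the guidance scale.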

Comparing Results with Different Guidance Scale Settings

Let us generate a few images with fixed prompts and check how different guidance scale values affect the image generation process.

Take this example where we would like to generate the image of a handsome male bounty hunter dressed in the Victorian era style.

Prompt: male human bounty hunter treasure seeker, with a pistol in his right hand, fantasy, Victorian era

Different images generated for respective guidance scale values of 5, 9 and 12

In the first image, the bounty hunter prompt has produced a character with a rugged, intense expression, wearing an intricate outfit of layered metal armor plates, straps, pouches, and protective gear. The result is clearly a fantasy character, but with the lower guidance scale of 5 the image doesn't fully adhere to the visual style suggested in the prompt.

We can see that the final image generated with the guidance scale value of 12 adheres to the prompt completely. The color palette is predominantly muted shades of brown, black, and gray, contributing to the overall gritty and weathered visual style.

Let us now take the example of a cherry bonsai tree, where we expect the image to have vibrant colors and pastel shades.

Prompt: cherry blossoms, bonsai, Japanese style landscape , high resolution , 8k , lush greens in the background .

Different images generated for respective guidance scale values of 5, 9 and 13.

For the first image, the guidance scale value was set to 5. The generated image shows a sculpted bonsai tree in the foreground, surrounded by what appears to be a lake or river, which the prompt does not mention. This can be attributed to the low guidance scale value, which keeps the image from fully adhering to the prompt.

The second image was generated from the same prompt with the guidance scale set to 9. It differs slightly from the previous image, with changes in the composition and in the texture of the bonsai plant.

The last image was created with the guidance scale set to 13. Its composition blends the Japanese elements of the prompt, including buildings with pagoda-style roofs in the background, set among lush greenery.

For the final prompt example, let us try to generate an image in the style of the Tintin comics, with lots of people.

Prompt: detailed pen and ink illustration of a New York neighborhood by Herge, in the style of tin-tin comics, vibrant colors, detailed, lots of people, night time

Images rendered for guidance scale values of 4, 9, 14

The impact of the low guidance scale value set for the first image is reflected in its appearance. The prompt is only partially followed: the image depicts a densely packed cityscape, but the vibrant colors specified in the prompt are barely present. The sheer density and variety of architectural styles create a sense of organized chaos, capturing the essence of a thriving and diverse urban environment.

The second and third images differ markedly from the first, not only in composition but also in their use of color. The foreground is filled with a lively street scene, featuring vintage cars, storefronts, and crowds of pedestrians going about their daily activities. The buildings are depicted in bold, vibrant colors that contrast with the cool tones of the sky.

Conclusion

The role of the Guidance Scale in Stable Diffusion generated art is to help control the balance between adherence to the text prompt and creative freedom. This allows for a wide range of outcomes, from strictly literal interpretations to more abstract and imaginative renditions, depending on the desired result. It’s a powerful tool that artists can use to guide the AI in creating art that aligns with their vision.