Description of the problem

CLIP has a 77 token limit, which is much too small for many prompts.

Several GUIs have found a way to overcome this limit, but not the diffusers library.

The solution I'd like

I would like diffusers to be able to run longer prompts and overcome the 77 token limit of CLIP for any model, much like the AUTOMATIC1111/stable-diffusion-webui already does.

Alternatives I've considered

  • I tried reverse-engineering the prompt interpretation logic from one of the other GUIs out there (not sure which one), but I couldn't find the code responsible.

  • I tried running BAAI/AltDiffusion in diffusers, which uses AltCLIP instead of CLIP. Since AltCLIP has a max_position_embeddings value of 514 for its text encoder instead of 77, I had hoped I could just replace the text encoder and tokenizer of my models with those of BAAI/AltDiffusion to overcome the 77 token limit, but I couldn't get BAAI/AltDiffusion to work in diffusers.

Additional context

This is how the AUTOMATIC1111 webui overcomes the token limit, according to their documentation:

Typing past the standard 75 tokens that Stable Diffusion usually accepts increases the prompt size limit from 75 to 150. Typing past that increases prompt size further. This is done by breaking the prompt into chunks of 75 tokens, processing each independently using CLIP's Transformers neural network, and then concatenating the result before feeding it into the next component of Stable Diffusion, the Unet.

For example, a prompt with 120 tokens would be separated into two chunks: the first with 75 tokens, the second with 45. Both would be padded to 75 tokens and extended with start/end tokens to 77. After passing those two chunks through CLIP, we'll have two tensors with shape (1, 77, 768). Concatenating those results in a (1, 154, 768) tensor that is then passed to the Unet without issue.
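The chunking described above can be sketched in plain Python. This is a simplified stand-in, not the actual webui code: it assumes CLIP's usual special-token ids (49406 for start, 49407 for end) and pads with the end token, which is one common convention.

```python
# Sketch of AUTOMATIC1111-style prompt chunking (assumed ids:
# 49406 = CLIP start token, 49407 = CLIP end token; padding reuses
# the end-token id).
BOS, EOS, CHUNK = 49406, 49407, 75

def chunk_token_ids(token_ids):
    """Split raw token ids (without start/end tokens) into chunks:
    each chunk is padded to 75 tokens, then wrapped to length 77."""
    chunks = []
    for i in range(0, len(token_ids), CHUNK):
        chunk = token_ids[i:i + CHUNK]
        chunk = chunk + [EOS] * (CHUNK - len(chunk))  # pad to 75
        chunks.append([BOS] + chunk + [EOS])          # extend to 77
    return chunks

# A 120-token prompt becomes two 77-token chunks, as in the example above.
chunks = chunk_token_ids(list(range(120)))
```

Each resulting chunk can then be passed through the text encoder independently, and the per-chunk outputs concatenated along the sequence dimension.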


Hey @jslegers, the Long Prompt Weighting Stable Diffusion community pipeline gets rid of the 77 token limit. You can check it out here


@apolinario:

I have the same question / remark I made at https://github.com/huggingface/diffusers/issues/2135.

Most people aren't going to figure out on their own that there is a dedicated pipeline to get rid of the 77 token limit. I sure wasn't able to find this info until you provided me a link... and I'm a dev with more than a decade of experience.

It's also not exactly user-friendly to have a dedicated pipeline for what's a pretty important feature that almost every Stable Diffusion user is likely to want (since it doesn't take much to surpass 77 tokens).

So why not just bake support for 77+ tokens into StableDiffusionPipeline?


Hey @jslegers,

It's true that our documentation is currently lagging behind a bit. Would you be interested in contributing a doc page about long prompting?

Also note that I would suggest just using StableDiffusionPipeline and passing the prompt_embeds manually; e.g., the following code snippet works:

from diffusers import StableDiffusionPipeline
import torch

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Forward embeddings and negative embeddings through text encoder
prompt = 25 * "a photo of an astronaut riding a horse on mars"
max_length = pipe.tokenizer.model_max_length

input_ids = pipe.tokenizer(prompt, truncation=False, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

negative_ids = pipe.tokenizer("", truncation=False, padding="max_length", max_length=input_ids.shape[-1], return_tensors="pt").input_ids
negative_ids = negative_ids.to("cuda")

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
    neg_embeds.append(pipe.text_encoder(negative_ids[:, i: i + max_length])[0])

prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

# 3. Forward
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
image.save("astronaut_rides_horse.png")
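The loop in the snippet above concatenates per-chunk encoder outputs along the sequence dimension (dim=1). The shape bookkeeping can be checked with a pure-Python stand-in, where a fake "encoder" maps a window of token ids to a (1, len(ids), 768)-shaped nested list; this is only an illustration of the concatenation logic, not the real text encoder:

```python
# Stand-in check for the chunk-and-concatenate logic above.
HIDDEN = 768  # CLIP ViT-L/14 hidden size, as in the (1, 77, 768) example

def fake_encode(ids):
    """Pretend text encoder: one 768-dim vector per token, batch size 1."""
    return [[[0.0] * HIDDEN for _ in ids]]  # shape (1, len(ids), 768)

def encode_long(input_ids, max_length=77):
    """Encode token ids window by window, then join along the sequence axis."""
    pieces = []
    for i in range(0, len(input_ids), max_length):
        pieces.append(fake_encode(input_ids[i:i + max_length]))
    # concatenate along dim=1 (the sequence dimension)
    return [sum((p[0] for p in pieces), [])]

# Two 77-token windows -> sequence length 154, matching the
# (1, 154, 768) tensor described earlier in the thread.
embeds = encode_long(list(range(154)))
```

With the real pipeline, torch.cat(..., dim=1) performs the same join on actual tensors.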

Could you try this out to see whether it fits your use case? Would you be interested in adding a doc page about long prompting, maybe under https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers


Also note that I would suggest just using StableDiffusionPipeline and passing the prompt_embeds manually; e.g., the following code snippet works:

[...]

Could you try out whether this fits your use case?

It's an interesting approach and definitely more in line with what I'm looking for...

I'll need to try this on my demos and test scripts before I can comment on it further, but it looks promising as an approach for at least personal use...

I'd still argue this is a bit convoluted for something that Stable Diffusion should support out of the box, but I guess that's something RunwayML and StabilityAI should fix (by replacing CLIP with an alternative that supports more tokens) and not something the diffusers library is responsible for.

Would you be interested in adding a doc page about long prompting, maybe under https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

I'll take that into consideration, on the condition that I'm allowed to post that same content on my own blog(s) as well.

I was planning to do some tutorials on how to use Stable Diffusion anyway, so I might as well make some of that content official documentation.


Feel free to use any content from diffusers in whatever way you like :-) It's MIT licensed


@patrickvonplaten

Feel free to use any content from diffusers in whatever way you like :-) It's MIT licensed

Good to know...

Wasn't sure that license applied to documentation as well.

I'm not a lawyer, and I prefer to make as few assumptions as possible when it involves legal matters...


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
