Like diffusion but faster: The Paella model for fast image generation (deeplearning.ai)
150 points by webmaven on June 26, 2023 | 44 comments


Note that Paella is a bit old in image model terms (Nov 2022) and modern stable diffusion tools have access to optimized workflows.

My 3060 can generate a 256x256, 8-step image in 0.5 seconds, no A100 needed. A 3090 is double the performance of a 3060 at 512x512, and an A100 is 50% faster than a 3090...

If you have access to a high-end consumer GPU (4090) you can generate 512x512 images in less than a second; it's reached the point where you can increase the batch size and have it show 2-4 images per prompt without adversely affecting your workflow.

Too bad SD1.5 is too small* and we'll need models with more parameters if we want a true general-purpose image model. If SD1.5 were the end-game, we'd have truly instant high-res image generation in just a couple more generations of GPUs: think generating images in real time as you type the prompt, or sliders that adjust the strength of certain tokens with the effects visible in real time. Though I heard that SDXL is actually faster at higher resolutions (>1024x1024) due to removing attention on the first layer, making it scale better with resolution even though SDXL has about 4x the parameter count.

* Current SD1.5 models that can generate consistently high-quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost, e.g. they can be great at generating landscapes but weak at generating humans, or very good at a certain style like comics but only that style, losing the ability to generate more varied faces, etc.


Personally I think the SD1.5 trade-off (loss of general knowledge for surprisingly high-quality images on consumer hardware) is worth it.

It's fairly impressive to me what the community has made possible with SD1.5. Sure, on a vanilla task something like DALL-E 2 generally performs better, but with some tweaking you can easily beat DALL-E on a home gaming PC.

The fact that you can fine-tune SD1.5 on a 4090 is incredible to me.

Given how much powerful AI is locked behind fees and private APIs, it's refreshing to see so much cool stuff coming out of the OSS world again. Best of all, it's not being driven exclusively by people with an ML background, but more so by curious amateurs. It really brings me back to a time when playing around with software/the web felt exciting.


I am really pleasantly surprised the field is kind of open and not drowning in software patents.


Fully correct. Also, v2 of the paper introduced a model that is bigger and slower but generates better images, so the 500ms figure was only for the first model we introduced in v1. I also want to mention our new work, as it is very much related to this whole topic of speeding up models (either training or sampling): Würstchen: https://github.com/dome272/wuerstchen/ With a current version we are training at the moment, we can sample (using torch.compile) 4 1024x1024 images in 4 seconds. The training of this kind of model is also very fast, because images are spatially encoded much, much more aggressively: 3x512x512 images -> 16x12x12 latents, i.e. roughly 42x spatial compression, whereas Stable Diffusion has an 8x compression (3x512x512 -> 4x64x64).
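
As a quick sanity check of those compression figures, here is the per-axis arithmetic (shapes copied from above; this is just illustrative arithmetic, not code from either repo):

    # Per-axis spatial compression, using the shapes quoted above.
    wuerstchen_image, wuerstchen_latent = (3, 512, 512), (16, 12, 12)
    sd_image, sd_latent = (3, 512, 512), (4, 64, 64)

    def spatial_compression(image_shape, latent_shape):
        # Ratio of image side length to latent side length (height axis, ignoring channels).
        return image_shape[1] / latent_shape[1]

    print(spatial_compression(wuerstchen_image, wuerstchen_latent))  # ~42.7
    print(spatial_compression(sd_image, sd_latent))                  # 8.0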


All diffusion models are quite inefficient due to running in PyTorch eager mode (with torch.compile being kinda janky in practice on 2.0/2.1).

I would be more interested to see Paella vs. SD running on an ML compiler framework, like TVM or AITemplate. Maybe one or the other is more amenable to optimization.


A framework could make the programmer think they are in eager mode, while actually making graphs behind the scenes.

The trick is simply not to do any calculations until the last possible moment, i.e. when the program tries to convert the finished image to a JPEG. Only at that point do you compile the graph and run the actual computation on the GPU.

You then also cache the graph, so that the compilation step can be avoided if the program tries to do the same computation again with different data.
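
A toy sketch of that idea (hypothetical Lazy class, not any real framework's API): operations only record graph nodes, and the graph is compiled and cached the first time a concrete value is demanded.

    # Toy sketch: build a graph lazily, compile on first use, cache by graph shape.
    _cache = {}  # graph signature -> "compiled" function

    class Lazy:
        def __init__(self, op, inputs, value=None):
            self.op, self.inputs, self.value = op, inputs, value

        @staticmethod
        def const(v):
            return Lazy("const", [], v)

        def __add__(self, other):
            return Lazy("add", [self, other])

        def __mul__(self, other):
            return Lazy("mul", [self, other])

        def _sig(self):
            # Structure-only key: same ops -> same compiled graph, data may differ.
            return (self.op, tuple(i._sig() for i in self.inputs))

        def materialize(self):
            # The "last possible moment", e.g. when saving the image as a JPEG.
            sig = self._sig()
            if sig not in _cache:
                _cache[sig] = _compile()  # expensive step, done once per graph shape
            return _cache[sig](self)

    def _compile():
        # Stand-in for a real compiler; a real one would emit fused GPU kernels.
        def run(node):
            if node.op == "const":
                return node.value
            a, b = [run(i) for i in node.inputs]
            return a + b if node.op == "add" else a * b
        return run

    a, b = Lazy.const(2.0), Lazy.const(3.0)
    print((a * b + a).materialize())  # 8.0; nothing actually ran until this line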


... Kinda like torch.compile?

That approach is limited though. AITemplate and TVM take a looong time to compile and produce standalone executable files, hence the gains are much larger than with torch.compile's Triton backend.


I find your post intriguing. What would you say are the major janks with torch.compile, and what issues are addressed by TVM/AITemplate but not by torch.compile?

EDIT: If I understand correctly these libraries target deployment performance, while torch.compile is also/mostly for training performance?


- The gain in stable diffusion is modest (15%-25% last I checked?)

- Torch 2.0 only supports static inputs. In actual usage scenarios, this means frequent lengthy recompiles.

- Eventually, these recompiles will overload the compilation cache and torch.compile will stop functioning.

- Some common augmentations (like TomeSD) break compilation, force recompiles, make compilation take forever, or kill the performance gains.

- There are other miscellaneous bugs, like compilation freezing the Python thread and causing networking timeouts in web UIs, or errors with embeddings.

- Dynamic input in Torch 2.1 nightly fixes many of these issues, but was only maybe working a week ago? See https://github.com/pytorch/pytorch/issues/101228#issuecommen...

- TVM and AITemplate have massive performance gains. ~2x or more for AIT, not sure about an exact number for TVM.

- AIT supported dynamic input before torch.compile did, and requires no recompilation after the initial compile. Also, weights (models and LORAs) can be swapped out without a recompile.

- TVM supports very performant Vulkan inference, which would massively expand hardware compatibility.

Note that the popular SD Web UIs don't support any of this, with two exceptions I know of: VoltaML (with WIP AIT support) and the Windows DirectML fork of A1111 (which uses optimized ONNX models, I think). There is about 0% chance of ML compilation support in A1111, and the HF diffusers UIs are less bleeding edge and performance/compatibility focused.

And yes, the Triton torch.compile backend is aimed at training. There is an alternative backend (Hidet) that explicitly targets inference, but it does not work with Stable Diffusion yet.
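
For concreteness, this is roughly how those options get selected; treat it as a sketch, since exact behavior depends on the PyTorch (and Hidet) versions installed:

    import torch

    model = torch.nn.Linear(8, 8)  # stand-in for e.g. an SD UNet

    # Default Inductor/Triton backend; dynamic=True (2.1+) marks shapes as dynamic
    # so changing the input resolution doesn't force a full recompile.
    compiled = torch.compile(model, dynamic=True)
    out = compiled(torch.randn(4, 8))

    # Alternative inference-oriented backend (requires the `hidet` package):
    # compiled = torch.compile(model, backend="hidet")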


Thanks for the info. I didn't know about the TomeSD stuff, really interesting. Why do you think that AITemplate is so much faster?


(And for reference, AITemplate roughly doubles SD 1.5's speed. Not sure if that's good or if Paella would have even more room for automatic optimization.)


> Current SD1.5 models that can generate consistent high quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost,

You say that as if it was a bad thing, but it's actually good: GPU memory being a limiting factor means that general knowledge is mostly overhead. It is much better to have 50 specialized models (which you can all store on disk for cheap) that each take 5 times less GPU memory than a big general model that you'll constantly under-use but still have to load entirely into GPU memory. And it's even more true for LLMs.


Half a second. Can't even type a good prompt in that fast.

One of my dad's preferred anecdotes about how much computers sped up in his career was the number of digits of pi that the company mainframe could compute.

He was born in '39.

And now I can generate images from descriptions faster than I can give those descriptions.

At this rate, websites will be replaced with image generators and LLMs, and the loading speed won't change.


Just to be clear, it's half a second for 256x256, where Stable Diffusion takes 3.2 seconds. Still a great speed-up, but not producing the big hi-res images people might be thinking of.


> Just to be clear, it's half a second for 256x256

But to continue their story, a lot of us used to play games at a 256x256 resolution.

So in the grand scheme of "things improve quickly, by a lot" it very much applies.


Stable Diffusion takes 3.9 seconds to produce 8x 256x256 images (so about 0.5 seconds each), or 2 seconds to produce 1x 256x256 images. That's with DPM++ 2M Karras sampling with 20 iterations on a 3090.


I was just referring to the article with those numbers. See the "results" section, where they talk about their hardware and such, for apples-to-apples comparison.


The load speed for modern websites is mostly due to all the trackers rather than the actual UI so I don't think load speed would actually be affected at all.


At this rate, we will be replaced.


GitHub, for those looking for the code: https://github.com/dome272/Paella


Way back when, it was pretty easy to recognize diagrams created with MacDraw. There was a particular visual style to the primitives it included that flowed through to the final product. This was of course easier to notice because there were so few alternatives at the time.

Given that Paella uses tokens instead of the source image, I wonder if the results will have a (human- or machine-) detectable "style" to them.


The usual answer to "all AI art looks the same" is https://i.redd.it/jvwyyqn7776a1.jpg


Perhaps if you ask an 11 year old.


A question out of curiosity: why can't it train on 256x256 pixels yet generate any size image? And if it was trained on multiple sizes of images, could you also generate at a larger size without upscaling?


Yeah, you can. It's the same as with any other CNN-based model that is not forced to have a specific shape (unlike transformers if they use specific positional embeddings). You can also look at the blog post, where different-resolution images are generated: https://laion.ai/blog/paella/
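
To illustrate the CNN point with a minimal sketch (plain PyTorch, not Paella itself): a fully convolutional network fixes only the channel count, so the same weights run at any spatial resolution.

    import torch

    net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
        torch.nn.ReLU(),
        torch.nn.Conv2d(16, 3, kernel_size=3, padding=1),
    )

    # Same weights, different spatial sizes; only the channel dim (3) is fixed.
    print(net(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
    print(net(torch.randn(1, 3, 384, 512)).shape)  # torch.Size([1, 3, 384, 512])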


I also explain more in the video I made about Paella (https://youtu.be/zdE1I6kYKYc). Maybe that clarifies things more.


Transformers are not forced to use a specific input (or output) shape; the original ViT paper demonstrates interpolating positional embeddings to run inference at arbitrary image shapes.


As someone who doesn't know the "why" but uses Stable Diffusion a lot and has an intuitive feel for the "what" of what happens: it's like trying to use a low-res pattern for your wallpaper on Windows. It'll either just repeat over and over, so you end up with weird multi-headed people with heads on top of their heads, or you upscale, which hallucinates details in a totally different way that doesn't really add new interesting detail.

With automatic1111 you can get around this by upscaling then inpainting the spots you want more detail and specifying a specific prompt for that particular area.


Well, it's complicated.

The model can't just work on arbitrary image sizes because the model was trained with a fixed number of input and output neurons. For example, 512x512 is Stable Diffusion's "native size." However, there are tricks to work around this.

Diffusion models work by predicting image noise, which is then subtracted from the image iteratively until you get a result that matches the prompt. Stable Diffusion specifically has the following architectural features:

- A Variational Autoencoder (VAE) layer that encodes the 512x512 input into a 64x64 (4-channel) latent space[0]

- Three cross-attention blocks that take the encoded text prompt and input latent-space image, and output a downscaled image to the next layer

- A simpler downscaling block that just has a linear and convolutional layer

- Skip connections between the last four downscaling blocks and corresponding upscaling blocks that do the opposite, in the opposite order (e.g. simple upscale, then three cross-attention blocks).

- The aforementioned opposite blocks (upscale + cross-attn upscale)

- VAE decoder that goes from latent space back to a 512x512 output

At the end of this process you get what the combined model thinks is noise in the image according to the prompt you gave it. You then subtract the noise and repeat for a certain number of iterations until done. So obviously, if you wanted a smaller image, you could crop the input and output at each iteration so that the model can only draw in the 'center'.
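
Schematically, the subtract-and-repeat loop above looks something like this (a toy sketch with a stand-in noise predictor and a deliberately crude update rule, not SD's actual scheduler math):

    import torch

    def sample(unet, text_emb, steps=20, shape=(1, 4, 64, 64)):
        """Toy denoising loop: predict the noise, remove a bit of it, repeat."""
        x = torch.randn(shape)                  # start from pure latent noise
        for t in reversed(range(steps)):
            pred_noise = unet(x, t, text_emb)   # model's guess at the noise
            x = x - pred_noise / steps          # crude update; real samplers
                                                # (DDIM, DPM++, ...) are smarter
        return x                                # then decode with the VAE

    # Usage with a dummy predictor, just to show the shape of the loop:
    latents = sample(lambda x, t, emb: 0.1 * x, text_emb=None)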

Larger images are a bit trickier: you have to feed the image through in halves and then merge the noise predictions together before subtracting. This of course has limitations: since the model is looking at only half the image, there's nothing to steer the overall process, so it will draw things that look locally coherent but make no sense globally[1].
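
A rough sketch of that split-and-merge step (hypothetical helper; frontends that do tiled sampling each have their own blending logic):

    import torch

    def tiled_noise_pred(unet, x, t, emb, tile=64, overlap=16):
        """Run the noise predictor on overlapping tiles, averaging the overlaps."""
        _, _, h, w = x.shape
        total = torch.zeros_like(x)
        count = torch.zeros_like(x)
        for y0 in range(0, h, tile - overlap):
            for x0 in range(0, w, tile - overlap):
                y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
                total[:, :, y0:y1, x0:x1] += unet(x[:, :, y0:y1, x0:x1], t, emb)
                count[:, :, y0:y1, x0:x1] += 1
        return total / count  # merged noise prediction; subtract it as usual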

I suspect - as in, I'm totally guessing here - that we might be able to fix that by also running the diffusion process on a downscaled version of the image and then scaling the noise prediction back up to average with the other outputs. As far as I'm aware no SD frontends do this. But if that worked you could build up a resolution pyramid of models at different sizes taking fragments of the image and working together to denoise the image. If you were training from scratch you could even add scale and position information to the condition vector so the model can learn what image features should exist at what sizes.

[0] Think of this like if every pixel of the latent-space image was, instead of RGB, four different channels worth of information about the distribution of pixels in the color-space image. This compresses the image so that the U-Net part of the model can be architecturally simpler - in fact, lots of machine learning research is finding new ways to compress data into a smaller amount of input neurons.

[1] More so than diffusion models normally do


Presumably a transformer model or similar that uses positional encodings for the tokens could do that, but the U-Net decoder here uses a fixed-shape output and learns relationships between tokens (and sizes of image features) based on the positions of those tokens in a fixed-size vector. You could still apply this process convolutionally and slide the entire network around to generate an image that is an arbitrary multiple of the token size, but image content in one area of the image will only be "aware" of image content at a fixed-size neighborhood (e.g. 256x256).


This is not how it works. There is no sliding window; the model has no restrictions on the W and H dimensions, only on the C dim, so the U-Net can actually be used directly on images with different aspect ratios and resolutions, and the attention layers attend to the entire image.


From Sections 3 and 4 of the VQGAN paper[1] upon which this work is built: "To generate images in the megapixel regime, we ... have to work patch-wise and crop images to restrict the length of [the quantized encoding vector] s to a maximally feasible size during training. To sample images, we then use the transformer in a sliding-window manner as illustrated in Fig.3." ... "The sliding window approach introduced in Sec.3.2 enables image synthesis beyond a resolution of 256×256 pixels."

From the Paella paper[2]: "Our proposal builds on the two-stage paradigm introduced by Esser et al. and consists of a Vector-quantized Generative Adversarial Network (VQGAN) for projecting the high dimensional images into a lower-dimensional latent space... [w]e use a pretrained VQGAN with an f=4 compression and a base resolution of 256×256×3, mapping the image to a latent resolution of 64×64 indices." After training, in describing their token predictor architecture: "Our architecture consists of a U-Net-style encoder-decoder structure based on residual blocks, employing convolutional[sic] and attention in both, the encoder and decoder pathways."

U-Net, of course, is a convolutional neural network architecture. [3]. The "down" and "up" encoder/decoder blocks in the Paella code are batch-normed CNN layers. [4]

[1] https://arxiv.org/pdf/2012.09841.pdf [2] https://arxiv.org/pdf/2211.07292.pdf [3] https://arxiv.org/abs/1505.04597 [4] https://github.com/dome272/Paella/blob/main/src/modules.py#L...


The transformer model used in the VQGAN paper has nothing to do with the autoencoder; it is used to predict the quantized tokens. So no, you don't need to slide a window with the U-Net; you can directly predict images with different aspect ratios and resolutions, like with SD.


[flagged]


I think it's an analogy.

Paella is a Spanish rice dish: rice cooked with various other ingredients like chicken, seafood, peppers, tomatoes, etc.

If an image (a 2-d array of pixels) is a plate of rice, standard diffusion models denoise starting from each grain of rice (pixel). If you've ever seen a step-by-step output from a diffusion model, you know what I'm talking about.

This model makes use of a CNN (convolutional neural net) to decode tokens from the image. A CNN takes m-by-m sections of elements (i.e. a square of pixels) as input, and translates them into 1d vectors as part of the input to the NN. Taking "chunks" of the image like this allows the net to learn things like edges, shapes, groups of color, etc.

You could consider the convolutional samples as "chunks" in the paella: the meat, vegetables, and other goodies that make the dish beloved by many.


"Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting."

https://news.ycombinator.com/newsguidelines.html


I don’t understand your parenthetical reference - paella and patella are very different things…

In any case, I took Paella to be a play on infusion (used in making paella) vs. diffusion.

Or maybe these guys just really like rice.


Paella is Spanish food; one of the researchers is Spanish, and among the first things the model was good at was food.


That's the correct answer. I love the previous idea posted above, but it was really just that the model was good at food initially and Pablo is from Spain, so we decided to name it after a Spanish dish. We are holding up this tradition, and follow-up models were called Arroz-Con-Cosas, Risotto and Würstchen.


[flagged]


If that's the reason why you're whining, then I will explain a simple trick: "Paella model"


No worse than Apple computers, Internet cookies, email spam...


Their main claim of “faster” unfortunately is false.

> Running on an Nvidia A100 GPU, Paella took 0.5 seconds to produce a 256x256-pixel image in eight steps, while Stable Diffusion took 3.2 seconds

Using the latest methods (torch 2.0 compile, improved schedulers), Stable Diffusion only takes about 1 second to generate a 512x512 image on an A100 GPU. A 256x256 image, 1/4 the size, presumably takes less than half that time.
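
For reference, that setup looks roughly like this with HF diffusers (a sketch; exact timings depend on the GPU, scheduler, and step count):

    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Improved scheduler: DPM++ gets good results in ~20 steps instead of 50.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    # torch 2.0 compile of the UNet, where most of the compute is spent.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

    image = pipe("a plate of paella", num_inference_steps=20).images[0]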

So the corrected title is “Like diffusion but slightly slower and lower quality.”


Hey, one of the authors here. First of all, the blog post is talking about the v1 of the paper, which was extremely fast but not comparable to SD in any way. The v2 on arXiv is slower and does not achieve 0.5 seconds, but performs much better and closer to SD. So no doubt on this. But I just want to mention that you should not compare apples with oranges. torch.compile also makes Paella much faster, and with an optimized sampling pipeline it would always be faster than SD at 256x256 if you keep the conditions the same. Of course you could talk about distilling SD and then maybe achieve 1-step predictions etc., but you could probably do the same to Paella. I think it's important to stick with the main improvement from the paper: that naive sampling, sticking to the original method, can be done with far fewer steps while being simple in its theory and implementation. But hey, way to go and improve on Paella in the future maybe :D


Hi! I really appreciate folks like you conducting and publishing real research. There have been a ton of companies recently which have been very rosily promoting their new models. My criticism was only to push back on overly optimistic marketing, and regret that some of it was directed at you. If you have a link to the v2 paper, would love to take a look!


To add: there's a fine-tuned version of Stable Diffusion 1.5 that can output 5 fps at 256x256 (0.2 seconds per image)[0]. So over 2x faster than Paella at 256x256.

[0]:https://www.reddit.com/r/StableDiffusion/comments/z3m97e/min...



