
Fully correct. Also, v2 of the paper introduced a model that is bigger and slower but generates better images, so the 500ms figure applied only to the first model we introduced in v1. I also want to mention our new work, since it is very much related to this whole topic of speeding up models (either training or sampling): Würstchen: https://github.com/dome272/wuerstchen/. With the current version we are training at the moment, we can sample 4 1024x1024 images in 4 seconds (using torch.compile). Training this kind of model is also very fast because we encode images much more aggressively in the spatial dimensions: a 3x512x512 image becomes 16x12x12 latents, i.e. a 42x spatial compression, whereas Stable Diffusion has an 8x compression (3x512x512 -> 4x64x64).
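The compression ratios above can be checked with a quick sketch; `spatial_compression` is a hypothetical helper, and the resolutions are taken directly from the comment:

```python
# Hypothetical helper: per-side downsampling factor from image to latent
# resolution. Numbers come from the figures quoted above.

def spatial_compression(image_side: int, latent_side: int) -> float:
    """Return the per-side compression factor image_side / latent_side."""
    return image_side / latent_side

# Würstchen: 3x512x512 image -> 16x12x12 latents
wuerstchen = spatial_compression(512, 12)   # ~42.7, quoted as ~42x

# Stable Diffusion: 3x512x512 image -> 4x64x64 latents
sd = spatial_compression(512, 64)           # 8.0
```

Note the trade-off: Würstchen's latents have far fewer spatial positions per image, which is what makes both training and sampling cheaper, at the cost of a deeper decoding stage.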

