From Sections 3 and 4 of the VQGAN paper [1], upon which this work is built: "To generate images in the megapixel regime, we ... have to work patch-wise and crop images to restrict the length of [the quantized encoding vector] s to a maximally feasible size during training. To sample images, we then use the transformer in a sliding-window manner as illustrated in Fig. 3." ... "The sliding window approach introduced in Sec. 3.2 enables image synthesis beyond a resolution of 256×256 pixels."
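A minimal sketch of that sliding-window idea, assuming a hypothetical `transformer(ctx)` that returns logits of shape (1, window*window, vocab) for a flattened window of quantized indices (this flattens away the paper's exact autoregressive bookkeeping, but shows why the crop keeps the sequence length fixed):

```python
import torch

def sample_sliding_window(transformer, h, w, window=16):
    tokens = torch.zeros(h, w, dtype=torch.long)
    for i in range(h):
        for j in range(w):
            # Context is always a fixed window x window crop around (i, j),
            # so the sequence length stays at what the model was trained on.
            i0 = max(0, min(i - window // 2, h - window))
            j0 = max(0, min(j - window // 2, w - window))
            ctx = tokens[i0:i0 + window, j0:j0 + window].reshape(1, -1)
            pos = (i - i0) * window + (j - j0)
            probs = transformer(ctx)[0, pos].softmax(-1)
            tokens[i, j] = torch.multinomial(probs, 1)
    return tokens
```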
From the Paella paper [2]: "Our proposal builds on the two-stage paradigm introduced by Esser et al. and consists of a Vector-quantized Generative Adversarial Network (VQGAN) for projecting the high dimensional images into a lower-dimensional latent space... [w]e use a pretrained VQGAN with an f=4 compression and a base resolution of 256×256×3, mapping the image to a latent resolution of 64×64 indices." Then, describing their token predictor architecture: "Our architecture consists of a U-Net-style encoder-decoder structure based on residual blocks, employing convolutional [sic] and attention in both, the encoder and decoder pathways."
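The f=4 figure is just the spatial downsampling factor: 256/4 = 64 in each dimension. A toy stand-in (illustrative only, not the actual VQGAN Paella uses) makes the shape arithmetic concrete:

```python
import torch
import torch.nn as nn

# Two stride-2 convs halve the spatial size twice (256 -> 128 -> 64), so a
# 256x256x3 image maps to a 64x64 grid: one codebook index per 4x4 patch.
f4_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # 256 -> 128
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 128 -> 64
)
z = f4_encoder(torch.randn(1, 3, 256, 256))
print(z.shape)  # torch.Size([1, 128, 64, 64])
```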
U-Net, of course, is a convolutional neural network architecture [3]. The "down" and "up" encoder/decoder blocks in the Paella code are batch-normed CNN layers [4].
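For readers who don't want to open the repo, a minimal sketch of what batch-normed conv down/up blocks of this kind look like; the names and exact layer choices here are illustrative, not copied from Paella's modules.py:

```python
import torch.nn as nn

def down_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),  # halve H, W
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )

def up_block(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),  # double H, W
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )
```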
The transformer model used in the VQGAN paper has nothing to do with the autoencoder; it is used to predict the quantized tokens. So no, you don't need to slide a window with the U-Net: because it is convolutional, you can directly predict images at different aspect ratios and resolutions, as with SD.
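That resolution flexibility falls out of the architecture being fully convolutional: the same weights accept any latent grid size, so there is nothing to slide. A toy demonstration (illustrative network, not either paper's model):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv2d(16, 8, kernel_size=3, padding=1),
)
print(net(torch.randn(1, 8, 64, 64)).shape)  # 64x64 latent  (256x256 image at f=4)
print(net(torch.randn(1, 8, 96, 64)).shape)  # 96x64 latent  (384x256 image at f=4)
```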
[1] https://arxiv.org/pdf/2012.09841.pdf
[2] https://arxiv.org/pdf/2211.07292.pdf
[3] https://arxiv.org/abs/1505.04597
[4] https://github.com/dome272/Paella/blob/main/src/modules.py#L...