The people who originated the CLIP guided diffusion approach (RiversHaveWings, around this time last year) are now working on Stable Diffusion, so it's somewhat arguable that DALL-E wasn't actually first (just first to make a user-friendly SaaS for it).
You need CLIP to have CLIP guided diffusion. So the current situation seems to trace back to OpenAI and the MIT-licensed code they released the day DALL-E was announced. I would love to be corrected if I've misunderstood the situation.
You're totally right, OpenAI released CLIP in January. But I mean, CLIP isn't an image generator, it's just a classifier. If we restrict the question to actual text-to-image generators (ignoring Deep Dream and other 'kinda cool but far from the coherency of post-2021 generators' attempts), then CLIP guided diffusion is kinda the first.