An overview and list of free resources to help you get started.

The easiest way to play with these models is to sign up for a Google Colab account.

There are a couple of ways to generate an image from text:

  1. Inference from a model trained for this task
  2. Guiding an image representation with CLIP

All the generated images below are from the prompt "Winds of Winter".

Model Inference

Model inference is quick and easy, so you can generate lots of images. Since generation is cheap, it's recommended to generate 100 or so images and then rank them with CLIP to pick out the best dozen.
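The ranking step can be sketched as follows. This is a minimal sketch that assumes you already have CLIP embeddings for the generated images and the prompt (in a real run, from CLIP's encode_image and encode_text); rank_by_clip_score and its shapes are illustrative, not from any particular notebook.

```python
import numpy as np

def rank_by_clip_score(image_embs: np.ndarray, text_emb: np.ndarray, top_k: int = 12):
    """Rank generated images by cosine similarity to the prompt embedding.

    image_embs: (n_images, dim) image features, e.g. from CLIP's encode_image.
    text_emb:   (dim,) text features, e.g. from CLIP's encode_text.
    Returns indices of the top_k best-matching images, best first.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    scores = img @ txt                  # cosine similarity per image
    return np.argsort(-scores)[:top_k]  # highest similarity first
```

The ranking itself really is this simple: normalize both sets of features, take dot products, and keep the highest-scoring generations.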


One of the best text-to-image models available right now.



A pretty good text-to-image model. Only the older CogView model has been released publicly; the newer model, which is faster and better, can only be tried at the online demo.


Online Demo:


They have trained two models, XL and XXL. Only XL, which has 1.3 billion parameters, has been released so far. Pretty amazing results, similar to CogView 2.

They also released Ru-CLIP.



DALL-E Mini



CLIP Conditioned Decision Transformer

Colab by Rivers Have Wings:

Guiding Image Representation with CLIP

This approach is more like model training than model inference, so, as you'd expect, it takes a while to generate an image.

The image representation can be as simple as an RGB array or a set of Bézier curves, or it can be the latent representation of various AEs/GANs such as VQGAN, BigGAN, StyleGAN, or OpenAI's dVAE.

RGB Optimization

This doesn't need much GPU VRAM; it works really well with just 4 GB. Good for getting high-resolution outputs.

PyramidVisions Colab
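The RGB-optimization loop can be sketched as below. Everything model-specific here is a placeholder: encoder stands in for CLIP's image encoder and target for the prompt's text embedding (a real run would use clip.load and encode_text instead). The sketch only shows the mechanics of treating the pixels themselves as the trainable parameters.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(3 * 32 * 32, 64)           # placeholder for CLIP's image encoder
for p in encoder.parameters():
    p.requires_grad_(False)                          # the encoder stays frozen

target = torch.randn(64)
target = target / target.norm()                      # placeholder for the prompt's text embedding

pixels = torch.rand(3, 32, 32, requires_grad=True)   # the image itself is the only parameter
opt = torch.optim.Adam([pixels], lr=0.05)

loss_history = []
for step in range(100):
    opt.zero_grad()
    feat = encoder(pixels.clamp(0, 1).flatten())
    feat = feat / feat.norm()
    loss = -(feat * target).sum()                    # maximise cosine similarity to the prompt
    loss_history.append(loss.item())
    loss.backward()
    opt.step()
```

Since the only trainable tensor is the image itself, memory use stays small, which is why this approach fits comfortably in 4 GB of VRAM.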

Vector Strokes Optimization

CLIPDraw by Kevin Frans



OpenAI dVAE Optimization

Colab of Implementation by Rivers Have Wings:


VQGAN Optimization

Rivers Have Wings coded up an implementation that connects VQGAN with CLIP. Almost every other VQGAN+CLIP notebook is derived from it. Most derivatives differ in their learning-rate strategies, image augmentations, and optimizers, and generally get much better results.


Implementation by Rivers Have Wings:

Implementation by crimeacs#8222:
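The VQGAN+CLIP loop has the same shape as pixel optimization, except that a frozen decoder sits between the trainable tensor and the CLIP loss: you optimize a latent code, decode it to an image each step, and score that image. Here decoder and encoder are tiny placeholders for the VQGAN decoder and CLIP image encoder, so this is a structural sketch only, not any specific notebook's code.

```python
import torch

torch.manual_seed(0)
decoder = torch.nn.Sequential(                      # placeholder for the VQGAN decoder
    torch.nn.Linear(16, 3 * 16 * 16),
    torch.nn.Sigmoid(),                             # keep decoded "pixels" in [0, 1]
)
encoder = torch.nn.Linear(3 * 16 * 16, 32)          # placeholder for CLIP's image encoder
for p in list(decoder.parameters()) + list(encoder.parameters()):
    p.requires_grad_(False)                         # both networks stay frozen

target = torch.randn(32)
target = target / target.norm()                     # placeholder for the prompt's text embedding

z = torch.randn(16, requires_grad=True)             # only the latent code is trained
opt = torch.optim.Adam([z], lr=0.1)

losses = []
for step in range(100):
    opt.zero_grad()
    image = decoder(z)                              # decode the latent to an image
    feat = encoder(image)
    feat = feat / feat.norm()
    loss = -(feat * target).sum()                   # maximise cosine similarity to the prompt
    losses.append(loss.item())
    loss.backward()
    opt.step()
```

The decoder acts as a learned image prior: gradients flow through it back to the latent, so the optimization can only ever produce images the GAN knows how to decode.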

Diffusion Models

Much better global coherence compared to CLIP-guided GANs, though generations may not follow the prompt as well as CLIP-guided GANs do.

Colab Implementation by nshepperd:

We have an app that provides an easy way to play with text-to-image algorithms. If you would like to give it a go, sign up at and send us your sign-up email at. We'll add a couple of free credits to your account so you can try it out.