The big breakthrough behind the new model lies in how the images are generated. The first version of DALL-E used an extension of the technology behind OpenAI’s language model GPT-3, generating images by predicting the next pixel in an image as if pixels were words in a sentence. This worked, but not well. “It’s not a magical experience,” Altman said. “It’s amazing that it works at all.”
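To make that "pixels as words" idea concrete, here is a minimal sketch of autoregressive image generation. The `model` here is a hypothetical stand-in for a trained transformer, not OpenAI's actual code, and the real first DALL-E predicted discrete image tokens rather than raw pixel values; the loop structure is the point.

```python
# Sketch of the autoregressive approach: predict the image one token at a
# time, the way a language model predicts the next word. `model` is a
# placeholder for a trained network that returns next-token scores.
import torch

def generate_image_tokens(model, prompt_tokens, num_image_tokens=1024):
    sequence = list(prompt_tokens)               # start from the text prompt
    for _ in range(num_image_tokens):            # append one image token at a time
        logits = model(torch.tensor([sequence]))[0, -1]        # scores for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1).item()        # sample the next token
        sequence.append(next_token)
    return sequence[len(prompt_tokens):]         # image tokens, to be decoded into pixels
```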
DALL-E 2, by contrast, uses something called a diffusion model. Diffusion models are neural networks trained to clean up images by removing noise that was added to them during training. The process involves taking an image and changing a few pixels at a time, over many steps, until the original image is erased and you’re left with nothing but random pixels. “If you do this a thousand times, the final image looks like you pulled the antenna cable out of the TV; it’s just snow,” says Björn Ommer, who works on generative artificial intelligence at the University of Munich in Germany and helped build the diffusion model that now underpins Stable Diffusion.
The neural network is then trained to reverse that process and predict what a less noisy version of a given image would look like. The upshot is that if you give a diffusion model a mess of pixels, it tries to generate something cleaner. Plug the cleaned-up image back in and the model will produce something cleaner still. Do this enough times and the model can take you from TV snow to a high-resolution picture.
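The two halves of that description fit in a few lines of code. The sketch below is illustrative only: `denoiser` stands in for a trained network that predicts a slightly cleaner image from a noisy one, and the step counts and noise scale are made-up values, not the actual DALL-E 2 or Stable Diffusion settings.

```python
# Minimal sketch of the diffusion idea: corrupt an image into "TV snow"
# during training, then generate by repeatedly asking a trained network
# for a slightly cleaner image, starting from pure noise.
import torch

def add_noise(image, num_steps=1000, noise_scale=0.02):
    """Forward process: degrade an image step by step until it is just noise."""
    noisy = image.clone()
    for _ in range(num_steps):
        noisy = noisy + noise_scale * torch.randn_like(noisy)
    return noisy

def generate(denoiser, shape=(3, 64, 64), num_steps=1000):
    """Reverse process: start from random pixels and repeatedly clean them up."""
    x = torch.randn(shape)          # the "snow" starting point
    for _ in range(num_steps):
        x = denoiser(x)             # each pass yields a slightly cleaner image
    return x
```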
AI art generators will never work exactly the way you want them to. They often produce horrible results that, at best, resemble distorted artwork. In my experience, the only way to really make a composition look good is to add a descriptor at the end specifying a style that looks good.
~ Eric Carter
The trick with text-to-image models is that this process is guided by a language model that tries to match the prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match.
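One way to picture that guidance is as a nudge applied at every denoising step. The sketch below is an assumption-laden illustration, not the internals of any particular model: `clip_model` stands in for a text-image matching network, `denoiser` for the diffusion network, and `guidance_weight` is an arbitrary example value.

```python
# Sketch of guided denoising: at each step, score how well the current
# image matches the prompt, then push the denoised image in the direction
# that improves that score.
import torch

def guided_step(x, denoiser, clip_model, text_embedding, guidance_weight=0.1):
    x = x.detach().requires_grad_(True)
    image_embedding = clip_model.encode_image(x)          # stand-in image encoder
    # How well does the current image match the prompt, according to the
    # text-image model?
    match_score = torch.cosine_similarity(image_embedding, text_embedding, dim=-1).sum()
    match_score.backward()                                # gradient of the match w.r.t. pixels
    with torch.no_grad():
        # One denoising step, nudged toward a better match with the prompt.
        x = denoiser(x) + guidance_weight * x.grad
    return x.detach()
```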
But these models don’t pull connections between text and images out of thin air. Most text-to-image models today are trained on a large dataset called LAION, which contains billions of pairs of text and images collected from the internet. This means that the image you get from a text-to-image model is a distillation of the world presented online, distorted by prejudice (and pornography).
One last thing: there is a small but important difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, on the other hand, uses a technique called latent diffusion, invented by Ommer and his colleagues. It works on a compressed version of the image encoded within the neural network in what’s known as a latent space, where only the essential features of the image are preserved.
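The practical consequence is that the expensive denoising loop runs on a much smaller array of numbers, and a decoder only expands the result into a full-size image at the very end. The sketch below assumes generic `denoiser` and `decoder` networks and an illustrative latent shape; it is not Stable Diffusion’s actual architecture or API.

```python
# Sketch of latent diffusion: denoise in a small compressed latent space,
# then decode the clean latent back into a full-size image.
import torch

def generate_latent(denoiser, decoder, latent_shape=(4, 64, 64), num_steps=50):
    z = torch.randn(latent_shape)       # noise in the compressed latent space
    for _ in range(num_steps):
        z = denoiser(z)                 # each step is cheap: the latent is small
    return decoder(z)                   # expand the clean latent into a full image
```

A full-size 512×512 color image has roughly 800,000 pixel values, while a compressed latent like the one above has around 16,000, which is a large part of why the loop is so much cheaper to run.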
This means that Stable Diffusion requires less computing power to work. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. The explosion of creativity and the rapid development of new applications is due in large part to the fact that Stable Diffusion is open source (programmers are free to change it, build on it, and make money from it) and lightweight enough for people to run at home.
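For a sense of what "running it at home" looks like in practice, one common route is Hugging Face’s open-source diffusers library. The model ID and prompt below are just examples, and exact hardware requirements vary; this is a sketch of typical usage rather than an official recipe.

```python
# Example of running Stable Diffusion locally with the diffusers library.
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example model ID; other checkpoints exist
    torch_dtype=torch.float16,          # half precision to fit on a consumer GPU
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```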
Redefining creativity
For some, these models are a step toward general artificial intelligence, or AGI — an overhyped buzzword referring to future artificial intelligence with general-purpose or even human-like abilities. OpenAI has always been clear about its goal of achieving AGI. For this reason, Altman doesn’t care that DALL-E 2 now competes with a plethora of similar tools, some of which are free. “We’re here to make AGI, not image generators,” he said. “It will fit into a broader product roadmap. It’s a small element of what AGI is going to do.”