Generated Visualization

Project Poster

3Dfusion

Deep Learning 

The focus of our project is to reimplement Stable Diffusion. We will reimplement a Variational Autoencoder (VAE), a UNet, and a pretrained CLIP text encoder within our Stable Diffusion model in TensorFlow. We aim to generate high-resolution images conditioned on a text prompt using language-model embeddings. This has many applications in high-quality content creation, such as ads, posters, and illustrations. Our challenge is to optimize this architecture for the limited computational resources at our disposal.

Our work is inspired by the paper "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding," which presents Imagen, a text-to-image diffusion model that generates images with an unprecedented degree of photorealism. We chose this paper to challenge ourselves to see how much of what it describes can be implemented on limited hardware, and we were interested in finding a dataset on which such an architecture could be trained successfully, given that diffusion models are typically data-demanding. We chose to reimplement Stable Diffusion rather than a plain pixel-space diffusion model because we find it more interesting to apply the diffusion process over a lower-dimensional latent space.
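To make the latent-space idea concrete, here is a minimal TensorFlow sketch of the forward (noising) process applied to VAE latents rather than pixels. The timestep count, noise schedule, latent shape (64×64×4, a common choice for 512×512 images), and the `q_sample` name are illustrative assumptions, not details from our actual implementation.

```python
import tensorflow as tf

# Assumed hyperparameters for illustration only.
T = 1000                                  # number of diffusion timesteps
betas = tf.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bars = tf.math.cumprod(alphas)      # cumulative product: \bar{alpha}_t

def q_sample(z0, t, noise):
    """Sample z_t ~ q(z_t | z_0) in the VAE latent space (closed form)."""
    a_bar = tf.gather(alpha_bars, t)              # shape [batch]
    a_bar = tf.reshape(a_bar, [-1, 1, 1, 1])      # broadcast over H, W, C
    return tf.sqrt(a_bar) * z0 + tf.sqrt(1.0 - a_bar) * noise

# Example: noise a batch of 64x64x4 latents (the shape a VAE encoder
# might produce for 512x512 RGB images) at random timesteps.
z0 = tf.random.normal([2, 64, 64, 4])
t = tf.random.uniform([2], maxval=T, dtype=tf.int32)
noise = tf.random.normal(tf.shape(z0))
zt = q_sample(z0, t, noise)
print(tuple(zt.shape))  # (2, 64, 64, 4)
```

During training, the UNet (conditioned on CLIP text embeddings) would be asked to predict `noise` from `zt` and `t`; the point of the latent formulation is that this happens on 64×64×4 tensors instead of 512×512×3 images, which is what makes training feasible on limited hardware.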
