On January 5, 2021, OpenAI revealed DALL-E. Frankly, it blew everyone's minds. DALL-E was a modified version of GPT-3. GPT, as you might know, is a large language model (LLM), and it generates text. But DALL-E took an input prompt and generated an image out of it. No, not searched for images across the internet, but generated a brand-new image on the fly. The demand was great—everyone had to try it. There was a wait list. It took a while to generate an image and the images weren't that great, to be honest. Fast-forward barely three years later, and you have many such text-to-image models. DALL-E has undergone a few versions already. And now there's Midjourney, Stable Diffusion, Artbreeder, Deep Dream Generator, Prisma, Craiyon, starryAI, and many more. The images have gotten so good that they're frequently indistinguishable from real images. This has created some strange problems; the phrase “seeing is believing” is no longer true.
In fact, these models have gotten so good and so accessible that you can run them offline, on a commercial off-the-shelf computer. I'm writing this article on a shiny new M4 Max, and I've tested the code on an M1 Max. It's quite realistic to generate images using text prompts with commonly available models on your laptop. By the end of this article, you'll be able to do so on your machine as well.
Pardon me, but I can barely contain my excitement. Can you imagine what this means? My thoughts turn to reality on-the-fly and I'll be the best meme maker in town. All right, more than just that. When building presentations, I could just fire up my local AI model, type in a text prompt, and have it generate a nice image for me. Why doesn't PowerPoint have such a feature? Maybe it does—if not, I'm sure that, at some point, it will.
Of course, this raises eyebrows and numerous ethical concerns. How do the artists whose images were used to create these models get paid? What implication does this have on what is trustable and what is not? Can a quick tweet of an AI-generated image create a news headline and start a war? Can images that look absolutely real be used to change the course of wars? Can you turn stills into a video, appear in a fake webcam stream, and appear as someone you're not? What does this mean for social engineering? What does this mean for human relationships? Will we all have AI-generated boyfriends/girlfriends, or just friends backed by an LLM, glued to our VR headsets?
Sadly, all those are absolutely real concerns, concerns that have already manifested themselves in many ways. And, as local hardware has gotten so powerful, all this is well within the reach of anyone with a credit card. Ouch!
I feel the best solution here would be a blockchain plus certificate-based distributed image verification solution, so at least we'd know if an image is trustable. Trust me, looking at an image and trusting that it's real because it's a photo is no longer enough. Don't believe me? Visit ThisPersonDoesNotExist.com and let me know what you think.
If I had enough video or picture data of you, could I generate a model of you, and basically generate any image of you, pretty much indistinguishable from reality? Sadly yes!
All right, it's best if I focus on the tech aspects of this, so back to that. In this article, I'm going to show you a few examples of applications that take in a text prompt and generate an image for you. Although I'll depend on open source or openly available models, the actual execution will be local, on my Mac. The code should be able to run with no internet connection, entirely on your laptop, as long as you have some beefy hardware. Either a higher-end MacBook Pro, or a Windows or Linux machine with a high-end Nvidia card, will do.
Why Offline?
You might be wondering: Why bother doing this offline? There are models, such as Midjourney, that are available as a service for as low as $10 a month. Why not just use them?
There are some very good reasons to do this offline. First, I'm cheap. I don't want to pay $10 a month. If you do this seriously at a commercial scale, it'll cost you more than that. But either way, being cheap is a pretty good reason.
Second, security and privacy are important. Your data stays local, so you don't have to upload any sensitive information anywhere. What you generate is your business. An offshoot of text-to-image generation is image-to-image generation, where you can upload an image that you have and use that as an inspiration to generate more images. You wouldn't want to upload your images to some random service whose data residency requirements don't agree with your terms, right? You do read those terms, right?
Third, the ability to run this offline means you can run this anywhere. After all, the ping times between Earth and Mars are anywhere between eight and 16 minutes. I wouldn't want to wait eight minutes for an image to generate only to realize that I want to tweak it, generate another image, and wait another eight minutes. Jokes aside, image generation is still compute-intensive, given today's hardware. Typically, when we're trying to get the perfect image, we experiment with many parameters, prompts, and refiners to get the image just right. It takes iterations, and we may want to move fast with a lower resolution picture and then increase the resolution as we get closer to the result we want. Any serious image generation shop will need something like this.
The code I'll show you runs on my local M4 Max, and each image is generated within seconds. You could easily chain up a bunch of M4 Pros, for example, or similar Nvidia-based hardware in a cheap server farm, and accelerate this process for real-world applications. Imagine if you had a creative team on staff. They could issue prompts and generate images on the fly, supporting a team of five to ten people, on a server farm costing maybe USD 10K, with very low power consumption. This is the realistic value you can unlock with today's hardware, and I have no doubt it'll only get better.
The fourth reason is creative freedom. Many online image-generation AI models have been accused of bias. I won't point out specific examples, but let's just agree that bias exists. So does censorship. What if I want to generate images that suit my needs and my taste without some big company (or even government) telling me what I can or cannot do? All right, please don't get me (or yourself) in trouble here: If you generate problematic or misleading content, it still isn't cool, and you'll probably face the consequences for it. But still, let's say you were directing a film based on World War II, and you needed to generate an image of Germany during 1942. Obviously, such an image would have questionable artifacts that many online solutions will block, for good reason. But for your use case, you should be able to generate something like this offline, right? As long as it's put to a moral use. Unfortunately, there are plenty of NSFW models easily available as well, but we'll keep it work friendly, all right?
The fifth reason is commercial applicability. Many online-generated images are not quite what you're looking for. Either their image styles are too generic, or they have their own personality and therefore aren't unique enough for your brand, or maybe they're watermarked, or come with restrictions, or they're not of sufficient resolution for the specific task you're attempting. What if you wanted to print a poster, and you needed a 6000px by 6000px AI-generated image?
Okay, let's think of something crazy. What if you were directing a film set in the year 48288 AD, and you wanted to generate a futuristic landscape projected on curved, super-high-resolution screens? Imagine the cost savings of being able to generate an AI image and project it on a screen that surrounds the actors. The actors wouldn't even need to pretend or act. And it'll save so much editing time. No more green screens, no more expensive sets, but you will need super-high-resolution images.
Finally, you may want to fine-tune an image generation model with your own images. Imagine you're a Hollywood producer, and for some exceptionally risky scenes, you don't want to risk even a stunt man. Could you create an AI version of an extremely overpaid Hollywood actor, and have the AI version do all the dangerous stuff? This requires you to create an AI model specific to that actor. What would that mean for news readers? See this video x.com/andrewrsorkin/status/1856849559756181904. Let me know what you think.
Frankly, why do we even need actors at that point?
There are many other such reasons. But let's just say that enough reasons exist to try this. So let's try it.
Text-to-Image Models
The specific kinds of models you'll use for this article are text-to-image models. As the name suggests, you give it an input text prompt and it'll generate an image for you. In contrast, there are image-to-image models as well, where you can start by using an existing image and making tweaks to it. For instance, you could generate a cartoon version of my mugshot.
Let's understand a few basic concepts. First, what is a diffusion model? A diffusion model is a type of generative model that represents data as a Markov chain that progressively adds noise to the input data until it becomes a random sample from a known distribution (e.g., Gaussian noise). The model then learns to reverse this diffusion process to generate new samples that are similar to the original data. So, between forward diffusion, denoising, and reverse diffusion, it's able to help you do things like image generation, image-to-image translation, or data manipulation, such as filling missing details into a picture. Text-to-image models, like latent diffusion models, combine a language model (a text encoder) that transforms the input text into a latent representation with a generative image model that then produces an image conditioned on that representation.
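To make the forward-diffusion idea a little more concrete, here's a toy sketch in Python. It's not the schedule any real model uses; it just shows an image tensor being blended with more and more Gaussian noise until nothing but noise remains, which is exactly the process a trained model learns to reverse.

import torch

# Toy forward diffusion: blend the image with Gaussian noise using a
# crude linear schedule. Real models use carefully tuned schedules and
# learn the reverse (denoising) direction.
def add_noise(image: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    noise = torch.randn_like(image)       # random Gaussian noise
    alpha = 1.0 - step / total_steps      # fraction of signal that survives
    return alpha ** 0.5 * image + (1 - alpha) ** 0.5 * noise

image = torch.rand(3, 512, 512)           # stand-in for a real image
for step in range(1, 11):
    image = add_noise(image, step, total_steps=10)
# After the last step, "image" is essentially pure noise; generation
# starts from noise like this and runs the process in reverse.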
As it turns out, there are many AI models that can run offline, such as Stable Diffusion, DALL-E Mini (Craiyon), a limited version of Midjourney, etc. Some of these models are available as applications, and sometimes those apps work only on mobile devices. I wanted to pick a model that gave me the most freedom. My requirements were that I wanted the most flexibility in picking the image generation model. Maybe even extend it with more models. I wanted the ability to fine-tune the model with my images. I wanted to use a model that could generate incredibly high-resolution images, and I wanted to do this and much more through custom code.
What I really wanted was my own web application where I can input a text prompt and generate an image within my corporate network (or, for now, my home network). Let's see how far we can take this.
After some homework, I decided to go with Stable Diffusion for this article. I'm not saying this is the best possible model—there were so many other choices at https://huggingface.co/models?pipeline_tag=text-to-image&sort=trending. But after some high-level research, it seems that Stable Diffusion comes in various sizes so I could size it to my hardware. I don't have a server farm, just one laptop, so I need to be able to run this low for dev purposes and scale it high for production. I wanted a model that has decent support and community interest, that performs reasonably fast on reasonable hardware, and that allows me to tweak input enough to generate exactly what I wish. Most of all, I wanted it to be free, which Stable Diffusion seems to be.
Stable Diffusion
Stable Diffusion is a deep learning text-to-image model. It's built by a company called Stability AI. It can be used for text-to-image applications, but also for other scenarios, such as image-to-image, inpainting, outpainting, etc. Stable Diffusion was trained on an open-source dataset of images at www.laion.ai. Given that the input dataset had relatively lower resolution images of 512x512, Stable Diffusion excels at generating images around that resolution. In real-world applications, you can use Stable Diffusion to generate such images, and once you're happy with the image, you can use one of many upscaling models to get a higher-resolution image if you wish. Although I won't cover upscaling in this article, you're welcome to check out https://openmodeldb.info/models/4x-Nomos8kDAT as one of my favorite upscaling models. As subsequent versions of Stable Diffusion have rolled out, Stable Diffusion can generate good images at 1024x1024.
Stable Diffusion may also suffer from not being able to generate certain details accurately. Notably, it may inaccurately generate limbs. You can avoid this, to a great extent, by using the right prompts. But it's also not uncommon to use refiner models to get further details on certain aspects of the image. For instance, you can use dedicated models to refine skin to make it look more realistic, or face features, or limbs, or really anything.
A Simple Image Generation Application
Okay, enough talk. Let's write some code. The model I decided to go with for my local application can be found at https://huggingface.co/stabilityai/stable-diffusion-3-medium. To use this model, you'll have to create an account at Hugging Face, agree to the terms of usage, and share your contact information. I'm no lawyer, but it seems like this model is free for research, non-commercial, or even commercial usage, as long as your revenue is less than $1M per year. I think I'm pretty safe there. You can read their license here https://stability.ai/license. If your revenue is > $1M, well, congratulations. You can reach out to Stability AI for a commercial enterprise license.
I intend to use Stable Diffusion via Hugging Face's Diffusers library. Diffusers is an open-source Python library that provides ready-made pipelines for the diffusion models I mentioned earlier, along with the building blocks behind them (schedulers, UNets, VAEs, and so on), so you can load a pre-trained model and generate images with just a few lines of code.
To get started, set up a Python project with a virtual environment targeting Python 3.x. Go ahead and create a requirements.txt with the following contents.
torch
diffusers
transformers
protobuf
sentencepiece
accelerate
Then, in your terminal, with the virtual environment activated, run the following command to install the necessary packages.
pip install -r requirements.txt
Next, create a file called index.py
with the code shown in Listing 1. Let's understand this code a bit better.
Listing 1: The simple image generation code
import torch
from diffusers import StableDiffusion3Pipeline

# Load the Stable Diffusion 3 Medium model in half precision
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
)
# "mps" targets Apple Silicon; use "cuda" on an Nvidia GPU
pipe.to("mps")

image = pipe(
    prompt="A cartoon image of yellow dog sitting on a chair, drinking "
           "coffee with fire all around him, saying this is fine",
    negative_prompt="",
    num_inference_steps=28,
    height=512,
    width=512,
    guidance_scale=7.0,
).images[0]
image.save("output.png")
The first thing you do is create a diffusion pipeline using StableDiffusion3Pipeline.from_pretrained. A diffusion pipeline is a series of processing steps used in diffusion models to progressively refine and generate data, such as images. You're creating one from a pre-trained model here. Could you use other pre-trained models? More on that later.
Next, use the pipeline to generate an image. When you generate an image, you can pass in several parameters, many of them optional. Here's where you have to play around with the right prompts and parameters to get the right result. At the very least, you need to give it a prompt, i.e., what you're asking it to generate an image of. You can also pass in a negative prompt, which tells the model to avoid generating certain kinds of content. For example, a negative prompt could be “no gore,” to avoid gory images. Then you have num_inference_steps, which is the number of denoising steps. More denoising steps usually mean a better-quality image at the expense of slower inference. Height and width are pretty obvious parameters. And finally, there's guidance_scale. A higher guidance scale encourages the model to generate images that are more closely linked to the text prompt, usually at the expense of lower image quality.
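To make that concrete, here's a variation of the call from Listing 1 that adds a negative prompt and a fixed seed so the result is reproducible. The prompt text, seed, and parameter values here are just examples I picked; adjust them to taste.

# A variation of the Listing 1 call: negative prompt plus a fixed seed.
# The prompt text and values below are illustrative examples.
image = pipe(
    prompt="A watercolor painting of a lighthouse at dawn",
    negative_prompt="blurry, deformed hands, extra limbs, text, watermark",
    num_inference_steps=40,   # more steps: usually better quality, slower
    guidance_scale=9.0,       # higher: follows the prompt more closely
    height=512,
    width=512,
    generator=torch.Generator().manual_seed(42),  # reproducible output
).images[0]
image.save("output_tuned.png")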
Finally, to get debugging to work in VSCode, create a .vscode
folder and create a launch.json
file with the code shown in Listing 2.
Listing 2: The launch.json for debugging support
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Current File",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}
That's basically it. Now hit F5, and, in almost no time, you should see an image generated in the output.png file.
The prompt I used was:
A cartoon image of yellow dog sitting on a chair, drinking coffee with fire all around him, saying “This is fine.”
The image I was able to generate can be seen in Figure 1.

I LOL'ed big time when I saw that picture. I mean, look at how the dog is holding the cup. Okay, AI did its best. Seriously, how else would a dog hold a cup? That's a bit like how a dog would wear pants right? All four legs or only rear paws? Also, dogs don't drink coffee—coffee is poisonous for dogs.
Let's just agree that this code is impressive, but also, that this is not so fine and it needs improvement. Can our very simple model generate somewhat realistic pictures?
Change the prompt to:
Smiling woman in turquoise silk curtains. Fall sunlight. Professional photography.
I got the image you can see in Figure 2.

My first reaction: ehh! Too bad it's just AI!
Yeah, the picture looks a bit more photo-realistic, and I can sort of tell it's a person. But what's up with her left arm? The face looks a bit deformed. The skin looks artificially smooth and those curtains look haphazardly arranged. It's impressive for a computer program to generate this, but I'm not fooling anyone with this: It's generated by AI. It almost looks like a poor Photoshop job. You can keep playing with the prompts, inference steps, guidance, etc., but I wasn't able to produce an image that I could show someone without them being able to tell it was AI. In fact, try generating pictures of celebrities, like Satya Nadella or Taylor Swift. It generates a picture that kind of looks similar to them, but it's clearly not a picture of the person in question. Maybe that's by design. After all, we just went through an election in the U.S., and there were plenty of AI-generated pictures floating around. Thankfully, for most of them, you could tell they were just AI.
I wanted to explore this a bit further, so I headed over to https://civitai.com.
Civitai is a website where users can share models, images, videos, articles, and generally everything related to image and video generation. There are many models to pick from and many of them are uncensored. Contributors share their work and frequently explain how they arrived at a certain result.
I could take an existing model and fine-tune it, but a lot of that work has already been done for me in the various models available on Civitai. As you explore Civitai, you'll find models that are appropriate for various purposes. Some are trained on a particular celebrity. Some are great for generating scenery. My goal was to generate realistic images, so this AI model (of two people) caught my eye: https://civitai.com/models/4201/realistic-vision-v60-b1. The same models are also available on Hugging Face if you prefer.
Although you can download the model directly from Civitai, in order to use it with a diffusion pipeline, it's going to need some prep work.
First, go ahead and download the model of your choice. In my case, I downloaded Realistic Vision v5.1 Hyper (VAE). The downloaded model is a safetensors file. You need to convert it to a format that can be used with diffusers. Thankfully, Hugging Face provides a script to do that at https://raw.githubusercontent.com/huggingface/diffusers/v0.20.0/scripts/convert_original_stable_diffusion_to_diffusers.py. Download this file and place it in a folder. This file depends on the same Python packages that your project currently does, so I suggest that you create a folder called models in your same project and place this file there.
To convert the safetensors
file to diffusers, run the following command:
python convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path realisticVisionV60B1_v51HyperVAE.safetensors \
    --dump_path realistic/ \
    --from_safetensors
This will take only a few moments to process. Verify that a folder, as shown in Figure 3, now appears in your project under the models folder.

Great. Now you're ready to consume this model using a diffusion pipeline.
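As an aside, recent versions of the diffusers library can also load a .safetensors checkpoint directly via from_single_file, which may let you skip the conversion script entirely. Here's a minimal sketch, assuming the downloaded file sits in your models folder (check that your diffusers version supports this):

import torch
from diffusers import StableDiffusionPipeline

# Load the downloaded checkpoint directly, without converting it first
pipe = StableDiffusionPipeline.from_single_file(
    "models/realisticVisionV60B1_v51HyperVAE.safetensors",
    torch_dtype=torch.float16
)
pipe.to("mps")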
Modify the code as shown in Listing 3. Go ahead and run the code. The prompt is the same as before:
Listing 3: Use the realistic model in a diffusion pipeline
import torch
import safetensors
import transformers
import diffusers

# Point to the locally converted Realistic Vision model
model_path = "models/realistic"
pipe = diffusers.DiffusionPipeline.from_pretrained(
    model_path, torch_dtype=torch.float16, safety_checker=None,
    use_safetensors=False
)
pipe.to("mps")

image = pipe(
    prompt="Smiling woman in turquoise silk curtains. Fall sunlight. "
           "Professional photography",
    num_inference_steps=30,
    height=768,
    width=512,
    guidance_scale=1.5,
    # Seed via a generator so the result is reproducible
    generator=torch.Generator().manual_seed(1876016),
).images[0]
image.save("output1.png")
Smiling woman in turquoise silk curtains. Fall sunlight. Professional photography.
Now hold your breath. Your image, although similar, may not be identical. The image I was able to generate can be seen in Figure 4.

My first reaction: WOW! Too bad it's just AI!
I mean seriously. Does anyone have her number? She doesn't have a ring either. I am frankly blown away at the quality of this image. But even this can get better!
You can choose to upscale this image to a higher resolution. This article is getting a bit long, but you can use an upscaler like https://github.com/zhengchen1999/DAT to upscale the generated image. I went ahead and upscaled it to 8x, and although the image was still pretty good, I could tell that there was some artifacting going on. Figure 5 shows it upscaled to 8x and a crop of just her lips.

As you can see in Figure 5, the image is still extremely good, but let's be honest, nobody has lips that smooth, especially not in winter. Here you can use extensions for Stable Diffusion, like ADetailer (https://github.com/Bing-su/adetailer), to further work on features such as hands and eyes and so on. With that, you can take a generated image, detect features, and enhance portions of it to be way more realistic.
Remember that earlier I said I was unable to generate photorealistic pictures of celebrities? Maybe that's by design. After all, I don't want to get sued for likeness reasons. But if you're following along with my article and have the code working, go ahead and try generating a picture of Taylor Swift. How about Donald Trump? How about Joe Biden? How about Einstein? Now do you believe me? Those pictures look completely indistinguishable from reality. Actually, let me generate a picture of someone who (hopefully) won't mind.
The prompt I used was:
Abraham Lincoln riding a motorbike.
The image I was able to generate can be seen in Figure 6.

Jaw drop! I had no idea our ex-presidents were such daredevils. But even that image looks like a painting, doesn't it? The issue is that the available pictures of Abraham Lincoln in the input data were all similar in their details. You could fine-tune the model and get better results, of course.
Now, I'm not going to push my luck by generating a picture of a modern celebrity who has plenty of pictures available, someone like, say, Taylor Swift, Donald Trump, or Kamala Harris. But believe me, the generated pictures are frequently indistinguishable from real ones. And for subjects who don't have plenty of pictures available, you can fine-tune with an input set of images.
Putting It All Together
Now that you have the basic code working, can you put this together into a working website where a user can give a prompt and other input parameters, and the server crunches away to generate an image and show it to the user?
Let's do it!
In the project you were working on, edit the requirements.txt
and add the following packages:
flask
flask_executor
In this same project, create a folder called templates (Flask looks for this name by default) and drop an index.html file in there. Also create a folder called static, where generated images will go.
The intention is to allow the user to enter an input prompt and click Generate Image, the server does its AI magic, and when an image is ready, it's shown to the user. Of course, you can make this as compelling as you wish, but to keep things to the point, I won't focus on things like a beautiful user interface, etc.
Let's get started with the index.html
first. The index.html
is going to leverage jQuery, and it will have a very simple user interface: a text box to receive the user's prompt and a button to start the process of image generation on the server. Then, the code will poll the server every five seconds to see if the image is ready or not. As soon as the image is ready, it will be shown to the user. The full code can be seen in Listing 4. Note that I've taken some shortcuts besides the ugly UX. I'm not validating inputs, I'm not allowing the user to specify height/width, etc. But this isn't production code; I'm trying to stay to the point and address things pertinent to the task.
Listing 4: The index.html
<!DOCTYPE html>
<html>
<head>
    <title>Offline AI Image generation</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
</head>
<body>
    <h1>Offline AI Image generation</h1>
    Enter prompt:
    <input type="text" value="Natural scenery" id="generation-prompt" />
    <button id="start-generation">Generate Image</button>
    <div id="generation-status"></div>
    <img id="generation-image" src="" />
    <script>
        $(document).ready(function () {
            $('#start-generation').click(function () {
                // Start the generation
                $('#generation-status').text("generating image");
                $.ajax({
                    type: 'POST',
                    url: '/start_generation',
                    contentType: 'application/json',
                    data: JSON.stringify({
                        'generation-prompt': $('#generation-prompt').val()
                    }),
                    success: function (data) {
                        $('#generation-status').text(data.message);
                        // Check every 5 seconds
                        let intervalId = setInterval(function () {
                            $.ajax({
                                type: 'GET',
                                url: '/check_generation',
                                success: function (data) {
                                    if (data.message === 'generation complete') {
                                        // Display the image
                                        $('#generation-image').attr('src', data.image);
                                        clearInterval(intervalId);
                                    } else {
                                        $('#generation-status').text(data.message);
                                    }
                                }
                            });
                        }, 5000);
                    }
                });
            });
        });
    </script>
</body>
</html>
Now let's focus on the server-side code, where you need to turn the console application into a Flask application. The idea is that it will serve index.html. It will serve an image.png
from the static folder when an image is ready. And it will expose three APIs/routes.
The first route will be start_generation, which is callable over POST, as you can see in Listing 4. This accepts an input parameter that is the prompt.
The second route will be check_generation, which the client will call every five seconds to see if the image is ready.
The third route will be /image, where the generated image will be served.
Let's get started.
First, I separated out my image-generation code and parameterized it, as can be seen in Listing 5. This is almost exactly the same code as you've already seen in Listing 3. The only difference is that the prompt is parameterized. You can make this more flexible by allowing more input parameters, and perhaps even letting the user pick more than one model.
Listing 5: Image generation code
def generate_image(generation_prompt):
    # Load the locally converted model
    model_path = "models/realistic"
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path, torch_dtype=torch.float16, safety_checker=None,
        use_safetensors=False
    )
    pipe.to("mps")
    image = pipe(
        prompt=generation_prompt,
        negative_prompt="",
        num_inference_steps=30,
        height=768,
        width=512,
        guidance_scale=1.5,
        generator=torch.Generator().manual_seed(1876016),
    ).images[0]
    image.save("static/image.png")
    # Return the filename
    return 'static/image.png'
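By the way, if you do want to expose more knobs than just the prompt, a sketch like the following would work. The extra parameters, their defaults, and the little model registry are my own assumptions for illustration, not anything Listing 5 requires.

# Illustrative only: accept more knobs and a model choice from the caller.
MODELS = {"realistic": "models/realistic"}   # add more converted models here

def generate_image_ex(generation_prompt, model="realistic",
                      steps=30, height=768, width=512, guidance=1.5):
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        MODELS[model], torch_dtype=torch.float16,
        safety_checker=None, use_safetensors=False
    )
    pipe.to("mps")
    image = pipe(prompt=generation_prompt, num_inference_steps=steps,
                 height=height, width=width,
                 guidance_scale=guidance).images[0]
    image.save("static/image.png")
    return 'static/image.png'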
Next is the code to start image generation, as can be seen in Listing 6. This code deletes a previously generated file, if it exists, and kick-starts the image generation process by submitting the generate_image function, along with the user-supplied input prompt, to the executor. When the image is generated, it's saved as static/image.png.
Listing 6: The start generation route
@app.route('/start_generation', methods=['POST'])
def start_generation():
    # Remove any previously generated image
    if os.path.exists('static/image.png'):
        os.remove('static/image.png')
    generation_prompt = request.json['generation-prompt']
    # Pass the function and its argument separately so it runs in the background
    executor.submit(generate_image, generation_prompt)
    return jsonify({'message': 'generating image'})
In Listing 7, you can see the check_generation route. Here, you simply check to see if the file is present, which means that generation is complete. If the file isn't present, it returns "generation running"; otherwise, it returns "generation complete" along with the name of the generated image. This code is intentionally simple. In the real world, you'll have an authenticated application, tie this to a user's session, and generate unique filenames for every execution. You might even have a clean-up process to remove images older than a certain date. Those are typical in any production application.
Listing 7: The check generation route
@app.route('/check_generation')
def check_generation():
    # Check if the image file exists
    if os.path.exists('static/image.png'):
        return jsonify({'message': 'generation complete',
                        'image': 'static/image.png'})
    else:
        return jsonify({'message': 'generation running'})
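To give a flavor of that production hardening, here's a rough sketch of a variant that gives every request its own output file. The route name and the generate_image_to helper it submits are hypothetical; you'd adapt them to your app.

import uuid

# Sketch only: each request gets a unique output file so concurrent users
# don't overwrite each other's images. generate_image_to() is a
# hypothetical variant of generate_image() that accepts an output path.
@app.route('/start_generation_v2', methods=['POST'])
def start_generation_v2():
    filename = f"static/{uuid.uuid4().hex}.png"
    generation_prompt = request.json['generation-prompt']
    executor.submit(generate_image_to, generation_prompt, filename)
    # Return the filename so the client can poll for that specific file
    return jsonify({'message': 'generating image', 'image': filename})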
Finally, you have the image route, where you serve the image from the generated file in the static folder. This can be seen in Listing 8.
Listing 8: The serving image route
@app.route('/image')
def serve_image():
    return url_for('static', filename='image.png')
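Note that, as written, this route returns the image's URL rather than the image bytes. If you'd rather have a route that streams the file itself, a small sketch using Flask's send_file (the route name here is my own) would look like this:

from flask import send_file

# Alternative: return the image bytes directly instead of its URL
@app.route('/image_bytes')
def serve_image_bytes():
    return send_file('static/image.png', mimetype='image/png')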
Of course, you'll need a route to serve index.html
from the templates folder, which can be seen in Listing 9.
Listing 9: Default route to serve index.html
@app.route('/')
def index():
    return render_template('index.html')
The full server-side code put together can be seen in Listing 10.
Listing 10: The full index.py for server side code
from flask import Flask, render_template, jsonify, url_for, request
from flask_executor import Executor
from PIL import Image, ImageDraw
import os
import torch
import safetensors
import transformers
import diffusers

app = Flask(__name__)
executor = Executor(app)

# Generate the image
def generate_image(generation_prompt):
    model_path = "models/realistic"
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path, torch_dtype=torch.float16,
        safety_checker=None, use_safetensors=False
    )
    pipe.to("mps")
    image = pipe(
        prompt=generation_prompt,
        negative_prompt="",
        num_inference_steps=30,
        height=768,
        width=512,
        guidance_scale=1.5,
        generator=torch.Generator().manual_seed(1876016)
    ).images[0]
    image.save("static/image.png")
    # Return the filename
    return 'static/image.png'

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/start_generation', methods=['POST'])
def start_generation():
    if os.path.exists('static/image.png'):
        os.remove('static/image.png')
    generation_prompt = request.json['generation-prompt']
    executor.submit(generate_image, generation_prompt)
    return jsonify({'message': 'generating image'})

@app.route('/check_generation')
def check_generation():
    # Check if the image file exists
    if os.path.exists('static/image.png'):
        return jsonify({'message': 'generation complete',
                        'image': 'static/image.png'})
    else:
        return jsonify({'message': 'generation running'})

@app.route('/image')
def serve_image():
    return url_for('static', filename='image.png')

if __name__ == '__main__':
    app.run(debug=True)
Your project is already set up for debugging, so hit F5 to start debugging. As soon as you hit F5, you should see a message like the one below in VSCode's terminal:
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
* Restarting with stat
* Debugger is active!
* Debugger PIN: 327-338-769
Now open the browser and visit the http://127.0.0.1:5000
URL and you should see a web page, as shown in Figure 7.

Ooh, so exciting! Let's enter a prompt. I used the following prompt:
A cat wearing a Batman suit at sunset, fighting crime in New York, hyper-realistic.
Creative enough? Feel free to try whatever you wish! My crime-fighting kitty can be seen in Figure 8.

Haha! This is fun. Let's try another prompt. This time, let's not go realistic. Let's go artistic.
Watermelon has arms and legs and is running in a cartoon.
The results can be seen in Figure 9.

Honestly, I could play with this all day, and there's so much more you can add here, things I didn't even get a chance to touch upon. For instance, what if you could show the user not one but four generated images from the input prompt? Yep, that's easy: The pipeline already returns its results as an images array; ask it for four images per prompt and you can show all of them, so try it!
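Here's a minimal sketch, assuming the pipeline and prompt from the earlier listings are already set up; num_images_per_prompt asks for four images in one call, and the loop saves each one.

# Generate four candidate images for one prompt and save them all
result = pipe(
    prompt=generation_prompt,
    num_images_per_prompt=4,
    num_inference_steps=30,
    height=768,
    width=512,
    guidance_scale=1.5,
)
for i, img in enumerate(result.images):
    img.save(f"static/image_{i}.png")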
Then you can take an image as an inspiration and ask the user to tweak it. For instance, that watermelon looks cool, but turn it into a Picasso painting, and that would be image-to-image generation.
What about chaining multiple models together to perform automated tasks, like enhance certain details, upscale, and well, what about video?
Could I take an input image and, say, turn it into a short video, using that image as the inspiration for the result? Yeah, all that is possible!
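To give you a taste of the image-to-image idea, here's a minimal sketch using the same locally converted model. The file names and the strength value are just examples I picked.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Start from a previously generated image and re-imagine it in a new style
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "models/realistic", torch_dtype=torch.float16, safety_checker=None,
    use_safetensors=False
)
pipe.to("mps")

init_image = Image.open("static/image.png").convert("RGB")
image = pipe(
    prompt="the same scene as a cubist painting in the style of Picasso",
    image=init_image,
    strength=0.75,        # how far to stray from the original image
    guidance_scale=7.5,
).images[0]
image.save("static/image_picasso.png")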
Summary
Here's an interesting fact for you. The most powerful supercomputer in the world is called El Capitan. It cost $600 million to build, it needs 28,000 tons of liquid to cool it, it draws 27 megawatts to run, and I'm sure it has a building and a dedicated staff to babysit it. In terms of raw teraflops, can you believe that this incredible supercomputer, which costs millions and needs the power of a metropolis to run, is only about 320 times more powerful than an off-the-shelf M4 Max running on roughly 25 watts of power, exactly the kind I used to write this article?
I could, theoretically speaking, build a server farm of 320 Macs for around 1.6 million USD (at roughly $5,000 apiece) and create an AI farm as powerful as the most powerful supercomputer ever built.
Now I know it isn't as simple as that. There's the issue of writing specialized software. Are we entering an era where AI can write this software for me? Believe me, as a race, we are entering an era that is going to blow our minds around what is possible in the hands of mere mortal developers.
I'm serious. Get on the AI bandwagon. Not because you want to be an AI engineer, but because you want to remain relevant. The superpowers you discover will make you 1000x more productive. I'll never bother to learn regex now or internalize complex Git commands. LLMs figure this out for me.
Tell me what real-world tasks you'd like to see AI solve for you. I'd like to keep a developer focus, and ideally target solutions that don't require subscriptions or swiping credit cards.
Hey, PowerPoint! Where is my AI copilot for memes?
More next time. Until then, happy coding!