In the summer of 2022, I used Midjourney to generate my first artificial intelligence image. It was a clockwork hand. Without irony, I chose the very thing that Generative AI was terrible at. It returned some images that vaguely resembled hands. These weren't just any hands, of course. They were abominations in the eyes of God. It was accidental Salvador Dalí, a warped thing with wobbly appendages and vague suggestions of numberless clock faces. Despite the crude response, I was fascinated and immediately set to spending way too much money on hundreds of pictures that were completely unusable for any purpose other than to say, “Hey, isn't this weird and cool?” Take a look at the monstrosity in Figure 1.

Figure 1: My first attempt is freakish.

This was just with Midjourney 2.0. Other iterations followed, and they came quickly. Through the updates, the hand started to take shape. It took a few years to get the appropriate number of fingers worked out, but all along, I watched the technology change in real time. With every upgrade, my clockwork hand looked more and more like the one I imagined. I'm a little slow on the uptake sometimes, but around Midjourney 3.0, the thing that had been nagging at me since the beginning finally revealed itself. If you could make an image of something, you could get Generative AI to make lots of those images in a row, and if you wanted to make lots of images in a row, you could pack them in tightly—24 images every second, maybe.

The Myth of Filmmaking

When I was growing up, filmmaking was some mystical art. It was an occulted thing, a special club for people with exciting lives, like astronauts and karate champions. Only rich people and wizards got to make movies. Then film gave way to digital. We started to have affordable, non-linear editing systems in our homes and the “prosumer” camera market exploded. In 2001, Danny Boyle shot the zombie flick 28 Days Later on one such camera, the Canon XL1. I bought my own right after that for around $1,500. As you can see in Figure 2, it was portable enough to carry in one hand and it used tape storage instead of film.

Figure 2: The Canon XL1 was groundbreaking.

Before long, tapes were swapped out for digital storage. The plunging cost of compute put complex FX systems into the hands of hobbyists. Blender, an open-source 3D creation suite you can see in Figure 3, supports everything from modeling and animation to compositing and video editing. And it's free. The idea of filmmaking as an exclusive club started to erode. In 2015, filmmaker Sean Baker shot the award-winning Tangerine entirely on an iPhone 5S. Just a few years later, Steven Soderbergh used an iPhone 7 Plus to make the thriller Unsane. Movies are shot on phones all the time now, with built-in cameras that outperform the XL1. Filmmaking has been democratized, in many ways. Aspiring filmmakers have near-limitless tools available.

Figure 3: Blender 2.91

A New Toy in the Toybox

2023 saw the mainstream adoption of primitive video generation. Diffusion models were improving rapidly, and open-source frameworks proliferated. Early that year, Runway ML introduced video-to-video transformations with Gen-1. It was one of the first tools that let neophytes with no expertise in coding or filmmaking generate video content, all packaged in a user-friendly interface. Then the dam broke. Companies like Kaiber and Pika Labs introduced advanced workflows to the ecosystem. Just months after Gen-1, Runway introduced Gen-2, providing actual text-to-video capabilities. After much anticipation, OpenAI's Sora hit, only to be quickly eclipsed by Google's Veo 3 model.

Veo 3 is Google DeepMind's latest and most sophisticated video model. Although it's capable of impressive text-to-video, Google also implemented a suite of other perks that give users more control. With these new features, users can generate native audio, achieve greater consistency from shot to shot, and use prompting for shot composition. Since launch, Google has integrated Veo 3 into Flow, its own AI filmmaking suite, as well as Gemini, its primary LLM.

How Does It Work?

To generate short clips of video, Veo 3 accepts multimodal inputs: text prompts, images, or frames of video. It's built on a diffusion model, which generates data such as images, audio, or text through a two-part process. In the forward process, used during training, the model progressively adds noise to a clean sample, over and over, until the data becomes pure noise. This intentional muddying of the information teaches the model how a clean piece of data transitions into a noisy one. In the reverse process, the model learns, step by step, to strip that noise back out. It does this in iterations until it rebuilds the original data or, at generation time, produces something entirely new from pure noise. Through extensive and repetitive training, the model learns the underlying distribution of the data, which enables it to generate new images from a blank slate of noise by reversing the noise process.
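To make that two-part process a little more concrete, here's a toy NumPy sketch of the idea. This is not Veo 3's architecture; it only shows the forward "add noise" step and the shape of the reverse "remove noise" step described above, with the learned network left as a stub.

# A toy sketch of the diffusion idea in NumPy. This is NOT Veo 3's actual
# architecture, just the forward noising step and the shape of the reverse
# denoising step, with the learned network stubbed out as a parameter.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))            # stand-in for one clean training frame
betas = np.linspace(1e-4, 0.02, 1000)   # noise schedule: how much noise per step
alphas_cumprod = np.cumprod(1.0 - betas)

def forward_noise(x0, t):
    """Forward process: blend the clean frame with Gaussian noise at step t."""
    noise = rng.standard_normal(x0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

# By the final step, the frame is essentially pure noise.
noisy, true_noise = forward_noise(image, 999)

def reverse_step(x_t, t, predict_noise):
    """Reverse process: a trained network predicts the noise so it can be
    stripped back out, walking from pure noise toward a clean frame."""
    predicted = predict_noise(x_t, t)    # the part the model learns in training
    a = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - a) * predicted) / np.sqrt(a)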

Hands On

Access to the tool is available with a subscription to Google AI Pro. If you're hoping to get the tool à la carte, you may be disappointed. The bundle includes a whole host of other options you can take or leave, like Gemini 2.5 Pro, supplemental storage, NotebookLM, and more. The basic tier is $19.99 per month, which is market competitive. There's a pricier tier with higher monthly generation limits, if you're ready to drop $250 per month. With the standard package, you get 1,000 AI Credits. The per-generation cost can be anywhere from 10 credits to 100, depending on speed, quality, and model (Veo 2 vs. Veo 3). For more control, I access Veo 3 through Flow, Google's simplified editing system. As you see in Figure 4, the interface is stripped down and simple.

Figure 4: At first glance, Veo 3 is bare bones.

Once a project is created, the rest of the interface is spare and clean. This isn't a full non-linear editing system like Final Cut or Adobe Premiere. You'll find more complex options on TikTok or Instagram. The primary interface at the bottom of the screen puts all the focus on the prompt, shown in Figure 5. It's your standard text prompt, little different from the prompt bar in OpenAI's Sora or Runway ML's interface.

Figure 5: The Veo 3 prompt window is focused on simplicity.

The prompt window provides a few simple options. The Settings section in the upper right allows the user to choose the number of outputs per prompt, from one to four instances, as well as the model. Veo 2 costs fewer credits but is less advanced and offers no native sound. Veo 3 has a Fast and a Quality setting. To give you an idea of how you're spending your credits, the Fast setting costs 20 AI credits per generation while the Quality setting costs a hefty 100. If you're not ready to drop a considerable amount of cash, the 1,000-credit limit on this plan is quite a barrier.
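To put that limit in perspective, here's some quick back-of-the-napkin math, assuming the numbers above (20 credits per Fast clip, 100 per Quality clip, eight seconds per clip):

# Rough math on the monthly credit allowance, using the costs quoted above.
MONTHLY_CREDITS = 1000
COST_PER_CLIP = {"Fast": 20, "Quality": 100}
SECONDS_PER_CLIP = 8

for tier, cost in COST_PER_CLIP.items():
    clips = MONTHLY_CREDITS // cost
    print(f"{tier}: {clips} clips, about {clips * SECONDS_PER_CLIP} seconds of footage")

# Fast: 50 clips, about 400 seconds of footage
# Quality: 10 clips, about 80 seconds of footage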

Users can dive right in with a simple Text to Video prompt, but for more granular control, the Frames to Video option accepts two keyframes. A starting frame and an ending frame can be uploaded (.png, .jpg, etc.) or generated via AI using the tool. Additionally, this function comes with a buffet of camera directions made available as simple icons, seen in Figure 6.

Figure 6: Frames to Video offers all the standard camera movements.

Prompting in Veo 3 is where you get to flex your directing muscles. For the best results, prompt it like you're sitting in the director's chair. Specificity is key. Name your subject, what they're doing, and where it's happening, along with how it feels and the style of the piece. The model responds well to cinematic language. Craft your prompt with camera movement, lighting, and style in mind, providing references when possible (Stanley Kubrick, 70s telefilm, found footage). Intentional, detailed writing will go much further than a generic suggestion. Stay away from vague prompts, don't abuse adjectives, and make sure every noun has a verb. Treat each prompt like the beats for a scene.
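To make that structure concrete, here's a hypothetical little helper (not an official Veo 3 schema, just an illustration of the ingredients above) that assembles a prompt from camera, subject, action, setting, and style:

# A hypothetical prompt builder, purely illustrative of the structure above.
def build_prompt(camera, subject, action, setting, style):
    return f"{camera} on {subject} as {action} in {setting}. Style: {style}."

prompt = build_prompt(
    camera="Slow dolly in",
    subject="an old wizard with a steampunk clockwork hand",
    action="he examines the hand opening and closing",
    setting="his shadowy, candlelit lair",
    style="moody 70s fantasy telefilm",
)
print(prompt)
# Slow dolly in on an old wizard with a steampunk clockwork hand as he examines
# the hand opening and closing in his shadowy, candlelit lair. Style: moody 70s
# fantasy telefilm.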

After much consideration, I craft the following:

[AI Query]

Dolly in on an old wizard in his shadowy lair as he examines his own 
steampunk clockwork hand. The mechanical hand opens and closes.

The First Attempt

Within seconds, Veo 3's Fast setting gives me two options, as seen in Figure 7. And I'm kind of stunned. I've performed this exercise with other tools like Sora, but it never resulted in anything this complex and accurate. The model followed my instructions perfectly and blew past my expectations from the jump. It understood the assignment in every way, adding mysterious sigils and a sparkling cauldron to the image.

Figure 7: The first attempt is surprisingly close to what I imagined.

With image prompting, the first pass is usually a question of horseshoes and hand grenades. With every iteration, I'm often asking myself, “Is this close enough?” It's rare that it spits out exactly what I want, so I end up revising prompts and uploading images to influence the result. Can I craft the perfect query, coaxing the model to do exactly what I'm envisioning? Yes, but it will take some effort, time, and money. That's the nature of prompt engineering: refining until you find just the right words to communicate your message to the Large Language Model. If you're trying to craft something with very specific details, that amount of trial and error can get expensive.

Basic Editing Tools

Once I've chosen my preferred output, I can add this beat to a greater scene with the click of a button in the Scenebuilder tool. For clarity, Flow is Google's platform through which you access models like Veo 3; Scenebuilder is the editing tool inside Flow. It ports the eight-second clip over into a rudimentary editing system, as seen in Figure 8. The options here are incredibly limited. The sliding bar over the video allows me to shorten the clip as much as I like but not make any specific cuts. If your project requires more flexibility—and it likely does—stick with your favorite non-linear editing system or grab something free, like CapCut or DaVinci Resolve. With Veo 3, as with OpenAI's Sora, Google isn't trying to compete in the professional editing space. These options are functional, but the main focus is to offer an uncluttered look at what the model is capable of.

Figure 8: Flow's Scenebuilder editing interface leaves lots of room for growth.

Adding another beat to the sequence is as easy as generating the first, but for this entry, I'll try the Frames to Video feature. This will require two keyframes: one for the beginning of the clip and one for where it will end up. Rather than providing an uploaded example from the internet, I go ahead and use the tool to generate them directly. Keep in mind that the description you provide for this part isn't for video. This is just the starting frame, so prompt accordingly.

[AI Query]

An old door on the cliffside of a snowy mountain. It is carved with magical 
sigils, the entrance to a wizard's lair.

Veo 3 comes back with two fine options, as seen in Figure 9 and Figure 10. The results for these still images are perfect, aligning with the arcane touches of the first video. The generic quality of the results is something to consider, though. Part of the reason the output here is so good is that we're dealing with the median. AI excels at presenting archetypes. When I ask for an old wizard, it comes directly from the mold of GENERIC OLD WIZARD. Good art direction with specific, focused human effort is still required to make your project stand out from the infinitely expanding pool of slop. That takes more time and work. It needs your touch—your special sauce—to make it something new and different.

Figure 9: Here's a door, a mountain, and the arcane magics.
Figure 10: Is this the back door to Ironforge? Or a secret path through Caradhras?

Let's go with the first one from Figure 9. It's a tiny bit spookier. For the ending keyframe, I want the wizard standing on the cliffside, casting a mighty spell. Sounds simple enough, right? But here's where I'm afraid it may get tricky. With this new generation, will the wizard look the same? I need consistency, which has been an ongoing issue with earlier video models. Faces will shift. Limbs will twitch. Maybe an entirely different wizard person will emerge. I hope it's not that Potter nerd.

[AI Query]

An old wizard with a steampunk clockwork hand stands on a snowy 
cliffside, holding his hand out to cast a powerful spell into the sky.

I believe this is what we call “the other shoe dropping.” Veo 3 was on such a roll. The wizard looks the same-ish. He's still an old guy with a white beard. In Figure 11 and Figure 12, you can see where things start to go awry. Here begins the game of horseshoes. Is this close enough? No, I don't think so. I'd like to spend a few more credits to see if I can coax something better out of it.

Figure 11: This feels like it's back in Midjourney territory.
Figure 12: I don't think that's a 5E spell.
[AI Query]

An old wizard with a steampunk clockwork hand stands on a snowy 
cliffside, holding his hand out to cast a powerful spell at the camera.

The change in the prompt was subtle, nudging the model closer to what I want. One of the results still has that video game sheen to it, while the other looks decidedly more photo-real, as you can see in Figure 13 and Figure 14. Neither is exactly what I want. Without the ability to upload a reference image, the model starts from scratch with the old wizard; the wizard from the first clip is nowhere to be found. If I were so inclined, I could port the frames into a more powerful AI editing tool and make tweaks within the image. Many Gen AI image tools offer in-painting, where specific areas of an image can be altered via targeted prompting. For the purposes of this exercise, however, I'll just go with Figure 13 as the final frame of this beat.

Figure 13: He looks familiar.
Figure 14: Veo 3 took the “clockwork hand” a little too literally.

Now that I have my keyframes, I can tell Veo 3 what I want to happen and instruct the camera in which direction to move, if I want a dynamic shot. The transition will move the video from Figure 9 to Figure 13. That's usually where you see a lot of artifacts and shenanigans, things shapeshifting on the fly, and the world becoming rubbery and amorphous.

[AI Query]

The door opens. An old wizard with a clockwork steampunk hand steps out 
onto the snowy cliffside. He raises his clockwork hand to the camera, 
casting a powerful spell into the air.

I select Dolly Out for the camera movement and cross my fingers, but I'm immediately met with an error. As of this experiment, the camera movement buttons only work with Veo 2. Once I go ahead with the lesser model, Google serves up its first disappointment, as you can see in Figure 15.

Figure 15: Inconsistent styles within the same image are a big fat fail.

Veo 2 drops the ball. The door doesn't open. The photo-real Not-Gandalf slides unnaturally across the image, holding his glowing hand out like he's begging for wizard drugs. It looks ridiculous. Everything people hate about AI video manifests within this eight-second clip. It may just be the limitations of Veo 2, but that's the part I'm having trouble reconciling: why switch to the lesser model for a major feature of your product?

Rather than wipe the entire effort, I go back and start with the door. I like my magic door. It can stay. The prompt remains the same, but I'll only provide the starting keyframe. Trying to force it to go from one keyframe and end up in the next is causing it to contort in ways that make it useless. Figure 16 shows you the folly of my ways.

Figure 16: Now it looks like something from a bad Full Motion Video game from the 90s.

In Veo 2's defense, the walking animation on the Warlock Rick Rubin is pretty good. Everything else, like the way he just kind of squeezes through a crack in the door, is terrible. As bad as these images look, the videos are worse, but that doesn't mean the keyframe method is worthless. Far from it. If you use a frame from the original video as the starting point, you can extend the original by another eight seconds and maintain consistency of the images.
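If you'd rather pull that starting frame yourself from a downloaded clip, a few lines of Python with OpenCV will do it. The filename here is just a placeholder, and this step is optional; it simply grabs the last frame so you can upload it as the next keyframe.

# Pull the last frame of a downloaded clip to reuse as the next starting
# keyframe. Assumes OpenCV (pip install opencv-python) and a hypothetical
# local filename.
import cv2

cap = cv2.VideoCapture("wizard_clip.mp4")            # hypothetical filename
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, frame_count - 1)     # jump to the final frame
ok, frame = cap.read()
cap.release()

if ok:
    cv2.imwrite("next_keyframe.png", frame)           # upload via Frames to Video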

The end of the initial clip can be seen in Figure 17. That's the starting point for the new clip.

Figure 17: This will serve as the new starting keyframe.
[AI Query]

The old wizard stands, turns, and strides through a door out onto a 
snowy cliffside.

The results are much better! It looks like the same wizard this time, exiting the same lair I had before. When I say “the same” here, I mean “pretty similar.” The details shift as I move from one clip to the next, and the ambient noise is inconsistent throughout the sequence. With the second clip, I can hear the room tone drop out, immediately replaced with alpine winds. There's a very clear warning in the interface that apologizes for errors with the native sound generation. At this stage in the product's life cycle, that's not a dealbreaker.

Now that my spellcaster has stepped out into the snow, I have to see if I can kick the action up a notch. The end of the second clip becomes the beginning of the third clip in the sequence, which you can see in Figure 18.

Figure 18: It's not quite John Ford, but I like it.
[AI Query]

The old wizard steps out into the snow, raises his clockwork hand to 
the sky, and casts a powerful spell.

It works brilliantly. The third clip is generated as instructed. The clockwork hand isn't the same one it designed before, but as the gears and cogs spin, so too does a swirl of sparkling magic, as seen in Figure 19. The spell crackles in the audio. Now I have a 24-second sequence. If you're anything like me, you know what the wizard needs to do next. He needs to blow something up with that magic he's working.

Figure 19: My wizard means business.

The wizard needs a formidable foe, so let's go big. After all, the small $20/month movie budget is going a long way already. Because this will be a different shot that won't continue from the wizard's path, I can't use anything from the sequence as a starting reference frame. From here, I switch back to Text to Video.

[AI Query]

Wide shot of a clockwork titan that looms over the snowy mountains. 
Something on the ground far below gets its attention, so the 200-foot-tall 
behemoth turns to look.

As seen in Figure 20, this new character suits my needs pretty well. It's a colossus of cogs and gears, billowing smoke high above the frosty landscape. It moves as I asked. More or less. Within tolerance, anyway.

Figure 20: It's the kind of enemy that will make D&D players hate you.

Once I get a grasp of Scenebuilder and its limitations, I go all in, burning through credits in a frenzy. Every clip adds eight seconds to the sequence. I trim them and move them around on the timeline, grabbing stills from each one to use as my next starting keyframe.

  • The magic from the wizard's spell distorts the sky with arcane energies. It arcs and blazes with power.
  • A storm of purple and blue lightning billows through the sky, attacking the titan. The magic storm tears it apart.
  • The titan falls to pieces. It collapses violently into the snow.
  • Wide shot of this titan rampaging through medieval alpine villages, stomping on homes. Villagers flee the destruction.
  • Dolly out to show cheering villagers celebrating among the rubble of their village.
  • The old wizard lowers his clockwork hand, straightens his robes, and turns to head back inside.
  • The titan thrashes and flails as the powerful magics tear it apart.

My wizard doesn't return to his lair, for some reason. He decides to mosey off into the sunset. I lean into the mistakes, letting them steer the story. “Vibe coding” is a thing. Is this “vibe directing”? The two terms seem antithetical, but sometimes fighting the “happy accidents” makes things so much worse.

[AI Query]

Dolly up and out as the old wizard walks off into the distance.

I suppose it's a more fitting end for a legendary hero after a feat of wizardry. With each generation, the output surprises me in good, bad, and very strange ways. As you see in Figure 21, the villagers are indeed celebrating. What you can't hear is the song they're singing. In unison, they belt out, “The darkness is gooooooone!” It's not the tone I was going for. Not at all.

Figure 21: I don't know why they're singing. No one gave them permission to sing.

I'm ready to compile the final cut, happy with an amusing bit of generic high fantasy, when Scenebuilder collapses under the weight. Once I get more than eight or nine clips, it becomes impossible to rearrange them on the timeline as the interface promises. What should be a simple click-and-drag operation is an exercise in frustration. I can't see most of the clips, and scrolling through them just doesn't work. I end up putting clips in the wrong spots, deleting the wrong segments, and ultimately losing the sequence altogether. I made the mistake of refreshing the page; there was no way to save my work other than downloading the entire thing. Luckily, each clip I generated remains in the library, but the overall edit is gone. My 90-second epic disassembled itself.

Assessment

Instead of using settings, dials, and lenses, you use language. That's the L of NLP. In natural language processing, clarity of message is key. Dress it up with a fancy name like “prompt engineering” all you want, but there's still an element of sorcery. Craft your wish as best you can. Be precise, but don't overload it with details, and hope that the genie isn't feeling fickle.

We Are So Close

The raw video clips produced by Veo 3 are remarkable but still have one foot planted in the uncanny valley. If you know what to look for, you can't help but spot the telltale signs of AI. Most of what you produce with it will still be derided as AI slop, because it is. Visual storytelling is about making choices, and Veo 3 doesn't let you make many choices outside of what you can prompt. It's difficult to imagine someone using Veo 3 to craft an entire film. When hundreds or thousands of shots come into play, each requiring iterative generation, review, and refinement, the logistics become massive. The consistency and specificity required to truly shape your creative vision just aren't there. That's a lot of time and money to get close enough.

The toolset provided for editing the timeline is slipshod and ineffective. Scenebuilder fails on almost every level. It borders on useless, but that's telling. Right now, Google isn't in the business of making non-linear editing suites. That's not the point. They care about providing this stunning video generation model. I wouldn't expect many more bells and whistles to be added; that's better left in the hands of full editing environments like Adobe Premiere, and Google knows it.

It's that raw video material that's ripe with promise. The real benefit is found by adding it to existing processes. True utility can be had here by generating drone shots, temp scenes for workprints, or quickly converting storyboards to animatics. It should be one of many tools in a creator's arsenal, where the video can be ported into a proper post-processing system for color correction, audio mastering, motion graphics polish, and so on. Human intervention is still a must for a fully realized product.

That said, even now we're starting to see the first batch of Gen AI commercials on the market. The canned water company, Liquid Death, just introduced a commercial full of cultists, explosions, and a whale on the highway. It was made by one person. Although uncanny-valley footage falls short of what most people consider cinema, expect to see a surge of commercials and other shortform content made on the cheap. If a company can hire one person to produce a video instead of budgeting for an entire cast and crew, what do you think they'll choose?

Accelerating Returns

Veo 3 is a groundbreaking tool in a time when groundbreaking tools are being released every day. This article was originally about Sora and was ready for print in January of 2025, but print is a slow medium and artificial intelligence is advancing at a speed that's difficult to comprehend. The trajectory of video models over the last year shows exponential growth. Futurist Ray Kurzweil describes these leaps as part of the “law of accelerating returns,” which holds that each new jump in capability builds on the last, compressing development timelines into a blur of progress. Less than a year ago, the best video models in the hands of consumers could only generate incoherent horror shows. Hands warped. Faces twisted into surreal nightmares. Native audio was a fantasy. Veo 3 illuminates the road ahead, showing us that as the technology iterates and stacks, full cinematic consistency isn't far off.

This is as bad as it will ever be. Eight-second clips will become five-minute short films. Those short films will grow into prompt-driven movies. Not long after that, we'll be able to tell our favorite movie-generating service that we're looking for a murder mystery set on Venus, starring Jean-Claude Van Damme and Betty White, directed by David Fincher, and that it's an unofficial sequel to “Madame Web.” It will be packaged for streamers like Netflix, who will be able to look at your preferences and manifest a surgically targeted film. Ready or not, we'll all have movies made by clockwork hands.