Could text-to-video technology, emerging from companies such as Runway and Nvidia, spell doom for Hollywood and undermine our sense of what is real?
In the last couple of years, AI models that can conjure original images out of a few lines of text have taken over the internet, flooding our online lives with everything from elaborate fantasy landscapes, to eerie digital demons, to photojournalism from uncanny alternate histories. At the same time, the rise of another digital technology, deepfakes, has seen famous figures’ faces plastered onto bodies that aren’t their own, inserted into everything from porn to dystopian disinformation campaigns. Both technologies have played their part in transforming ideas about creativity and undermining the reliability of our visual media, but the disruption doesn’t look set to slow down anytime soon.
The race to develop AI generators that can output not just still images, but realistic videos, has been on for some time. In December 2022, we got an embryonic preview of what that might look like courtesy of the surreal sitcom Nothing, Forever, which produced sketchy pixel-art graphics with the help of several AI systems, including the large language model GPT-3. In the last few weeks, though, the race has really been heating up, with the unveiling of the first genuine text-to-video models.
Working text-to-video entered the public consciousness in early April, via a mildly horrifying visual of Will Smith ravenously making his way through plates of spaghetti. The product of ModelScope, a text-to-video system released by Alibaba and popularised through a demo on Hugging Face, the viral video wasn’t alone – see also: Dwayne Johnson chomping on rocks, Arnold Schwarzenegger punching a pizza, and a Chuckle Brother performing a guitar solo in front of an erupting volcano – but it happened to be the example that captured the public imagination, so it became the disfigured face of the coming AI video revolution. Just a month later, however, that face is looking much more realistic, and the revolution seems increasingly imminent.
With several companies competing to produce the most realistic AI video on the market, there are some obvious concerns. What will this mean for the already-skyrocketing rate of misinformation online? What will our creative landscape look like when anyone can conjure up a full film with just a script and a dream? Is Hollywood doomed? Below, we go in search of some answers.
Introducing, Text to Video. With Gen-2.
Generate videos with nothing but words. If you can say it, now you can see it.
Learn more at https://t.co/PsJh664G0Q pic.twitter.com/6qEgcZ9QV4
— Runway (@runwayml) March 20, 2023
WHAT IS AN AI TEXT-TO-VIDEO GENERATOR, ANYWAY?
By now, you’ve probably heard of (or even experimented with) some of the biggest text-to-image generators out there, such as DALL-E 2, Imagen, and Midjourney. If so, the basic concept will be familiar: input a few lines of text describing an object or situation, the style in which it’s depicted, and even specifics like camera angles and lenses, then watch the generator bring it to life in a few seconds.
From a user’s point of view, text-to-video generators are essentially the same. As the AI-powered “creative suite” Runway puts it: “Generate videos with nothing but words. If you can say it, now you can see it.”
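To make that concrete: ModelScope’s weights (the system behind the Will Smith spaghetti clip, more on which below) are openly hosted on Hugging Face, so anyone with a capable GPU can run the same prompt-in, video-out loop themselves. Below is a minimal sketch using the open-source diffusers library; the model ID is real, but the prompt and settings are illustrative, so treat it as a starting point rather than official guidance.

```python
# A rough sketch of "prompt in, video out" with ModelScope's openly
# released text-to-video model, via Hugging Face's diffusers library.
# Prompt and settings are illustrative; a CUDA GPU is assumed.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",  # ModelScope's 1.7B-parameter model
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # trades speed for a lower VRAM footprint

# One sentence of description is the entire "interface".
video_frames = pipe(
    "an astronaut riding a horse on the moon, cinematic lighting",
    num_inference_steps=25,
    num_frames=16,  # roughly two seconds of footage at 8fps
).frames

print(export_to_video(video_frames))  # writes an .mp4 and prints its path
```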
WHY ARE THEY SUDDENLY BLOWING UP?
Text-to-video has been around, as a concept, for some time. Back in September 2022, Meta announced the imaginatively titled Make-A-Video, which was shortly followed by the announcement of Google’s Imagen Video. The thing is, all either company had to share were research papers and a few pre-made examples.
Like text-to-image models before them, text-to-video generators are mostly blowing up now because they’ve reached a point where normal people (meaning, not just AI experts) can understand how to use them to create something funny, or weirdly beautiful, or just about realistic enough to be entertaining. This is how complex creative technologies tend to spread into the mainstream. The obvious ethical questions follow shortly after.
Will Smith eating spaghetti via text to video AI.
source: https://t.co/et1Lx1rIdj pic.twitter.com/WHm9WJVJAA
— Anonymous (@YourAnonNews) March 28, 2023
THEY’RE HARDLY GENERATING DUNE THOUGH, ARE THEY?
Admittedly, no. They’re definitely not on a par with Dune. To be honest, the best AI-generated video couldn’t even hold a candle to your least favourite Marvel movie... yet. That “yet” is important, though. Remember what AI-generated images looked like when they first arrived – fuzzy blobs and lumpy, misshapen knockoffs of real artists’ work? Now, just a couple of years later, they’re being mistaken for real photos left, right, and centre, and being used to push the boundaries of human creativity. The AIs are even starting to figure out those notoriously finicky hands.
Whether you like it or not, there’s no reason to expect the upward curve of text-to-video improvement to be any less steep, especially as more competitors enter what’s likely to be a very profitable race to produce commercially viable results.
this AI beer commercial looks exactly like how an alien intelligence would understand our beer commercials pic.twitter.com/mn3OzW32ww
— Armand Domalewski (@ArmandDoma) May 1, 2023
WHO IS LEADING THE CHARGE?
Runway, one of two tech startups behind the controversial AI art generator Stable Diffusion, started publicly testing its Gen-2 video model last month, with uncanny results popping up across social media. Gen-2 lets users input simple text prompts to create video from scratch, and also offers the option to incorporate prompt images, such as portraits (think: deepfakes, but with full creative control).
ModelScope, meanwhile, is a relatively basic model created by a research division of e-commerce giant Alibaba. It produced many of the weird, short-form clips that characterised the early text-to-video buzz, and it has proved controversial because many of its generated videos include blatant Shutterstock watermarks, betraying its image-scraping sources.
Then, of course, there are the bigger players like Meta and Google – but, despite some creepy photorealistic Mark Zuckerberg avatars floating around the internet recently, both have remained relatively quiet since their first research papers emerged last year. Another promising research project (or an ominous one, depending on your outlook) comes courtesy of chipmaker Nvidia, which has promised to unveil much more at an upcoming conference in August.
Other users are generating more convincing AI video by combining more than one AI tool – for example, generating images in Midjourney, then bringing them to life via Runway. You might also recognise this high-definition (though less animated) aesthetic from the reimagining of Harry Potter as a Balenciaga fantasy, or the fake trailer for a Wes Anderson remake of Star Wars.
A new #GenerativeAI method by NVIDIA researchers uses off-the-shelf, pre-trained latent diffusion models (LDMs) to turn image generators into high-resolution video generators.
Project Site: https://t.co/kjoFyQc2TO
Paper: https://t.co/ZA1uLXTi9x pic.twitter.com/vmLYqPIOOG
— NVIDIA AI Developer (@NVIDIAAIDev) April 20, 2023
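The core idea, as the paper describes it, is surprisingly economical: keep a pre-trained image generator’s “spatial” layers frozen, and train only lightweight “temporal” layers that let individual frames attend to one another, so the model learns motion without relearning how to draw. A toy PyTorch sketch of that structure might look something like this (all names and dimensions are illustrative, not the researchers’ actual code):

```python
# Toy illustration of the frozen-image-backbone-plus-temporal-layers idea.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention across the time axis of a (batch, frames, dim) tensor."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual, so training starts close to the identity

class VideoBlock(nn.Module):
    """A frozen spatial (image) layer followed by a trainable temporal one."""
    def __init__(self, spatial_layer: nn.Module, dim: int):
        super().__init__()
        self.spatial = spatial_layer
        for p in self.spatial.parameters():
            p.requires_grad = False  # the image generator stays untouched
        self.temporal = TemporalAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, positions, dim). The spatial layer treats each
        # frame independently; the temporal layer then ties frames together.
        b, f, p, d = x.shape
        x = self.spatial(x.reshape(b * f, p, d)).reshape(b, f, p, d)
        x = x.permute(0, 2, 1, 3).reshape(b * p, f, d)
        x = self.temporal(x)
        return x.reshape(b, p, f, d).permute(0, 2, 1, 3)

# Stand-in "image layer" plus 8 frames of dummy features:
block = VideoBlock(spatial_layer=nn.Linear(64, 64), dim=64)
video = torch.randn(2, 8, 16, 64)  # (batch, frames, positions, channels)
print(block(video).shape)  # torch.Size([2, 8, 16, 64])
```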
WHAT DOES THE IMMINENT TEXT-TO-VIDEO BOOM MEAN FOR THE FUTURE?
At this point, it’s pretty much undeniable that text-to-video generators are coming, and they’re going to have a huge effect on the media landscape, just like image generators did before them. In fact, we can basically expect more of the same reactions, from widespread fears about job losses in the creative industries (particularly in fields such as special effects), to confusion about the authenticity of what we’re seeing in front of our eyes.
Of course, supporters of text-to-video will claim that there are many advantages as well. New technologies – from the camera to digital film and CGI – pretty much always expand our artistic horizons, for better or for worse, and it’s possible that AI video will help unlock entirely new forms of expression. Failing that, we might at least be able to tailor-make our own TV shows to keep us occupied after our jobs are automated out of existence. Who knows, with the right prompting we might even get a sneak preview of our demise at the hands of the robot overlords.
Yerrr, The exploration of Film & AI continues!!! 🖖🏾
This is #HelmetCity Test II
An all-AI-generated video using #HelmetCity images generated on @midjourney and combining them with written prompts on @runwayml Gen2 Text to Video, then finally edited in premiere pro. With music… pic.twitter.com/ar6cCPEkI2
— Jah. (@ArtByJah) April 27, 2023