In a significant stride beyond existing text-to-image systems such as Midjourney and DALL-E 3, Google has unveiled its latest creation, VideoPoet, a large language model (LLM) that extends multimodal generation into video.

Developed by Google's research team, VideoPoet can process a diverse array of inputs, including text, images, video, and audio, and generate video from them. What distinguishes VideoPoet is its decoder-only architecture, a design that lets it generate content for tasks it was not explicitly trained on. Like other LLMs, the model is trained in two steps, pretraining followed by task-specific adaptation, which allows it to be customized for a range of video generation tasks.
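To illustrate how a single decoder-only model can serve many tasks, here is a minimal sketch of framing each task as one flat token sequence, with a task prefix followed by conditioning tokens and target tokens. The special-token ids, task names, and the build_sequence helper are illustrative assumptions, not VideoPoet's actual vocabulary or API.

```python
# Hypothetical sketch: one flat token sequence per task, so a single
# decoder-only model can be pretrained on a mixture of tasks and later
# adapted to one of them using the same layout.

def build_sequence(task: str, condition_tokens: list[int],
                   target_tokens: list[int]) -> list[int]:
    """Concatenate a task prefix, conditioning tokens, and target tokens."""
    # Illustrative special-token ids; not VideoPoet's real vocabulary.
    TASK_PREFIX = {"text_to_video": 1, "image_to_video": 2, "video_inpainting": 3}
    BOS, SEP, EOS = 0, 4, 5
    return [BOS, TASK_PREFIX[task], *condition_tokens, SEP, *target_tokens, EOS]

# Pretraining would mix many such tasks; task-specific adaptation would
# reuse the identical layout with examples from a single task.
seq = build_sequence("text_to_video",
                     condition_tokens=[101, 102, 103],    # e.g. text tokens
                     target_tokens=[901, 902, 903, 904])  # e.g. video tokens
print(seq)  # [0, 1, 101, 102, 103, 4, 901, 902, 903, 904, 5]
```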
According to Google researchers, VideoPoet's strength lies in its simplicity as a modeling method: it turns any autoregressive language model into a high-quality video generator. Unlike most current video models, which use diffusion and learn by removing noise added to training data, VideoPoet consolidates multiple video generation capabilities into a single language model, handling tasks such as text-to-video, image-to-video, video inpainting and outpainting, video stylization, and video-to-audio generation.

Operating as an autoregressive model, VideoPoet generates output by conditioning on what it has already produced. It is trained on video, audio, images, and text, with modality-specific tokenizers converting each input into discrete tokens, demonstrating the promise of LLMs for video generation. Tokenization, a foundational step in natural language processing, breaks input into smaller units, or tokens, that a model can process and analyze.
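The autoregressive loop itself can be sketched in a few lines: sample one token at a time, appending each new token to the context that conditions the next. The toy next_token_distribution below is a stand-in for a trained transformer, and in a real system tokenizers would also map the resulting tokens back into pixels and waveforms; none of these names reflect VideoPoet's actual interface.

```python
import random

VOCAB_SIZE = 1024
END_TOKEN = 0

def next_token_distribution(context: list[int]) -> list[float]:
    """Placeholder for a trained LLM's next-token probabilities."""
    random.seed(sum(context))  # deterministic toy behaviour
    weights = [random.random() for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt_tokens: list[int], max_new_tokens: int = 16) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        token = random.choices(range(VOCAB_SIZE), weights=probs)[0]
        if token == END_TOKEN:
            break
        tokens.append(token)  # the model conditions on its own output
    return tokens

print(generate([101, 102, 103]))
```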
Researchers anticipate that VideoPoet will eventually support a versatile 'any-to-any' format. Beyond its core capabilities, the model can assemble short films from multiple generated clips: the researchers had Google Bard write a short screenplay from prompts, then turned it into a coherent video, showcasing the model's adaptability in combining diverse content.

Despite its strengths, VideoPoet is limited in generating longer videos. Google addresses this by conditioning on the last second of a generated video to predict the next second, as sketched below. The model can also manipulate existing videos, altering the movement of objects within them, as in the example of the Mona Lisa yawning. In essence, VideoPoet marks a significant leap for multimodal LLMs, showcasing the transformative potential of language models in the dynamic field of video generation.
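To make the long-video workaround concrete, here is a minimal sketch of that sliding extension loop. It assumes a fixed number of video tokens per second, and predict_next_second is a hypothetical placeholder for the model; neither is VideoPoet's real API.

```python
TOKENS_PER_SECOND = 8  # hypothetical count of video tokens per second

def predict_next_second(last_second: list[int]) -> list[int]:
    """Placeholder for the model predicting one more second of video tokens."""
    return [(t + 1) % 1000 for t in last_second]  # toy stand-in

def extend_video(video_tokens: list[int], extra_seconds: int) -> list[int]:
    tokens = list(video_tokens)
    for _ in range(extra_seconds):
        last_second = tokens[-TOKENS_PER_SECOND:]  # condition on the tail
        tokens.extend(predict_next_second(last_second))
    return tokens

clip = list(range(TOKENS_PER_SECOND))            # one second of toy tokens
print(len(extend_video(clip, extra_seconds=3)))  # 32 tokens = 4 seconds
```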