Microsoft’s newest AI video model raises the bar with advanced trajectory-based generation

By Yasmeeta Oon

Jan 10, 2024

In the rapidly evolving landscape of artificial intelligence, a significant development has emerged in the realm of video generation. A number of AI companies, including giants like Stability AI and Pika Labs, have made notable strides in this field over the past few months. These firms have introduced models capable of creating various types of videos using text and image prompts. Building on these advancements, Microsoft AI has recently introduced a groundbreaking model designed to offer more nuanced control over video production.

The newly introduced model, named DragNUWA, enhances the existing methodologies of text and image-based prompting with a trajectory-based generation technique. This innovation allows users to precisely manipulate objects or even entire video frames along specified trajectories. This feature facilitates the creation of videos that are highly controllable in terms of semantic, spatial, and temporal aspects, all while maintaining a high standard of quality.

To foster community engagement and development, Microsoft has made the model weights and a demo of DragNUWA publicly available as an open-source project. It’s important to recognize that DragNUWA is still in the research phase and has not yet reached perfection.

What distinguishes Microsoft’s DragNUWA in the competitive field of AI-driven video generation? Traditionally, video generation powered by AI has relied on inputs based on text, images, or trajectories. While these approaches have shown promise, they often fall short in providing detailed control over the resulting video.

The combination of text and images alone, for example, tends to miss the intricate motion details inherent in videos. Conversely, pairing images with trajectories but no language may fail to convey which objects should appear later in the video. And language on its own can be ambiguous, especially when describing abstract concepts: a text prompt alone cannot differentiate between an actual fish and a painting of a fish.

To address these limitations, Microsoft’s AI team introduced DragNUWA in August 2023. This open-domain, diffusion-based video generation model unifies images, text, and trajectory inputs. This amalgamation enables users to achieve precise control over the video generation process, encompassing semantic, spatial, and temporal elements. Users can explicitly define the desired text, image, and trajectory in their inputs to control various aspects of the video, such as camera movements (including zooming in or out) and the motion of objects.

For instance, a user could upload an image of a boat on a lake, add a text prompt like “a boat sailing in the lake,” and specify the boat’s trajectory. The result would be a video depicting the boat moving in the indicated direction, closely aligning with the user’s vision. The trajectory input adds motion detail, language input helps predict future objects, and the image input distinguishes between different objects.
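To make the boat example concrete, here is a minimal sketch of how such a three-part request might be represented. The class and function names (`DragInput`, `GenerationRequest`, `linear_drag`) and the file name are hypothetical illustrations, not DragNUWA's actual API; a drag trajectory is modeled simply as a sequence of (x, y) frame coordinates.

```python
# Hypothetical sketch of DragNUWA-style conditioning inputs.
# None of these names come from the real demo; they only illustrate
# the idea of unifying image, text, and trajectory signals.
from dataclasses import dataclass, field


@dataclass
class DragInput:
    """One user-drawn drag: a sequence of (x, y) frame coordinates."""
    points: list  # [(x, y), ...] traced from drag start to drag end


@dataclass
class GenerationRequest:
    """Bundle of the three conditioning signals the model unifies."""
    image_path: str    # initial frame, e.g. the boat photo (hypothetical file)
    text_prompt: str   # semantic guidance
    trajectories: list = field(default_factory=list)  # spatial/temporal guidance


def linear_drag(start, end, steps):
    """Sample a straight drag from start to end over `steps` points."""
    (x0, y0), (x1, y1) = start, end
    return DragInput(points=[
        (x0 + (x1 - x0) * t / (steps - 1),
         y0 + (y1 - y0) * t / (steps - 1))
        for t in range(steps)
    ])


request = GenerationRequest(
    image_path="boat_on_lake.png",
    text_prompt="a boat sailing in the lake",
    trajectories=[linear_drag((120, 200), (420, 180), steps=14)],
)
```

In this sketch the trajectory supplies the motion detail, the prompt the semantics, and the image the spatial layout, mirroring the division of labor the article describes.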

Microsoft’s early 1.5 version of DragNUWA, now available on Hugging Face, incorporates Stability AI’s Stable Video Diffusion model. This feature allows users to animate an image or its elements according to a specific path. When fully developed, this technology could significantly simplify the processes of video generation and editing. Users could potentially transform backgrounds, animate static images, and direct motion paths with simple gestures like drawing a line.

The AI community has expressed considerable excitement about this development, viewing it as a significant leap in the field of creative AI. However, the true test for DragNUWA will be its performance in real-world applications. In its preliminary tests, Microsoft reported that the model was capable of accurately executing camera movements and object motions with various drag trajectories.

Microsoft’s researchers highlight several key capabilities of DragNUWA. Firstly, the model supports complex curved trajectories, enabling the generation of objects moving along specific, intricate paths. Secondly, it allows for variable trajectory lengths, with longer trajectories producing more substantial motion. Lastly, DragNUWA can control the trajectories of multiple objects simultaneously. To the best of Microsoft’s knowledge, no existing video generation model offers such a degree of trajectory controllability, marking DragNUWA as a significant potential contributor to the advancement of controllable video generation in future applications.
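The three capabilities above can be sketched with plain geometry: a curved drag can be sampled from a quadratic Bézier curve, trajectory length varies with the number of sampled points, and multiple objects simply get independent trajectories. This is an illustrative sketch only; `bezier_drag` and the object names are assumptions, not part of DragNUWA.

```python
# Illustrative sketch: curved, variable-length, multi-object drag paths.
# `bezier_drag` and the object names are hypothetical, not DragNUWA's API.
def bezier_drag(p0, p1, p2, steps):
    """Sample a quadratic Bezier curve from p0 to p2 (control point p1),
    giving a curved drag path of `steps` (x, y) points."""
    pts = []
    for i in range(steps):
        t = i / (steps - 1)
        x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
        y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
        pts.append((x, y))
    return pts


# Multiple objects, each with its own curved trajectory; the longer
# trajectory (more points) would imply more substantial motion.
object_drags = {
    "boat": bezier_drag((100, 220), (260, 120), (430, 210), steps=24),
    "cloud": bezier_drag((50, 60), (200, 40), (350, 70), steps=8),
}
```

A Bézier curve is just one convenient way to draw a smooth arc; any sequence of points a user traces with the cursor would serve the same role.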

This development adds to the ever-growing body of research in the AI video generation space. Just recently, Pika Labs gained attention by opening access to its text-to-video interface, which functions similarly to ChatGPT. This interface produces high-quality short videos with a variety of customization options.

As AI continues to evolve, models like DragNUWA exemplify the potential for significant advancements in video generation and editing. By combining text, images, and trajectories, these models offer a level of control and precision previously unattainable. The future of AI-driven video generation looks promising, with ongoing research and development paving the way for more sophisticated and user-friendly applications.

Yasmeeta Oon

Just a girl trying to break into the world of journalism, constantly on the hunt for the next big story to share.