Google DeepMind has quietly unveiled a significant advance in its artificial intelligence (AI) research: on Tuesday, it introduced a new autoregressive model aimed at improving the understanding of long video inputs.
The model, dubbed “Mirasol3B,” takes a new approach to multimodal learning, processing audio, video, and text data in a more integrated and efficient way.
According to a comprehensive blog post co-authored by Isaac Noble, a software engineer at Google Research, and Anelia Angelova, a research scientist at Google DeepMind, the challenge in constructing multimodal models lies in the diversity of the modalities.
“Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text,” they write. “Furthermore, the large volume of data in video and audio signals is much larger than that in text. When combining them in multimodal models, video and audio often cannot be fully consumed and need to be disproportionately compressed. This problem is exacerbated for longer video inputs.”
To address this complexity, Mirasol3B decouples multimodal modeling into separate autoregressive models, each processing inputs according to the characteristics of its modality.
“Our model comprises an autoregressive component for the time-synchronized modalities (audio and video) and a separate autoregressive component for modalities that are not necessarily time-aligned but are still sequential, e.g., text inputs, such as a title or description,” explain Noble and Angelova.
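Noble and Angelova describe this decomposition only at a high level, and no code has been released. As a rough illustration of the idea, the sketch below (a minimal, hypothetical PyTorch example; the class names, dimensions, and layer counts are assumptions, and it omits Mirasol3B’s actual components such as its learned audio-video fusion) wires a causal model over time-aligned audio/video features to a separate causal text model that conditions on its outputs.

```python
# Conceptual sketch only, not Google's released code: hypothetical class names,
# feature dimensions, and layer counts chosen to illustrate the split into two
# autoregressive components.
import torch
import torch.nn as nn


def causal_mask(t: int) -> torch.Tensor:
    """Upper-triangular -inf mask so each position attends only to the past."""
    return torch.triu(torch.full((t, t), float("-inf")), diagonal=1)


class TimeAlignedAR(nn.Module):
    """Autoregressive model over fused audio/video features, one vector per time snippet."""

    def __init__(self, dim: int = 512, layers: int = 4, heads: int = 8):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, av_snippets: torch.Tensor) -> torch.Tensor:
        # av_snippets: (batch, num_snippets, dim)
        return self.encoder(av_snippets, mask=causal_mask(av_snippets.size(1)))


class TextAR(nn.Module):
    """Separate autoregressive text model that cross-attends to the audio/video states."""

    def __init__(self, vocab: int = 32000, dim: int = 512, layers: int = 4, heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, av_states: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) text token ids; av_states: output of TimeAlignedAR
        x = self.embed(tokens)
        x = self.decoder(x, av_states, tgt_mask=causal_mask(tokens.size(1)))
        return self.head(x)  # next-token logits over the text vocabulary


# Toy usage: 8 fused audio/video snippets conditioning a 16-token caption.
av = torch.randn(2, 8, 512)
tokens = torch.randint(0, 32000, (2, 16))
logits = TextAR()(tokens, TimeAlignedAR()(av))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

The point of the split is that the time-aligned component can operate over compact per-snippet audio/video representations, while the text component handles inputs that are sequential but not time-aligned, such as a title or description.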
The announcement comes as the tech industry races to harness AI for analyzing and understanding large volumes of data across different formats. Google’s Mirasol3B is a notable step in that direction, paving the way for applications such as video question answering, including over long videos.
Among the avenues Google may explore are applications on YouTube, the world’s largest online video platform and a major source of revenue for the company.
The model could, in theory, improve user experience and engagement by enabling a broader range of multimodal features: generating captions and summaries for videos, answering questions and providing feedback, creating personalized recommendations and advertisements, and letting users create and edit their own videos using multimodal inputs and outputs.
For instance, the model could generate captions and summaries based on both the visual and audio content of a video, and let users search and filter videos by keywords, topics, or sentiment. That would improve accessibility and discoverability, making it easier for users to find the content they want.
The model could also potentially respond to user questions and provide feedback based on video content, such as explaining the meaning of terms, offering additional information or resources, or suggesting related videos or playlists.
The announcement has drawn a mixed response from the AI community, with both enthusiasm and skepticism. Some experts praise the model’s adaptability and scalability, envisioning potential applications across a range of domains.
For example, Leo Tronchon, an ML research engineer at Hugging Face, tweeted: “Very interesting to see models like Mirasol incorporating more modalities. There aren’t many strong models in the open using both audio and video yet. It would be really useful to have it on [Hugging Face].”
Gautam Sharda, a computer science student at the University of Iowa, tweeted: “Seems like there’s no code, model weights, training data, or even an API. Why not? I’d love to see them actually release something beyond just a research paper?”
The announcement marks a significant milestone in artificial intelligence and machine learning, underscoring Google’s ambition and leadership in developing technologies with the potential to enhance and transform human lives.
It also presents both a challenge and an opportunity for researchers, developers, regulators, and users, who must ensure that the model and its applications align with society’s ethical, social, and environmental values and standards.
As the world becomes increasingly multimodal and interconnected, fostering a culture of collaboration, innovation, and responsibility among stakeholders and the public will be essential to building a more inclusive and diverse AI ecosystem that benefits everyone.