On Monday, OpenAI announced a major update to ChatGPT that enables its GPT-3.5 and GPT-4 AI models to analyze images and respond to them as part of a text conversation. The company is also adding speech-synthesis capabilities to the ChatGPT mobile app, enabling users to hold fully verbal conversations with the AI assistant.
OpenAI intends to roll out these features to Plus and Enterprise subscribers over the next two weeks. Notably, speech synthesis will be available exclusively on iOS and Android devices, while image recognition will be accessible on both the web interface and mobile applications.
The image recognition feature allows users to upload one or more images during a conversation with either the GPT-3.5 or GPT-4 model. OpenAI suggests practical applications such as figuring out what to make for dinner by photographing the fridge and pantry, or troubleshooting a grill that won’t start. Users can also draw on their device’s touch screen to highlight the specific parts of an image they want ChatGPT to focus on.
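For readers curious what image input looks like in code, OpenAI’s public Chat Completions API accepts images alongside text in a single message. The sketch below is purely illustrative (the model name, prompt, and image URL are assumptions, and the consumer ChatGPT app may not share the API’s plumbing):

```python
# Minimal sketch of sending an image plus a question to a vision-capable
# OpenAI chat model. Requires `pip install openai` and an OPENAI_API_KEY
# environment variable. Model name and image URL are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What could I cook with these ingredients?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/fridge.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```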
OpenAI provides a promotional video demonstrating a hypothetical interaction with ChatGPT, where a user seeks guidance on adjusting a bicycle seat, providing images, an instruction manual, and a picture of their toolbox. ChatGPT offers advice on completing the task, although real-world effectiveness remains untested.
On the technical side, OpenAI hasn’t disclosed the inner workings of GPT-4 or its multimodal functionality. What is known is that multimodal AI models typically convert text and images into a shared encoding space, allowing a single neural network to process several types of data. OpenAI may use techniques like CLIP, which aligns image and text representations in the same latent space, to connect visual and text data and make contextual deductions across both modalities.
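To make the shared-latent-space idea concrete, here is a short sketch using OpenAI’s openly released CLIP model through the Hugging Face transformers library. It demonstrates the general alignment technique, not ChatGPT’s actual (undisclosed) pipeline; the image file and candidate captions are made up:

```python
# Sketch of CLIP-style image-text alignment: both modalities are encoded
# into the same latent space, so their similarity can be compared directly.
# This illustrates the general technique, not ChatGPT's internal pipeline.
# Requires `pip install transformers torch pillow`.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bicycle_seat.jpg")  # hypothetical local image
captions = ["a bicycle seat", "a kitchen pantry", "a gas grill"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores in the shared space;
# softmax turns them into a probability over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```

Because image and text land in the same space, the same encoders can score an image against any caption without retraining, which is what makes this style of alignment attractive as a bridge between vision and language models.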
On the audio side, ChatGPT’s new voice synthesis feature enables spoken back-and-forth conversations with the AI, driven by what OpenAI calls a “new text-to-speech model.” Users can choose from five synthetic voices, named “Juniper,” “Sky,” “Cove,” “Ember,” and “Breeze,” which OpenAI crafted in collaboration with professional voice actors. Whisper, the company’s speech recognition system, continues to transcribe the user’s spoken input, as it already has in the ChatGPT iOS and Android apps.
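OpenAI hasn’t published the app’s audio plumbing, but the announced flow (Whisper in, text-to-speech out) can be approximated with the company’s public audio APIs. The sketch below assumes that pipeline; note that the API exposes its own voice names rather than the five offered in the app:

```python
# Sketch of a speech-in, speech-out loop using OpenAI's public audio APIs.
# The ChatGPT app's exact pipeline is not public; this only mirrors the
# announced flow (Whisper for transcription, a TTS model for the reply).
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's spoken question with Whisper.
with open("question.wav", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Generate a text reply with a chat model.
chat = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Synthesize the reply as speech. The API exposes its own voice names
#    (e.g. "alloy"), not the app's "Juniper," "Sky," and so on.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("reply.mp3")
```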
OpenAI emphasizes that ChatGPT has limitations, acknowledging potential issues like misidentifying objects (visual confabulations) and imperfect recognition of non-English languages. The company has conducted risk assessments and sought input from alpha testers, advising caution, especially in high-stakes or specialized contexts.
To address privacy concerns, OpenAI has implemented technical measures that restrict ChatGPT’s ability to analyze and make direct statements about individuals, and the company stresses that ChatGPT is not always accurate and that its models should respect individuals’ privacy.
While OpenAI markets these new features as giving ChatGPT the ability to “see, hear, and speak,” some AI researchers caution against anthropomorphism and hype, noting that the underlying models are simply processing additional data modalities rather than perceiving the world the way humans do.
In summary, OpenAI’s recent announcement regarding ChatGPT represents a significant expansion of its capabilities, with image analysis and voice synthesis features. The actual performance of these updates remains to be evaluated, and OpenAI plans to gradually introduce them to users while refining risk mitigations and improvements over time. Further developments will be shared as these features become widely available in the coming weeks.