ChatGPT becomes Multimodal

PLUS: Spotify's Multi-lingual Voice Cloning, Robots Learn from Internet Videos

Today's top AI Highlights:

  1. ChatGPT can now See, Hear, and Speak

  2. Robotic Learning from Internet Human Videos

  3. Spotify Collaborates with OpenAI for Voice Cloning in Multiple Languages

  4. Getty Images Releases Text-to-Image Model

& so much more!

Read time: 3 mins

Latest Developments

ChatGPT can See, Hear, and Speak 🧒

OpenAI has introduced voice and image capabilities in ChatGPT. The new features will soon roll out to Plus and Enterprise users and will also be available on iOS and Android.

Key Highlights:

  • Voice Capability: 

    • Will allow users to engage in natural voice conversations with ChatGPT. This feature offers five different voices, created in collaboration with professional voice actors, and is powered by a new text-to-speech model and Whisper.

    • Collaboration with partners like Spotify demonstrates the versatility of this voice technology for podcast translation, expanding storytelling reach.

    • The model is proficient at transcribing English text but performs poorly with some other languages, especially those written in non-Roman scripts.

  • Image Capability: 

    • Will let users show one or more images to ChatGPT for a wide range of tasks, like troubleshooting, data analysis, and reasoning tasks.

    • The mobile app includes a drawing tool to focus on specific image details.

    • The multimodal capabilities leverage GPT-3.5 and GPT-4, applying language reasoning skills to both text and images.

  • GPT-4V (Vision): OpenAI has released the system card for GPT-4V, the model behind the image capabilities in ChatGPT.

    • OpenAI collaborated with Be My Eyes to develop GPT-4V, which was used to assist people with visual impairments.

    • GPT-4V's training process incorporates text and image data from the internet and licensed sources.

    • OpenAI applied safety evaluations and mitigations, along with RLHF and red-teaming, to support responsible deployment and address challenges such as hallucinations and high-stakes interpretations.

Robots Learn from Videos 📺

Researchers at Google DeepMind and UC Berkeley have introduced Video Pre-Training for Robots (V-PTR), which leverages internet-scale human video data to enhance robotic reinforcement learning (RL) and teach robots useful skills from human videos online.

Key Highlights:

  • Internet videos are rich in real-world experiences, but they lack the specific information needed for robots to understand and replicate actions. V-PTR bridges this gap and enables robots to generalize and perform tasks more effectively.

  • V-PTR takes a step-by-step approach, teaching robots the big picture from videos, actions that lead to outcomes, and how to apply this knowledge to specific tasks.

  • This research highlights the effectiveness of TD-learning, allowing robots to learn and improve by watching human actions in videos, a significant leap forward in robotic learning.
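For readers unfamiliar with TD-learning, here is a minimal sketch of the basic TD(0) value update on a toy problem. It is not the V-PTR method or the authors' code. The chain environment, the function name `td0_chain`, and all parameter values are assumptions made purely for illustration.

```python
def td0_chain(num_states=4, episodes=200, alpha=0.5, gamma=0.9):
    """Estimate state values with TD(0) on a hypothetical deterministic chain.

    The agent always steps from state s to s + 1 and receives a reward of 1.0
    only on the step that reaches the terminal state (index num_states).
    """
    # One extra slot for the terminal state, whose value stays at 0.
    values = [0.0] * (num_states + 1)
    for _ in range(episodes):
        state = 0
        while state < num_states:
            next_state = state + 1
            reward = 1.0 if next_state == num_states else 0.0
            # TD target: immediate reward plus discounted value of the next state.
            td_target = reward + gamma * values[next_state]
            # Move the current estimate a fraction alpha toward the target.
            values[state] += alpha * (td_target - values[state])
            state = next_state
    return values[:num_states]


if __name__ == "__main__":
    print(td0_chain())
```

Each estimate is nudged toward the reward plus the discounted estimate of the next state, so value information propagates backward from the goal over repeated episodes. In this toy chain the estimates approach 0.9 raised to the number of steps remaining before the final rewarded step.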

Speak from English to Español in a Jiffy 🎙️

Spotify is introducing an AI-powered voice translation feature in partnership with OpenAI, allowing podcasters to replicate their voices in other languages.

Key Highlights:

  • Initially, the tool will translate English-language podcast episodes into Spanish, with plans to add French and German translations in the near future.

  • The core technology behind this feature is OpenAI's speech-to-text model Whisper, which can transcribe English and translate other languages into English.

  • Notable podcasters such as Lex Fridman, Dax Shepard, and Steven Bartlett have tried out this feature.

AI Art + Copyright Protection 🔒

Getty Images has released its text-to-image tool, Generative AI by Getty Images. It uses an AI model provided by Nvidia and was trained on a portion of Getty's library of approximately 477 million stock assets.

Getty's tool not only competes with DALL·E 3 and Midjourney but also offers protection against copyright lawsuits, along with the right to use the generated images worldwide and in perpetuity.

Tools of the Trade ⚒️

  • ChatDev: Virtual software company with multiple intelligent AI agents that form a multi-agent organizational structure. It is a highly customizable and extendable framework based on LLMs for studying collective intelligence.

  • Recall: AI knowledge base that summarizes, categorizes, and reviews online content with features like automatic categorization and spaced repetition.

  • Vespio AI: Boost sales with AI-powered conversation analysis, sentiment prediction, and smart suggestions for higher win rates and revenue.

  • Labelbox: A data-centric platform for building smart apps, offering unified LLM creation, vision tools, AI model integration and data visualization.

  • Edgar: Your 24/7 AI assistant designed for streamlining tasks, automating workflows, managing outreach, and enhancing productivity through intuitive chat interactions.

Enjoying it so far? TWEET NOW to share with your friends!

Hot Takes 🔥

  • Mfers will equate having llama-3 on your local machine to holding a tactical nuke ~ anton

  • short timelines and slow takeoff will be a pretty good call i think, but the way people define the start of the takeoff may make it seem otherwise ~ Sam Altman

  • Your app's name doesn't matter. The most important product of this century is called ChatGPT. ~ Nikita Bier

Meme of the Day 🤡

r/ProgrammerHumor - myDeveloperKnowledgeCutoffIsSeptember2021

That's all for today!

See you tomorrow with more such AI-filled content. Don't forget to subscribe and give your feedback below 👇

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what's trending!!

PS: I curate this AI newsletter every day for FREE, and your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!
