KOSMOS-2: Microsoft’s New AI Breakthrough Generating Text, Images, Video & Sound in Real-Time!
Microsoft has launched a new AI model, KOSMOS-2, which not only improves how we interact with AI but also takes multimodal AI technology to a new level. Ever wished for easier ways to chat with AI, like using pictures instead of long text?
What if your AI could understand and respond to images just as you do?
What if you could ask your AI to draw something for you, or to explain what it sees in a picture?
Well, that’s exactly what KOSMOS-2 can do, and more. But before we dive into the details of this amazing model, let’s first get some background and context on what multimodal AI is and why it matters. Multimodal AI is a type of artificial intelligence that merges different kinds of data like text, images, videos, and sounds. Its aim is to build AI systems that can understand and create content from various sources, just like humans do. In the past, AI systems could only manage one type of data at a time. For instance, some could process text, while others could handle images or sound, but they couldn’t mix these types of data or relate them to each other.
However, this has changed with the creation of multimodal large language models, or MLLMs. They build on another type of AI model known as large language models (LLMs), such as GPT-3 or BERT, to understand and generate different kinds of data. LLMs convert all these different data types into a form they can work with, called tokens. This way, they can handle multiple types of data at once and generate mixed content, much like we do in our everyday lives.
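To make this concrete, here is a tiny conceptual sketch in PyTorch of how an MLLM can place image patches and text tokens in one shared sequence that a single transformer could then process. The names, sizes, and layers are made up for illustration and are not Microsoft's actual code.

```python
# Conceptual sketch: interleaving image and text tokens in one sequence.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, d_model, n_patches = 32000, 768, 64

text_embed = nn.Embedding(vocab_size, d_model)      # text tokens -> vectors
image_encoder = nn.Linear(3 * 16 * 16, d_model)     # stand-in vision encoder: one vector per 16x16 patch

text_ids = torch.randint(0, vocab_size, (1, 10))    # e.g. "a dog is chasing a ball ..."
patches = torch.randn(1, n_patches, 3 * 16 * 16)    # flattened image patches

text_vecs = text_embed(text_ids)                    # (1, 10, 768)
image_vecs = image_encoder(patches)                 # (1, 64, 768)

# The model sees one mixed sequence: [image tokens][text tokens],
# so the same transformer can attend across both modalities.
sequence = torch.cat([image_vecs, text_vecs], dim=1)
print(sequence.shape)                               # torch.Size([1, 74, 768])
```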
Earlier, Microsoft unveiled KOSMOS-1, a groundbreaking multimodal language model trained on large-scale web data containing text, images, and their combinations. It excelled at tasks like writing stories from images, creating image captions, and answering questions about images. However, KOSMOS-1 had its limitations, particularly in understanding and connecting visual information. To illustrate, when you see an image of a dog chasing a ball in a park, you instantly understand the scene and can describe or locate elements using words or coordinates.
KOSMOS-1, on the other hand, struggled with this. It could turn the image into tokens but couldn’t truly grasp its meaning or link it to other data types. It could generate text based on the image but couldn’t point to specific areas of the image using words or coordinates. It couldn’t respond visually, such as by highlighting the dog or the ball, nor answer questions requiring visual reasoning, like the dog’s distance from the ball or the ball’s color.
Now, KOSMOS-2, the latest version of Microsoft’s MLLM, introduces a feature called grounding. This feature allows KOSMOS-2 to interact with images more accurately and meaningfully, using words or coordinates to refer to specific parts of an image, like drawing a circle around the dog or labeling the ball it is chasing. Grounding works by creating hyperlinks between parts of an image and spans of its description. This allows it to connect images with words, almost like a game of connect the dots. When you click on a part of an image, it can show you the matching word or phrase that describes it, and it works the other way around too. Here’s how it does this. First, KOSMOS-2 treats a picture like a checkerboard, breaking it up into a grid of squares. Each square gets a special location tag, such as loc_1_1 or loc_3_4. These tags are then added to the picture’s description at the right spot. Let’s say a picture shows a dog chasing a ball. If the dog’s head is in the square tagged loc_1_1, the sentence describing the picture would read, “A loc_1_1 dog is chasing a ball in a park.”
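To make the grid-and-tag idea concrete, here is a small illustrative sketch in Python. The grid size, the loc_row_col token format, the box-and-phrase markup, and the example coordinates are my own assumptions for illustration, not KOSMOS-2's exact tokens.

```python
# Illustrative sketch: turning a pixel bounding box into grid-cell location
# tags and inserting them next to the phrase they ground. Token names are
# assumptions, not KOSMOS-2's actual vocabulary.
def box_to_location_tokens(box, image_w, image_h, grid=8):
    """Map a pixel box (x1, y1, x2, y2) to top-left / bottom-right grid tags."""
    x1, y1, x2, y2 = box
    col1, row1 = int(x1 / image_w * grid), int(y1 / image_h * grid)
    col2 = int(min(x2, image_w - 1) / image_w * grid)
    row2 = int(min(y2, image_h - 1) / image_h * grid)
    return f"loc_{row1}_{col1}", f"loc_{row2}_{col2}"

# Suppose the dog occupies pixels (10, 15) to (120, 130) in a 640x480 image.
tl, br = box_to_location_tokens((10, 15, 120, 130), 640, 480)

# The grounded caption "hyperlinks" the word span to the tagged grid cells.
caption = f"A <p>dog</p><box><{tl}><{br}></box> is chasing a ball in a park."
print(caption)
```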
When you click on loc_1_1, KOSMOS-2 highlights the dog’s head in the picture, and vice versa. This way, KOSMOS-2 makes it easier to match words with parts of a picture, helping people and computers understand each other better. Grounding makes KOSMOS-2 more dynamic and precise than other multimodal large language models, enabling more human-like interaction. Wondering how KOSMOS-2 acquired this ability? It learned through a training method known as next-token prediction: the model predicts the following token based on the tokens it has seen so far. For instance, given the words “a dog is”, KOSMOS-2 may predict “chasing”.
Unlike other models, KOSMOS-2 can predict not just text tokens, but image and location tokens as well. When given a text token, it might predict an image token next, and then continue the sentence based on that image, like a dog chasing a ball. The model was trained on extensive corpora, including LAION-2B, which contains about 2 billion image-text pairs, and COYO-700M, which contains about 700 million image-text pairs. KOSMOS-2 learned to convert images into tokens, generate text from images and vice versa, and use location tokens for grounding. Thus, by learning from diverse multimodal data, KOSMOS-2 is equipped to handle a variety of tasks efficiently.
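Here is a toy sketch of that training objective, assuming a tiny made-up vocabulary that mixes text, image, and location tokens. The point is simply that one next-token cross-entropy loss covers all three token types; the real model and vocabulary are far larger.

```python
# Toy sketch of next-token prediction over a mixed vocabulary of text,
# image, and location tokens. Vocabulary, sequence, and model are illustrative.
import torch
import torch.nn as nn

vocab = ["a", "dog", "is", "chasing", "ball", "<image>", "<loc_1_1>", "<loc_3_4>"]
stoi = {t: i for i, t in enumerate(vocab)}

# One mixed sequence: a text token, an image token, a location token, more text.
tokens = torch.tensor([[stoi[t] for t in
                        ["a", "<image>", "<loc_1_1>", "dog", "is", "chasing"]]])

embed = nn.Embedding(len(vocab), 32)     # stand-in for the transformer
lm_head = nn.Linear(32, len(vocab))

hidden = embed(tokens[:, :-1])           # hidden state at each position
logits = lm_head(hidden)                 # predict the next token at every position
loss = nn.functional.cross_entropy(      # one loss, regardless of token type
    logits.reshape(-1, len(vocab)), tokens[:, 1:].reshape(-1)
)
print(loss.item())
```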
Let’s quickly review how KOSMOS-2 performs and what its practical applications are. KOSMOS-2 excels in tasks such as phrase grounding (locating phrases in images) and language processing, and it consistently outperforms other models. For example, in locating phrases in images, it achieves 91.3% accuracy, compared to 78.4% and 86.7% for the best competing models.
But what makes KOSMOS-2 special goes beyond its performance.
Here are some practical uses. Grounded image captioning: KOSMOS-2 can generate detailed captions for images, marking specific regions with location tokens. This can help people with visual impairments understand images, assist students in learning new concepts, and enable content creators to craft more immersive stories. Grounded visual question answering: it can answer questions about specific regions within images, denoted by location tokens. This is handy for users looking for detailed information about an image, researchers extracting insights, or customers making image-based decisions. Grounded visual reasoning: it can perform logical or mathematical operations based on specific regions in an image.
This could help solve image-based puzzles, teach mathematical skills, or provide a foundation for game developers to create new challenges. This AI has countless uses and benefits tailored to your needs, and honestly, the potential is limitless. But don’t take my word for it.
You can try KOSMOS-2 for yourself and see what it can do. Microsoft has released an online demo of KOSMOS-2 on GitHub, where you can interact with the model and test its capabilities. You can upload your own images or use the provided ones, and you can ask questions or give instructions to the model using text prompts. You can also see how the model creates hyperlinks between image regions and caption tokens, and how it provides visual responses.
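If you’d rather run the model in code than in the demo, here is a minimal sketch assuming the Hugging Face transformers integration of KOSMOS-2 (checkpoint "microsoft/kosmos-2-patch14-224"). Treat the exact calls as approximate and check the official repo and demo for the reference usage.

```python
# Minimal sketch: grounded captioning with KOSMOS-2 via Hugging Face
# transformers (assumed integration; verify against the official repo).
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("dog_and_ball.jpg")     # any local image (placeholder filename)
prompt = "<grounding>An image of"          # <grounding> asks the model to emit location tokens

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus (phrase, bounding-box) pairs.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```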
The demo is very easy to use and fun to play with. You can explore different scenarios and tasks and see how KOSMOS-2 responds to them, and you can also compare KOSMOS-2 with other MLLMs or unimodal models and see how it performs better than them. I highly recommend checking out the demo and experiencing KOSMOS-2 for yourself. You will be amazed by what this model can do and how it can change the way you interact with AI. Go ahead and try it out, and let me know what you think in the comments section.
I would love to hear your feedback and suggestions. I hope you enjoyed it and learned something new. Thank you so much for listening and I’ll see you in the next one.