
Gemini's Powerful But Overlooked Multimodal Capability

In a recent interview with The Verge, Alphabet and Google CEO Sundar Pichai said, "We made Gemini natively multimodal... with audio, video, text, images, and code... working on the input and output side — and we are training models using all of that — maybe in the next cycle." That's huge.

Homo sapiens appeared on Earth some 300,000 years ago. For 99.99% of the time since then, humans perceived the world only through sight, sound, smell, taste, and touch. The best evidence, drawn from signs of symbolic and abstract behavior among early modern humans living in social groups, is that language emerged around 100,000 to 200,000 years ago. Writing was invented only about 5,000 years ago, and literacy was limited to a tiny percentage of people until about 600 years ago, when printing presses made books and pamphlets broadly available.

Large language models have demonstrated amazing capabilities, but until recently they processed only text, the most recent form of human communication. Even as they began to accept images as input, they first had to convert those images into text to process them, which limited their ability to understand and act on what they saw.

Pichai indicates that Gemini is now natively multimodal, a powerful capability that has been overlooked in most commentary on recent announcements from Google I/O. Even in our time of high literacy, the primary way we perceive the world is visual. I recently had eye surgery and could not see for short periods of time, which gave me a great appreciation for my sight. Even as you read this, your brain is exercising an incredible ability to focus on the visual signals most important to the task at hand, while ignoring most of the light entering your eyes.

As with humans, perception is the first and crucial element of an artificial agent's ability to become autonomously intelligent. Imagine an autonomously intelligent agent optimizing the production schedule on a plant floor. The agent visually perceives what is occurring at any point in time, such as where supplies and work-in-process are located or how humans and robots are moving. It focuses on the elements it can see that are important to optimizing the schedule and ignores other things in view, such as the rotation of fan blades in the plant's ceiling. The agent's visual perception may exceed a human's, for example by seeing ultraviolet or infrared light, which may be important in quality control. It may see a chosen spectrum in a snapshot of time in an image, or it may process the change in that spectrum over time in a video. The agent may also be able to hear what's occurring in the plant; for example, it perceives a bang and then identifies that a box of supplies has tipped over.

Being able to process audio, video, text, and images natively is a massive step towards autonomously intelligent agents that create The Intelligence Tsunami.


Let's talk.

Let's inspire your team and your organization to excel.

John Warner

