Intelligent Agents with "Common Sense"

John Warner
Apr 8, 2024

Updated: Apr 9, 2024

The Intelligence Tsunami (https://www.amazon.com/dp/B0CXH5DDRP) delves into an approach proposed by AI thought leader Yann LeCun to give intelligent agents the kind of "common sense" that humans have. LeCun calls this technique Joint Embedding Predictive Architecture (JEPA). This article describes I-JEPA, as presented in "I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI" (https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/).

LeCun often points out that while a teenager can be taught to drive with a few hours of training, we still haven't achieved full self-driving in autonomous vehicles despite spending immense time and money trying to train these systems. The teenager has likely spent much of her life as a passenger in a car before she tried to drive one. She self-learned a "world model" of how a car operates, for example, gaining an intuitive feel for how traffic typically flows. She also learned which parts of the scene in front of her deserve attention. The cars and pedestrians she sees are important, but she can pay less attention to the trees on the side of the road because they won't move in front of her. All of this self-learning adds up to what we call "common sense" in humans. A teenager who grew up on a remote island without cars or cities would take much longer to learn to drive than teenagers who grew up in our society.

JEPA attempts to address one of the biggest challenges in artificial intelligence: developing systems that can learn and reason more like humans. Current AI models, such as the large language models behind ChatGPT, are incredibly powerful for specific tasks but lack the general "common sense" understanding of the world that allows humans to adapt to new situations so readily.

JEPA is a proposed new direction to overcome this limitation - creating machines that can learn internal models that capture how the world works. The goal is for AI to build up a rich knowledge base through observing data, similar to how humans accumulate common sense through experience.

This approach is known as self-supervised learning, where the AI tries to predict aspects of its input data rather than relying on human-labeled datasets. For example, an image recognition model could learn by attempting to fill in masked regions of images. However, current self-supervised techniques like this have limitations.
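
As a rough illustration of the masked-prediction idea described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the helper names (`make_masked_example`, `pixel_loss`), the random 8x8 "image", and the trivial baseline that guesses the average visible brightness. A real system would train a neural network on real images; the point is only that the training target comes from the image itself, not from a human label.

```python
import random

def make_masked_example(image, top, left, size, fill=0.0):
    """Split one unlabeled image into (masked input, hidden target patch)."""
    target = [row[left:left + size] for row in image[top:top + size]]
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
    return masked, target

def pixel_loss(predicted, target):
    """Mean squared error between predicted and true pixel values."""
    diffs = [(p - t) ** 2
             for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)

# Hypothetical 8x8 grayscale "image" with random pixel values in [0, 1).
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]

top, left, size = 2, 3, 3
masked, target = make_masked_example(image, top, left, size)

# Trivial baseline "model": guess the mean visible brightness for every
# hidden pixel. A trained network would replace this step.
hidden = {(r, c) for r in range(top, top + size)
          for c in range(left, left + size)}
visible = [image[r][c] for r in range(8) for c in range(8)
           if (r, c) not in hidden]
guess = sum(visible) / len(visible)
prediction = [[guess] * size for _ in range(size)]

loss = pixel_loss(prediction, target)
print(f"pixel reconstruction loss: {loss:.4f}")
```

The loss here is measured pixel by pixel, which is exactly the kind of objective that I-JEPA moves away from.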

Image Joint Embedding Predictive Architecture (I-JEPA) takes a different approach inspired by LeCun's vision. Rather than trying to reconstruct specific pixels, I-JEPA aims to create an abstract representation capturing the high-level semantics and positional relationships within an image.

Here's how it works: The I-JEPA model is trained on images with some sections masked out. From the visible portion, it tries to predict a compressed representation encoding the meaningful features and layouts for the masked regions rather than filling in exact pixel values.

For example, if part of a dog's head is masked, I-JEPA would output a representation describing a dog head's typical shape and position without obsessing over recreating every tuft of fur. This matches more closely how humans perceive the world at a conceptual level.
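
The difference between scoring a prediction in pixel space and in representation space can be sketched with a toy example. This is not Meta's implementation: the hand-written `encode` function below is a hypothetical stand-in for a learned encoder, and the three 4x4 patches are made up. It only shows why a prediction that gets the overall layout right, but not the fine texture, can score perfectly in representation space while still being penalized pixel by pixel.

```python
def encode(patch):
    """Toy stand-in for a learned encoder: summarize a patch as
    [mean brightness, left-to-right contrast, top-to-bottom contrast]."""
    h, w = len(patch), len(patch[0])
    n = h * w
    mean = sum(sum(row) for row in patch) / n
    left = sum(sum(row[: w // 2]) for row in patch)
    right = sum(sum(row[w - w // 2:]) for row in patch)
    top = sum(sum(row) for row in patch[: h // 2])
    bottom = sum(sum(row) for row in patch[h - h // 2:])
    return [mean, (right - left) / n, (bottom - top) / n]

def squared_distance(a, b):
    """Squared Euclidean distance between two representations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pixel_loss(predicted, target):
    """Mean squared error between predicted and true pixel values."""
    diffs = [(p - t) ** 2
             for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)

# The true hidden patch: fine checkerboard texture ("tufts of fur").
textured = [[0.75 if (r + c) % 2 == 0 else 0.25 for c in range(4)]
            for r in range(4)]
# Guess A: right overall brightness and layout, no fine texture.
smooth = [[0.5] * 4 for _ in range(4)]
# Guess B: same average brightness, but the wrong layout (bright on the right).
shifted = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]

print(pixel_loss(smooth, textured))                           # 0.0625
print(squared_distance(encode(smooth), encode(textured)))     # 0.0
print(squared_distance(encode(shifted), encode(textured)))    # 0.25
```

Guess A is penalized in pixel space for missing the texture but matches the target exactly in representation space, while guess B, which gets the layout wrong, is penalized in representation space. In an actual model the encoder is learned rather than hand-written.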

Experiments showed that even without being trained on human-labeled data, the I-JEPA approach allows the AI to learn powerful representations capturing object semantics and spatial relationships. Meta reports state-of-the-art performance for low-shot classification on ImageNet, where only a handful of labeled examples per class are available.

Importantly, I-JEPA is also much more computationally efficient than traditional approaches, because it predicts high-level representations rather than reconstructing full images. The I-JEPA model achieved its ImageNet results using 16 GPUs in under 3 days, where other methods typically require 2-10x more compute.
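
To put those figures in perspective, here is a back-of-the-envelope calculation. It assumes that "under 3 days" means at most 72 hours on each of the 16 GPUs and that the 2-10x comparison is measured in GPU-hours; both readings are assumptions, so the numbers are upper-bound estimates rather than reported measurements.

```python
# Figures taken from the text above; the GPU-hour interpretation is an assumption.
gpus = 16
max_hours = 3 * 24                      # "under 3 days" read as at most 72 hours

ijepa_gpu_hours = gpus * max_hours      # upper bound for I-JEPA training
other_low = 2 * ijepa_gpu_hours         # "2x more" end of the range
other_high = 10 * ijepa_gpu_hours       # "10x more" end of the range

print(f"I-JEPA: at most {ijepa_gpu_hours} GPU-hours")               # 1152
print(f"Other methods: about {other_low} to {other_high} GPU-hours")  # 2304 to 11520
```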

The promise of I-JEPA and related "joint embedding predictive architectures" is to break through current limitations by mimicking how humans develop an intuitive understanding of the world through observation and prediction. Rather than mere pattern matching, the AI builds conceptual models enabling deeper reasoning and planning for novel scenarios.

While I-JEPA is just an initial step focused on computer vision tasks, the model is being open-sourced and will be expanded to handle video, audio, text and multimodal data. The ultimate goal is an AI that combines rich world knowledge with analytical reasoning abilities - a key milestone towards achieving artificial general intelligence.

Let's talk.

Let's inspire your team and your organization to excel.

John Warner

JohnWarner@InnoVenture.com

864-561-6609

https://www.innoventure.com/

The Intelligence Tsunami: Get Off the Beach, Ride the Wave, or Get Swept Away