Yann LeCun: Relying solely on LLMs to reach AGI is nonsense; the future of AI needs the JEPA world model (a long-form interview from GTC)
This article is an edited transcript of a public conversation between Yann LeCun, Meta's chief AI scientist and Turing Award winner, and NVIDIA Chief Scientist Bill Dally. LeCun explains why he believes large language models (LLMs) can never truly achieve AGI. (Synopsis: OpenAI releases o3 and o4-mini, its strongest reasoning models yet: they can reason over images, select tools automatically, and deliver breakthroughs in math and coding performance) (Background: OpenAI is quietly building its own community platform, taking aim at Musk's X)

While large language models (LLMs) are accelerating the world's embrace of AI, Yann LeCun, known as the father of convolutional neural networks and now chief AI scientist at Meta, recently surprised many by saying that his interest in LLMs has waned and that he believes LLM development is approaching a bottleneck. In an in-depth conversation with NVIDIA Chief Scientist Bill Dally last month, LeCun laid out his views on the future direction of AI, stressing that understanding the physical world, persistent memory, the ability to reason and plan, and the open-source ecosystem are the keys to leading the next wave of the AI revolution.

Bill Dally: Yann, a lot of interesting things have happened in the AI space over the past year. In your opinion, what has been the most exciting development of the past year?

Yann LeCun: Too many to count, but let me tell you one thing that might surprise some of you. I'm not that interested in large language models (LLMs) anymore. LLMs are already at the tail end; they are in the hands of the product people in industry, who are improving them at the margins, trying to get more data and more compute, and generating synthetic data. I think there are more interesting problems in four areas: how to make machines understand the physical world; how to give them persistent memory, which is not talked about much; and the last two, how to get them to reason and to plan. Of course, there has been some effort to get LLMs to do reasoning, but in my opinion this is a very simplistic way of looking at reasoning. I think there may be a better way to do it. So I'm excited about things that a lot of people in the tech community may not get excited about for another five years. Right now they look less exciting, because they are somewhat obscure academic papers.

Understanding World Models and the Physical World

Bill Dally: But if an LLM isn't what reasons about the physical world, keeps persistent memory, and plans, what will the underlying model be?

Yann LeCun: A lot of people are working on world models. What is a world model? We all have models of the world in our heads. It's basically what allows us to manipulate thoughts. We have a model of the current world: you know that if I push this bottle from the top, it's likely to tip over, but if I push it from the bottom, it slides. If I press too hard, it may burst.

[Screenshot from the Yann LeCun interview]

We acquire models of the physical world in the first months of our lives, and they allow us to cope with the real world. Coping with the real world is much more difficult than coping with language. We need system architectures that can really handle the real world, and they are completely different from the ones we currently deal with. An LLM predicts tokens, but a token can be anything. Our self-driving car model uses tokens from sensors and generates tokens that drive the vehicle. In a sense, it is reasoning about the physical world, at least about where it is safe to drive and how not to hit a pillar.
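Before the next exchange on why discrete tokens fall short, here is a minimal, hypothetical PyTorch sketch of what "predicting tokens" means in practice: the model's only output is a probability distribution over a finite vocabulary (around 100,000 entries in a typical LLM, as LeCun notes below). The module, vocabulary size, and backbone choice are illustrative assumptions, not anything from the interview.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 100_000  # a typical LLM vocabulary size, per the interview
EMBED_DIM = 512       # illustrative model width

class TinyNextTokenModel(nn.Module):
    """Toy model: embeds a token sequence and scores every possible next
    token. Real LLMs use stacks of transformer blocks as the backbone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.backbone = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.to_logits = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        h, _ = self.backbone(self.embed(token_ids))
        return self.to_logits(h[:, -1])        # logits for the next token

model = TinyNextTokenModel()
context = torch.randint(0, VOCAB_SIZE, (1, 16))   # dummy 16-token context
probs = F.softmax(model(context), dim=-1)         # 100,000 values in [0, 1]
print(probs.shape, float(probs.sum()))            # (1, 100000), sums to ~1.0
```

The point of the sketch is only that the output is always a distribution over a fixed, discrete set of symbols, which is exactly the property questioned next.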
Bill Dally: Why isn't a token the right way to represent the physical world?

Yann LeCun: Tokens are discrete. When we talk about tokens, we usually mean a finite set of possibilities; in a typical LLM, the number of possible tokens is around 100,000. When you train a system to predict tokens, you can never train it to predict exactly the token that follows in a sequence of text. What you can do is produce a probability distribution over all possible tokens in the dictionary, which is just a long vector of 100,000 numbers between zero and one that sum to one. We know how to do that, but we don't know how to do it with video, with high-dimensional, continuous, natural data.

Every attempt to make a system understand the world, or build a mental model of the world, by training it to predict video at the pixel level has largely failed. Even training a system, some kind of neural network, to learn a good representation of an image by reconstructing it from a corrupted or transformed version fails. Those methods work a bit, but not as well as the alternative architectures we call joint embedding, which basically do not try to reconstruct at the pixel level. They try to learn an abstract representation of the image, video, or natural signal they are trained on, so that predictions can be made in that abstract representation space.

Yann LeCun: The example I use a lot is this: if I shoot a video of this room, move the camera, and stop here, and then ask the system to predict what comes next in that video, it might predict that this is a room with people sitting in it, and so on. It cannot predict what each of you looks like; that is completely unpredictable from the initial footage. There are many things in the world that are simply unpredictable. If you train a system to make predictions at the pixel level, it will spend all of its resources trying to figure out details it simply cannot invent. That is a complete waste of resources. Every time we have tried, and I have been working on this for 20 years, training a self-supervised system to predict video at the pixel level has not worked. It only works if it is done at the representation level. That means those architectures are not generative.

Bill Dally: You seem to be saying that transformers can't do this, yet people build vision transformers and get great results.

Yann LeCun: I didn't mean that, because you can use transformers for this. You can put transformers inside these architectures. It's just that the kind of architecture I'm talking about is called a joint embedding predictive architecture (JEPA). Take a video, an image, or whatever, run it through an encoder to get a representation; then take the continuation, or a corrupted or transformed version, of that text, video, or image and run it through an encoder as well; now try to make the prediction in that representation space, not in the input space. You can use the same training method, filling in the blanks, but you do it in this latent space rather than on the raw input.

Yann LeCun: The hard part is that if you are not careful and don't use clever techniques, the system collapses. It ignores the input entirely and produces a constant representation that carries no information about the input.
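Below is a minimal, hypothetical PyTorch sketch of the joint embedding predictive idea as described here: encode the context and the target, predict the target's representation from the context's representation, and compute the loss in that latent space rather than in pixel space. The variance penalty at the end is one common way (in the style of VICReg) to discourage the collapse LeCun mentions; the interview does not specify the anti-collapse mechanism, so treat this as an illustration, not Meta's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Hypothetical encoder mapping an input (e.g. video frames flattened
    to a vector here) to an abstract representation."""
    def __init__(self, in_dim=1024, rep_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, rep_dim))

    def forward(self, x):
        return self.net(x)

encoder = Encoder()                       # shared weights for context and target
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

def jepa_loss(context, target):
    """Predict the target's representation from the context's representation.
    The loss lives in representation space, not pixel space."""
    s_context = encoder(context)
    with torch.no_grad():                 # common trick: no gradient through the target branch
        s_target = encoder(target)
    prediction = predictor(s_context)
    pred_loss = F.mse_loss(prediction, s_target)

    # Illustrative anti-collapse term: keep each representation dimension
    # from shrinking to a constant, so the encoder cannot ignore its input.
    std = s_context.std(dim=0)
    variance_penalty = F.relu(1.0 - std).mean()
    return pred_loss + variance_penalty

# Dummy usage: "context" could be early frames, "target" the continuation.
context = torch.randn(32, 1024)
target = torch.randn(32, 1024)
loss = jepa_loss(context, target)
loss.backward()
```

The design point to notice is that nothing in the loss asks the model to reconstruct pixels; it only has to predict the abstract representation of what comes next, which is why the unpredictable details LeCun describes never have to be invented.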