We Have Not Reached AGI. The Target Has Not Been Defined.
In 2023, GPT-4 passed the bar exam in the 90th percentile and the field lost its mind a little.
Threshold crossed. New kind of mind. Several prominent researchers said AGI was either here or close enough that the distinction barely mattered. The discourse moved on before anyone checked whether the foundation held.
Six months later the same systems were telling users that Napoleon was alive in 1910, that legal cases had been decided in jurisdictions where they had never been filed, that scientific papers existed that no one had written. And not as edge cases caught by researchers hunting for failure modes. As routine outputs under ordinary use.
That gap is what this piece is about.
It is not a gap that more training data closes. Antonio Damasio's work on patients with damage to the ventromedial prefrontal cortex is worth holding in mind here. Those patients reasoned perfectly. Analyzed options, modeled outcomes, constructed arguments. What they could not do was make functional decisions, because the link between bodily consequence and cognition had been severed. Reasoning untethered from consequence produces outputs that look coherent and are practically catastrophic. That pattern shows up in current AI systems constantly. It is not a bug. It is structural.
The field has treated the gap as a PR problem. It is a conceptual one.
Why the Benchmarks Don't Settle Anything
ARC-AGI was designed to test genuine novel reasoning. Models are now trained directly against it. MMLU and its cousins measure retrieval and pattern completion dressed up as understanding. Coding benchmarks test performance on problems similar enough to training data to be solved by sophisticated interpolation.
All of them share the same flaw: closed, static, domain-specific. Human intelligence does not work that way. It is continuous, self-directed, capable of operating in domains that did not exist last year. A system can pass every benchmark currently in use and be nowhere near general intelligence. Chollet made this argument in 2019 and the field has largely ignored him.
This part of the argument is not new. It is here because what follows depends on understanding that the benchmark failures are not engineering problems waiting to be patched. They are symptoms. The disease is something else.
The Tradition the Field Inherited and the Ones It Ignored
The Western analytic tradition that AI built itself on is not internally consistent about what intelligence is. The field absorbed one strand. Propositional, computationalist, representationalist. The mind as information processor. Knowledge as accurate representation of external states of affairs.
Heidegger spent most of his career arguing that the entire Western philosophical tradition had the nature of understanding backwards. Understanding is not something a mind does to a world. It is what happens when a being already thrown into a situation, already equipped with habits and skills and a body, engages with what it encounters. The hammer is not understood by running a feature analysis on it. It is understood by hammering. The analysis comes after, if it comes at all, as a secondary abstraction from something more primary.
Merleau-Ponty sharpened this. A blind person navigating with a cane does not consciously process the cane's resistance and convert it into spatial data. They feel the floor. The cane disappears into their body schema and they are simply present to the world. The body is not a sensor array feeding data to a central processor. It is the thing that knows. Current AI systems have processed millions of descriptions of this phenomenon. They can reproduce Merleau-Ponty's argument in passable form. They have no idea what he is talking about.
Gadamer's contribution cuts differently. Understanding, for him, is always a fusion of horizons. You bring a particular history, particular preconceptions, a particular stake in what you might find. The text you encounter comes from a different horizon. Understanding is what happens when those horizons meet and you are changed by it. Not updated. Changed. A system with no history it actually holds, no preconceptions in any genuine sense, no stake in what it encounters, is not interpreting. It is performing the surface behaviors of interpretation against a training distribution.
Dewey and the pragmatists add the piece about consequences. Intelligence is constituted by engagement with real stakes. Not representations of stakes. Actual consequences that feed back into what you are and do. A system that generates outputs regardless of whether those outputs are right or wrong, helpful or harmful, has no skin in the game. The pragmatist case against calling it intelligent is not philosophical squeamishness. It is a structural observation.
The Frankfurt School adds something the field should find uncomfortable. Genuine rationality is not technical efficiency at solving stated problems. It is the capacity to question the framing of the problem itself, to notice when the terms of a question encode assumptions, to reason about the conditions of one's own reasoning. A system optimized to produce outputs that satisfy user requests is engaged in strategic action. It is not reasoning about the conditions of its own reasoning. It cannot be, structurally.
Sartre's contribution is the one that has become practically relevant. He argued that genuine selfhood is constituted by choice under conditions of real uncertainty and real consequence. An entity whose behavior is fully determined by its training is not choosing. It is executing. The kind of self-knowledge that comes from having chosen, from having faced genuine alternatives and lived with what followed, is simply not available to such a system.
Bakhtin closes the European argument. Meaning is never produced by a single voice. It always emerges from genuine encounter between perspectives that are really other to each other. The alterity has to be real. A system producing outputs shaped to resemble responses to other voices is generating monologue with the surface texture of dialogue. There is no meaning there in Bakhtin's sense. Only its appearance.
What Every Other Civilization Figured Out
The Confucian tradition runs on a premise that could not be more foreign to current AI development: genuine knowledge is inseparable from the moral condition of the knower. Not as a pious add-on. As a structural claim. You cannot know something you are not in right relation to. Knowledge without virtue is not inferior knowledge. It is not knowledge.
Zhuangzi made a sharper point. Language is not a vehicle for conveying reality. It is a distortion of it. Not just inadequate but actively obscuring, because it suggests that propositional structure is what reality has. Genuine knowing cannot be accumulated or transmitted. It has to be inhabited. A system made entirely of language doing entirely linguistic operations has not approached the problem of knowing. It has substituted a different problem that resembles it.
Al-Ghazali's three levels are the framework that makes the problem clearest. Propositional certainty. Direct experiential knowledge. Knowledge by becoming, where what is known changes who you are. Current AI lives in the first level. The second and third are not accessible to it by design, not because anyone chose to exclude them but because the architecture cannot instantiate them.
The Hesychast tradition in Eastern Orthodox Christianity arrives at the same place from yet another direction. Genuine knowledge requires a specific quality of attention that discursive thought cannot reach regardless of how sophisticated it becomes. More computation does not get you there. A different relationship to the act of knowing does.
A clarification worth making explicit. The point is not that AI fails standards most humans also fail. These traditions would exclude most humans by their own criteria. The point is narrower: these frameworks represent genuine epistemological territory that is substantially absent from the training corpus. A system evaluated only within the tradition it was trained on is not being evaluated for general intelligence. It is being evaluated for fluency inside a particular tradition.
Four Conditions. Not a Framework. A Diagnosis.
Something keeps recurring across traditions that had no contact with each other and often disagreed sharply about everything else.
A persistent self that accumulates. Not a session. A being with a history that shapes what it can perceive and understand. Heidegger's thrownness. Gadamer's horizon. Bergson's duration. The Confucian tradition's insistence that knowledge accrues through sustained moral practice.
Real stakes. Outcomes that matter to the system in ways that shape its engagement. Not preference weights. Actual consequence that feeds back. Dewey. The pragmatists. Al-Ghazali's knowledge by becoming.
A body in a world. Participation in physical reality that grounds abstraction. Merleau-Ponty. Heidegger. Every tradition that insists knowing is something you do with your whole being and not just with your capacity for propositional inference.
Transformability. The capacity to be genuinely changed by encounter. Not weight updates. The meeting itself reshaping subsequent cognition. Gadamer's fusion of horizons. Buddhist liberation. The Hesychast transformation of attention.
These conditions appear across every serious tradition of inquiry into the nature of knowing developed independently across human civilization. That convergence is not coincidence and it is not cultural preference. It is these traditions tracking something real.
The claim is not that machine intelligence must look like human intelligence. The claim is that no alternative route has been demonstrated, that current failure patterns map directly onto this diagnosis, and that the field has not acknowledged the specification problem clearly enough to even begin addressing it.
What the Neuroscience Confirms
The Damasio case is worth dwelling on. Patients with ventromedial prefrontal cortex damage reason perfectly. They model outcomes, construct arguments, analyze situations with genuine sophistication. What they cannot do is make functional decisions, because the link between bodily consequence and cognition has been cut. Reasoning untethered from consequence produces outputs that look coherent and are practically catastrophic.
Friston's predictive processing framework sharpens this. Affect is not a system running parallel to cognition and occasionally influencing it. It is a component of the processing itself, modulating how much the system trusts incoming information against its prior models. Pull it out and you do not have the same system with one module missing. You have a different algorithm.
The Cognitive Architecture Problem
Autistic cognition frequently involves heightened pattern recognition and direct sensory engagement that neurotypical processing filters out. Temple Grandin does not reason through livestock facilities sequentially. She walks through them in her mind, experiences them from the animal's perspective, locates stress points from the inside. That is not a degraded form of propositional reasoning. It is a different cognitive configuration that produces genuine knowledge through a different channel.
ADHD cognition generates lateral connections across distant domains through associative processing and hyperfocus states. Synesthetic cognition produces a genuinely different relationship between perception and abstract knowledge. If intelligence is genuinely plural across cognitive profiles, a single benchmark against a single evaluation framework cannot capture it. The AGI target is not a point. It is a distribution.
The Three Problems, and Why They Are Being Worked in the Wrong Order
The field has a sequencing problem.
First comes the specification problem. The target has not been defined. The implicit definition embedded in current AGI discourse is narrow, culturally specific, and has not been stated clearly enough to be evaluated or challenged. Progress toward an undefined destination is not progress.
Second comes the architectural problem. Even against the narrow implicit specification, current architectures have documented gaps that more training has not closed. Causal reasoning. Continual learning without overwriting prior knowledge. Goal pursuit across long horizons without drift. Grounded language understanding anchored in consequence rather than co-occurrence statistics.
Third comes the scaling problem. Given the right specification and the right architecture, scale matters. The field is working hard on the third problem while the first two sit largely unaddressed.
The failure pattern visible in current systems is not a scaling failure dressed up as something more interesting. It is a specification failure expressing itself through architectural limitations. These systems fail where they fail because they were built without acknowledging what general intelligence actually requires. Adding more parameters to a system that lacks continuity, stakes, embodiment, and transformability produces a more capable version of a system that lacks continuity, stakes, embodiment, and transformability.
What current systems can do is genuinely impressive. But impressive within a distribution is not general. The distance between those two things is not primarily about compute or data or model size. It is about whether the field is willing to ask what general intelligence actually requires and sit with how far the honest answer is from where we currently are. That question has not been asked seriously. It needs to be.