Statisticians like to insist that correlation should not be confused with causation. Most of us intuitively understand that this is actually not a very subtle difference. We know that correlation is in many ways weaker than a causal relationship. A causal relationship invokes some mechanism, some process by which one process influences another. A mere correlation simply means that two processes happened to exhibit some relationship, perhaps by chance, perhaps influenced by yet another unobserved process, perhaps by an entire chain of unobserved and seemingly unrelated processes.
When we rely on correlation, we can have models that are very often correct in their predictions, but they might be correct for all the wrong reasons. This distinction between a weak, statistical relationship and a much stronger, mechanistic, direct, dynamical, causal relationship is, in my mind, at the core of the fatal weakness in the contemporary approach to AI.
The argument
Let me role-play what I think is a distilled version of a dialog between an AI enthusiast and a skeptic like myself:
AI enthusiast: Look at all these wonderful things we can do now using deep learning. We can recognize images, generate images, generate reasonable answers to questions, this is amazing, we are close to AGI.
Skeptic: Some things work great indeed, but the way we train these models is a bit suspect. There doesn't seem to be a way for e.g. a visual deep learning model to understand the world the same way we do, since it never sees the relationships between objects; it merely discovers correlations between stimuli and labels. Similarly for text-predicting LLMs and so on.
AI enthusiast: Maybe, but who cares, ultimately the thing works better than anything before. It even beats humans in some tasks; it's just a matter of time before it beats humans at everything.
Skeptic: You have to be very careful when you say that AI beats humans, we've seen numerous cases of data leakage, decaying performance with domain shift, specificity of dataset and so on. Humans are still very hard to beat at most of these tasks (see radiologists, and the discussions around breeds of dogs in ImageNet).
AI enthusiast: Yes, but there are measurable ways to verify that a machine gets better than a human. We can calculate the average score over a set of examples, and when that number exceeds that of a human, it's game over.
Skeptic: Not really, this setup smuggles in a gigantic assumption that every mistake counts equally to any other and is evenly balanced out by a success. In real life this is not the case. Which mistakes you make matters a lot, potentially a lot more than how frequently you make them. Lots of small mistakes are not as bad as one fatal one (see the sketch after this exchange).
AI enthusiast: OK, but what about the Turing test? Ultimately, when humans become convinced that an AI agent is sentient just as they are, it's game over, AGI is here.
Skeptic: Yes, but none of the LLMs have really passed any serious Turing test, precisely because of their occasional fatal mistakes.
AI enthusiast: But GPT can beat a human at programming, can write better poems, and makes fewer and fewer mistakes.
Skeptic: But the mistakes it occasionally makes are quite ridiculous, unlike any a human would have made. And that is a problem, because we can't rely on a system which makes these unacceptable mistakes. We can't extend to it the guarantees we implicitly make for sane humans entrusted with critical missions.
The overall position of the skeptic is that we can't just look at statistical measures of performance and ignore what is inside the black boxes we build. The kind of mistakes matters deeply, and how these systems reach correct conclusions matters too. Yes, we may not understand how brains work either, but empirically most healthy brains make similar kinds of mistakes, which are mostly non-fatal. Occasionally a "sick" brain will make critical mistakes, but such brains are identified and prevented from e.g. operating machines or flying planes.
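To make the "not every mistake counts equally" point concrete, here is a minimal sketch with entirely made-up numbers: two hypothetical models with comparable headline accuracy, where the one that errs less often is actually far worse once the severity of its mistakes is taken into account.

```python
# Toy illustration (hypothetical numbers): average accuracy hides asymmetric costs.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Model A: errs on 5% of cases, every error is minor (cost 1).
errors_a = rng.random(n) < 0.05
costs_a = errors_a * 1.0

# Model B: errs on only 1% of cases, but each error is catastrophic (cost 1000).
errors_b = rng.random(n) < 0.01
costs_b = errors_b * 1000.0

print(f"accuracy A: {1 - errors_a.mean():.3f}, accuracy B: {1 - errors_b.mean():.3f}")
print(f"expected cost A: {costs_a.mean():.2f}, expected cost B: {costs_b.mean():.2f}")
# B wins on the headline accuracy metric while being far worse where it matters.
```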
"How" matters
I've been arguing in this blog for the better part of a decade now that deep learning systems don't share the same perception mechanisms as humans [see e.g. 1]. Being right for the wrong reason is a really dangerous proposition, and deep learning has mastered, beyond any expectations, the art of being right for the (potentially) wrong reasons.
Arguably it is all a little bit more subtle than that. When we discover the world with our cognition, we too fall for correlations and misinterpret causations. But from an evolutionary standpoint, there is a clear advantage to digging deeper into a new phenomenon. Mere correlation is a bit like a first-order approximation of something, but if we are in a position to get higher-order approximations, we spontaneously and without much thinking dig in. If successful, such a pursuit may lead us to discovering the "mechanism" behind something. We remove the shroud of correlation; we now know "how" something works. There is nothing in modern-day machine learning systems that would incentivize them to make that extra step, that transcendence from statistics to dynamics. Deep learning hunts for correlations and couldn't give a damn whether they are spurious or not. Since we optimize averages of fit measures over entire datasets, there could even be a "logical" counter-example debunking a "theory" a machine learning model has built, but it will get voted out by all the supporting evidence.
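A minimal sketch, with made-up numbers, of that "voting out" effect: a single decisive counter-example barely registers in a dataset-averaged loss, so the optimizer has essentially no incentive to revise the model's "theory".

```python
# One refuting example drowned in a sea of supporting evidence.
import numpy as np

losses = np.full(1_000_000, 0.01)   # a million examples the model's "theory" fits well
losses[0] = 100.0                   # one counter-example that flatly contradicts it

print(f"mean loss with the counter-example:    {losses.mean():.6f}")
print(f"mean loss without the counter-example: {losses[1:].mean():.6f}")
# The averaged objective is almost indifferent to the single refuting case.
```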
This of course is in stark contrast to our cognition in which a single counter-example can demolish an entire lifetime of evidence. Our complex environment is full of such asymmetries, which are not reflected in idealized machine learning optimization functions.
Chatbots
And this brings us back to chatbots and their truthfulness. First of all, ascribing to them any intention of lying or being truthful is already a dangerous anthropomorphisation. Truth is a correspondence of language descriptions to some objective properties of reality. Large language models could not care less about reality or any such correspondence. There is no part of their objective function that would encapsulate such relations. Rather, they just want to come up with the next most probable word, conditioned on what has already been written along with the prompt. There is nothing about truth, or relation to reality, here. Nothing. And never will be. There is perhaps a shadow of "truthfulness" reflected in the written text itself, as in perhaps some things that aren't true are not written down nearly as frequently as those that are. And hence the LLM can at least get a whiff of that. But that is an extremely superficial and shallow concept, not to be relied upon. Not to mention that the truthfulness of statements may depend on their broader context, which can easily flip the meaning of any subsequent sentence.
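For the record, this is roughly what the training objective looks like, in a simplified sketch (the tensor shapes and the toy random "model output" below are illustrative, not any particular lab's implementation): the loss only rewards assigning high probability to the next token actually observed in the text. No term anywhere refers to the world the text is about.

```python
# Schematic next-token objective: the entire notion of "good" is this one number.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) model predictions; tokens: (seq_len,) observed token ids."""
    # Predict token t+1 from everything up to token t.
    return F.cross_entropy(logits[:-1], tokens[1:])

# Toy example with random "model" output, just to show the shape of the objective.
vocab_size, seq_len = 50_000, 16
logits = torch.randn(seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (seq_len,))
print(next_token_loss(logits, tokens))
```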
So LLMs don't lie. They are not capable of lying. They are not capable of telling the truth either. They just generate coherent-sounding text which we can then interpret as either truthful or not. This is not a bug. This is totally a feature.
Google search doesn't judge truthfulness either, and shouldn't be used to; it's merely a search based on PageRank. But over the years we've learned to build a model of the reputation of sources. We get our search results, look at them, and decide whether they are trustworthy or not. This could range from the reputation of the site itself, other content on the site, the context of the information, the reputation of whoever posted it, typos, tone of expression, style of writing. GPT ingests all that and mixes it up like a giant information blender. The resulting tasty mush drops all the contextual tips that would help us estimate trustworthiness and, to make matters worse, wraps everything in a convincing, authoritative tone.
Twitter is a terrible source of information about progress in AI
What I did in this blog from the very beginning was to take all the enthusiastic claims about what AI systems can do, try them for myself on new, unseen data, and draw my own conclusions. I asked GPT numerous programming questions, just not the typical run-of-the-mill quiz questions from programming interviews. It failed miserably at almost all of them, from confidently solving a completely different problem to introducing various stupid bugs. I tried it with math and logic.
ChatGPT was terrible, Bing aka GPT4 much better (still a far cry from professional computer algebra systems such as Maple from 20 years ago), but I'm willing to bet GPT4 has been equipped with "undocumented" symbolic plugins that handle a lot of math-related queries (just like the plugins you can now "install", such as WolframAlpha etc.). Gary Marcus, who has been arguing for a merger of the neural with the symbolic, must feel a bit vindicated, though I really think OpenAI and Microsoft should at least give him some credit for being correct. Anyway, bottom line: based on my own experience with GPT and Stable Diffusion, I'm again reminded that Twitter is a terrible source of information about the actual capabilities of those systems. Selection bias and positivity bias are enormous. Examples are totally cherry-picked, and the enthusiasm with which prominent "thought leaders" in this field celebrate these totally biased samples is mesmerizing. People who really should understand the perils of cherry-picking seem to be totally oblivious to it when it serves their agenda.
Prediction as an objective
Going back to LLMs, there is something curious about them that connects them to my own pet project, the predictive vision model: both are self-supervised and rely on predicting the "next in sequence". I think LLMs show just how powerful that paradigm can be. I just don't think language is the right dynamical system to model if we expect real cognition. Language is already a refined, chunked and abstracted shadow of reality. Yes, it inherits some properties of the world within its own rules, but ultimately it is a very distant projection of the real world. I would definitely still like to see that same paradigm applied to vision, preferably on sensor input as raw as it can be.
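For illustration only, here is a rough sketch (not my actual predictive vision model, just the bare shape of the idea) of the same "predict the next in sequence" paradigm applied to raw pixels instead of tokens: a small network trained to predict frame t+1 from frame t, with nothing but the sensor stream itself as supervision.

```python
# Toy next-frame predictor: self-supervision comes from the video stream, no labels anywhere.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

model = NextFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# video: (batch, time, channels, height, width) -- raw frames standing in for a sensor stream.
video = torch.rand(4, 8, 3, 64, 64)
for t in range(video.shape[1] - 1):
    pred = model(video[:, t])
    loss = nn.functional.mse_loss(pred, video[:, t + 1])  # predict the next frame from the current one
    opt.zero_grad()
    loss.backward()
    opt.step()
```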
Broader perspective
Finally I'd like to cover one more thing: we are a good 10 years into the AI gold rush. The popular narrative is that this is a wondrous era, and each new contraption such as ChatGPT is just more proof of the inevitable and rapidly approaching singularity. I never bought it. I don't buy it now either. The whole singularity movement reeks of religion-like narratives and is neither scientific nor rational. But the truth is, we have spent, by conservative estimates, at least 100 billion dollars on this AI frenzy. What did we really get out of it?
Despite massive gaslighting by the handful of remaining companies, self-driving cars are nothing but a very limited, geofenced demo. Tesla FSD is a joke. GPT is great until you realize 50% of its output is a totally manufactured confabulation with zero connection to reality. Stable Diffusion is great, until you actually need to generate a picture composed of parts not seen together before in the training set (I spent hours on Stable Diffusion trying to generate a featured image for this post, until I eventually gave up and made the one you see on top of this page using Pixelmator in roughly 15 minutes). At the end of the day, the most successful applications of AI are in the broad visual effects field [see e.g. https://wonderdynamics.com/ or https://runwayml.com/, which are both quite excellent]. Notably, VFX pipelines are OK with occasional errors, since they can be fixed. But as far as critical, practical applications in the real world go, AI deployment has been nothing but a failure.
With 100B dollars, we could open 10 big nuclear power plants in this country. We could electrify and renovate the completely archaic US rail lines. It would not be enough to turn them into Japanese-style high-speed rail, but it should be sufficient to get US rail lines out of the late nineteenth century in which they are stuck now. We could build a fleet of nuclear-powered cargo ships and revolutionize global shipping. We could build several new cities and a million houses. But we decided to invest in AI that gets us better VFX, a flurry of GPT-based chat apps and creepy-looking illustrations.
I'm really not sure whether in 100 years the current period will be regarded as the amazing second industrial revolution AI apologists love to talk about, or rather as a period of irresponsible exuberance and massive misallocation of capital. Time will tell.