
Meister Muse: On Transformers, People, and True AGI

I've thought for years about what Yann LeCun said… how does it relate to us?


> kanomeister :enter

Transformers are not models capable of AGI on their own.

Yeah, I said it! But hear me out: I have a heavy background in deep networks, so I recently took a lot of time to pick transformers apart, study their math, build them from scratch, and really marinate on the architecture, to understand what it's actually doing and what worldly functions or predictions we engineers can make it learn.

Transformers can't quite do it, but to make AGI, all we really need to do is laser-focus on what makes us, the almighty human, tick¹. I would argue we already have the building blocks to do it!

Let’s take a dive through humans and the current edge.

The Context

So, transformers can represent a lot of great "functions", letting us do (arguably) great things like build responsive chatbots trained on the entire corpus of the internet. I'm currently cooking a draft article that deep-dives exactly what I learned… will update with a link soon.

Models based on this building block can do a great job mimicking more general processes like thinking by simulating them, but that's like saying a scarecrow is a human because it looks like one. It just can't be good enough, because looking the part is the most it'll ever do! No amount of compute we throw at it is going to make it magically work.

Or at least, even if it could somehow perfectly mimic what we do without actually doing it, that would be the most inefficient, brute-force way of getting there.

"But muh money, muh return on investment!" That's not to say lots of compute won't be helpful for crossing the next threshold, and I'm grateful for the attention being funneled this way, because AI as a whole has a ton of potential in many different fields (notwithstanding the societal and cultural issues we will have to overcome) – but we shouldn't need a lot. We're all about doing more with less, and so should this be.

Consider how we work:

  • Humans are continuous learning agents driven by two fundamental goals: survive and reproduce.

In support of that:

  • Humans have a variety of sensory systems that provide real-time feedback about our environment.
  • Humans have a memory and context system for storing events and information (not necessarily perfectly, but good enough).
  • Humans have a “world model” – a system to dynamically update what we know about our environment with what we sense, by corroborating or invalidating the differences.
  • Humans have a biological backbone for saving and passing learned weights across “iterations” of ourselves (genetics), as we have limited lifespans and want to ensure our offspring are equipped for those fundamental goals (though it is quite limited and slow relative to what we’d like today).
  • Humans have a language system that wraps memory, world, and sensors into expressions we can communicate to other humans.

Humans also have this electrochemical system for spontaneous action and impulse, driven heavily by reinforcement learning directly targeting our fundamental goals, called emotion. I won’t focus on that here though, because it’s an emergent property of our evolution that we generally no longer have a heavy reliance on (other than for human-driven “self-evolution” through societal pressures – something many authors like to call “the human condition”).

Language is just one system, integrated with our memory, context, and world model or understanding. Combined with our variety of perception models (our senses), which provide continuous streams of data to integrate in parallel, it allows us to react to changes in our environment and leverage that environment to meet our fundamental goals: survive, and reproduce. (Repetition for emphasis.)
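
As a very rough sketch (nothing here is an existing framework; every name is invented for illustration), the systems listed above could be wired together like this:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: these names just mirror the human
# systems listed above, not any real library's API.

@dataclass
class Agent:
    memory: list = field(default_factory=list)       # events and information
    world_model: dict = field(default_factory=dict)  # current beliefs about the environment

    def sense(self, observation: dict) -> None:
        """Real-time sensory feedback: store it, then reconcile beliefs."""
        self.memory.append(observation)
        for key, value in observation.items():
            # Corroborate or invalidate: new evidence overwrites stale beliefs.
            self.world_model[key] = value

    def express(self) -> str:
        """Language: wrap the world model into something communicable."""
        return ", ".join(f"{k}={v}" for k, v in self.world_model.items())

agent = Agent()
agent.sense({"sky": "clear"})
agent.sense({"sky": "raining"})  # contradicting evidence updates the belief
print(agent.express())  # sky=raining
```

The genetic "weight passing" and the learning loop itself are deliberately omitted; the point is only the shape of the interconnections.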

We didn't always have the vibrant languages we have today for expressive and articulate thought! Language definitely helped us along that path though, leading us to today. I'll walk through why, to paint the picture of what I see.

Biology’s Autograd

Everything we’ve done can be explained by the electrochemical systems we’ve forged over millions of years of evolution to do exactly that. We’re capable of adapting our memory and context model, improving our language model, our thought model, our slower-to-reinforce instinct model, and our sensory models, in response to the feedback we get from the environment and our actions.

Reinforcement works great with an environment, because you cannot learn about one without feedback. Evolution learned that too, as is evident in our psychology and that of many animals.

This is interesting in terms of our memory, and our ability to quickly recall and integrate information relevant to the specific context we're in. When we let go of something, we learn it accelerates downwards. If it's attached to something, we learn it might behave differently.

But then we'll ask, "but why?" And that kicks off an adventure to go figure it out and capture it somehow, in a way we can explain.

We’ve always wanted to “perfectly predict” these to accomplish some other task — say, estimating where something will go when we throw it — and as it happens, we can sometimes do that!

We’ve learned that we can capture information in a single “elegant” descriptor (a formula) that takes inputs (time, start point, velocity, angle) and gives us an output… that with some work (ahem, $\int v(t) dt$), we can transform into an end point. And thus, the expression of math was born!

\[\begin{equation} \begin{aligned} v_x &= v_0\cos{\theta} \\ v_y(t) &= v_0\sin{\theta} - gt \end{aligned} \label{eq:projectile} \end{equation}\]

Eq \eqref{eq:projectile} : Basic pair of equations for throwing a rock at some angle and starting speed.
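
The pair of equations above is enough to compute that "end point". A minimal sketch, assuming a throw from flat ground with no air resistance:

```python
import math

def projectile_endpoint(v0: float, theta: float, g: float = 9.81) -> tuple[float, float]:
    """Flight time and horizontal range for a throw from ground level.

    Integrating v_y(t) = v0*sin(theta) - g*t and solving for when the rock
    returns to y = 0 gives the flight time; the range follows from the
    constant horizontal velocity v_x = v0*cos(theta).
    """
    t_flight = 2 * v0 * math.sin(theta) / g
    x_range = v0 * math.cos(theta) * t_flight
    return t_flight, x_range

t, x = projectile_endpoint(v0=10.0, theta=math.radians(45))
print(f"lands after {t:.2f} s, {x:.2f} m away")  # lands after 1.44 s, 10.19 m away
```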

Interestingly, we innately and dynamically learned to apply that math with some subset of our brains, when we needed to, because it helped us survive: to throw a rock at a deer to get food. We also learned to ask "how can I kill it faster, so I can get more food / use less energy / feed the tribe?".

On "feed the tribe": that stems back towards another core lesson our evolution taught us, which is that it is fundamentally better to be social (or at least interactive) in packs than alone.

To do that, we needed to learn how to convey information to one another, as disconnected brains: primitive expressions that mean "good" or "bad", that signal intent (which we could learn reads as friendly or a threat), that signal status like "I'm good", "I'm hurt", or "I don't know".

We learned that the more efficiently we could communicate with one another, the better we could do that and meet our fundamental goals together. As it turns out, that was the most powerful exponential driver of our recent evolution, as we learned that it was insanely good at improving the availability of energy to survive. Our brains then steadily got bigger through selective mutation to make use of that abundance and make it even easier for us to do that, leading to this exponential cycle.

Creativity: But What If?

But let’s go back to killing the deer, as this leads to my favorite lines of thought.

I want to kill the deer faster and more efficiently. How do I do that…

What if I put the rock in a sling, to throw it harder? How can I better sling this rock? Ooh, tree branches can hit things really hard (ouch) when they release tension, so what if I made a mechanism to sling something really hard when I release it? Things that penetrate skin are good at killing, so what if I also made my rock sharper? Ah, how do I stabilize my flying rock? Maybe I put it on a stick, maybe I…

Even if these questions occur over weeks, months, or even generations: one thing I think has been vital to enabling all of this, is creativity.

The ability to ask “what if?”, and then go explore it.

It is an inherently risky process, one that can lead to the death of the creative individual — but our evolution has learned that allowing enough of a subset of humans to be creative, be risky, and ask + test their “what-ifs” can produce immense results.

Everything we've learned and seen, today and throughout history, is a manifestation of it. Morals are a huge emergent part of it. Our greatest scientific and cultural advances and responses, and our self-evolution technique of rewarding individuals who pull them off (we call it economics), can chiefly be explained by it (spawning evolutionary questions like "how do we maximize our ability to enable creatives to do that?", "how do inherently less creative people learn to be more creative?", etc.).

And you know what’s the most interesting part? That all informs my biggest theory:

Creativity is an emergent property of the intrinsic randomness of our brain's system of models, manifested by the collection of hyperparameters that control our ability to bridge concepts together, produce arbitrary "what-ifs" and their related "that-coulds" informed by what we know (memory and context), and then set goals, take risks, and interact with our environment to prove or invalidate them.

I think our brains can learn how to be more creative, both directly and in proxy to others; and that evolution has proven how valuable it is.

Why It Matters

Great, so after all of that, you may then ask: what are you yapping about, Kane? Why do we care? What does understanding our very core help us accomplish?

If you don’t see it yet, let me bridge the gap and go back to thinking models and transformers trying to be like us, be AGI.

Transformers don’t do most, if any, of that!

They're great at predicting the next sequence for a given input (language), but they don't have an intrinsic memory and context model they can solidly validate and reinforce against. They just operate on bits to predict bits… our improvements do make them better at catching the right probability spaces and functions for that prediction, but the approach is fundamentally flawed by construction, because in their current form they can't capture and reinforce world context.
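
To make "bits to predict bits" concrete, here is the crudest possible next-token predictor, a bigram frequency table (vastly simpler than a transformer, but the same task framing): it only ever reflects the statistics of its training sequence, with no world context to validate against.

```python
from collections import Counter, defaultdict

def train_bigram(tokens: list[str]) -> dict[str, Counter]:
    """Count which token follows which; "training" is just counting."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(model: dict[str, Counter], token: str) -> str:
    """Predict the most frequent continuation seen during training."""
    return model[token].most_common(1)[0][0]

model = train_bigram("the deer ran and the deer ran".split())
print(predict_next(model, "deer"))  # ran
```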

I can concede that making a model second-guess itself gets it closer and reduces the chance of it being incorrect, but it doesn't have a way of knowing whether it is correct or not without direct feedback beyond training. Without being an active, online, self-learning system.

A transformer is just one building block by which we can attempt to learn the intrinsic function(s) that accomplish one (or a subset) of the human tasks above, but even then we don't yet run it in the simultaneous dual learn-and-eval mode that we humans are so used to running in. We're getting closer, but we're not there.
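
That "simultaneous dual learn-and-eval mode" can be illustrated with the simplest possible online learner, a toy sketch fitting y = 3x by streaming gradient descent: each step it is evaluated on live input and immediately updated from the resulting feedback.

```python
# Toy online learner: predict on each sample, then immediately learn from
# the feedback -- evaluation and training happen in the same pass.

def online_fit(stream, lr=0.05):
    w = 0.0
    for x, y in stream:
        y_hat = w * x         # act on live input (eval)
        error = y - y_hat     # feedback from the "environment"
        w += lr * error * x   # update right away (learn)
    return w

# True relationship: y = 3x. The weight converges toward 3 as data streams in.
stream = [(x, 3.0 * x) for x in [1, 2, 1, 3, 2, 1, 2, 3] * 20]
w = online_fit(stream)
print(round(w, 2))  # 3.0
```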

I think that all the investing, entrepreneurship, and scientific pushing in this direction (and in general) can be explained by the evolutionary drivers I mentioned above. I would argue that, at their core, companies like OpenAI exist to dramatically advance our ability to be creative… whether it's actually us doing the creating, or the model able to do it autonomously.

(and then profit off of it, because economic reinforcement learning is what drives modern human self-evolution… see sci-fi media and cyberpunk for how that can go bad — but I digress!)

They’re really trying to find that next “key to infinity”, to achieve the next exponential increase, because it works towards solving the problem that evolution has been looking for the solution to (and taught us to look for) for millions of years. I would argue that the problem itself is asymptotic (we can never “solve” it fully), but that’s yet another essay.

Thus, “AGI” is just the next step in that long path of evolution we’ve been following, looking for a more efficient solution to maximizing progress. AGI could go on to produce its own novel advances in this direction too, leveraging ever-larger quantities of energy more efficiently to solve the very same problems in ways we could possibly and legitimately never have thought of… likely without having to deal with the problems that the architectures of our limited bodies impose (looking at you, emotion and miscommunication).

I just think I see clearly now, as Yann LeCun has implied a bunch², that transformers aren't the entire solution… at best maybe just a component, for that first step, in a true creative, self-learning, self-reinforcing, interactive / real-time / online or "awake" solution.

There certainly can be systems architectures that don’t mirror our own to actually get there (re how creative we are at exploring that), but it’s important to anchor on and tackle the actual core problems we’re trying to solve in capturing that — by looking at ourselves and how we tick, both individually and collectively.

So, what do we do?

This simple statement covers up just how difficult this is because of the data and means by which we teach these models (much in the way we learned how to do it), but I think we work it like this:

  • Make a system of models based on how we work, make them interconnect and operate in (near-)real-time, and make them autonomously pursue a goal by learning, doing, and validating.

A transformer by itself isn't going to solve that. But a transformer (or a few) in a system with memory, context retrieval, environmental reinforcement learning, and world reconciliation… we should find some novel and creative processes emerge as a byproduct (re: recent learnings on how to make models "think"³), which should get us much closer to what we're looking for.
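
A toy sketch of that loop (every name below is hypothetical, and a plain value table stands in for the learned models): propose an action by mostly exploiting the world model with an occasional "what if?", act, take direct environmental feedback, and reconcile beliefs.

```python
import random

random.seed(0)  # reproducible exploration

def run_agent(environment, actions, steps=100, explore=0.2, lr=0.2):
    memory = []                                    # episodic store of (action, reward)
    values = {a: environment(a) for a in actions}  # seed beliefs: try each action once

    for _ in range(steps):
        if random.random() < explore:
            action = random.choice(actions)        # creative "what if?"
        else:
            action = max(values, key=values.get)   # exploit the world model
        reward = environment(action)               # direct environmental feedback
        memory.append((action, reward))
        values[action] += lr * (reward - values[action])  # reconcile belief
    return values, memory

# Toy environment: the sling beats the bare throw.
env = lambda a: 1.0 if a == "sling" else 0.3
values, memory = run_agent(env, ["throw", "sling"])
print(max(values, key=values.get))  # sling
```

With a transformer in the proposer's seat and a real environment supplying the feedback, this is the shape of the system that bullet describes.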

That's what the bleeding edge is pursuing right now – I've made interesting progress on building an adaptive memory module to connect into an online language model… I think it has promise!

We're not as complicated as we think! Evolution has a simple objective, one that naturally brings complexity and scale with it. And I think the solution to what everyone calls "AGI" is not as complicated (or as far away!) as we think, either. The hard part will be doing the actual systems engineering… something already being worked on by teams like those behind JAX⁴ and at NVIDIA⁵.

> kanomeister :exit

  1. Zhang et al., “From Thought to Action: How a Hierarchy of Neural Dynamics Supports Language Production,” Feb. 2025. Accessed: Feb. 09, 2025. [Online]. Available: https://ai.meta.com/research/publications/from-thought-to-action-how-a-hierarchy-of-neural-dynamics-supports-language-production ↩︎

  2. IP Paris, “AI, Science and Society Conference - AI ACTION SUMMIT - DAY 1,” YouTube. Feb. 06, 2025. [Online]. Available: https://www.youtube.com/watch?v=W0QLq4qEmKg?t=30835 ↩︎

  3. DeepSeek-AI et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” Jan. 22, 2025. [Online]. Available: https://arxiv.org/abs/2501.12948 ↩︎

  4. https://jax-ml.github.io/scaling-book ↩︎

  5. https://github.com/NVIDIA/TensorRT-LLM ↩︎

This post is licensed under CC BY-NC-SA 4.0 by the author.