Hacker News

Writing an LLM from scratch, part 13 – attention heads are dumb

351 points by gpjt 10 months ago

70 Comments
The most clarifying post I've read on attention is from Cosma Shalizi[0] who points out that "Attention" is quite literally just a re-discovery/re-invention of Kernel smoothing. Probably less helpful if you don't come from a quantitative background, but if you do it makes it shockingly clarifying.
Once you realize this "Multi-headed Attention" is just kernel smoothing with more kernels and doing some linear transformation on the results of these (in practice: average or add)!
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...

By crystal_revenge 10 months ago
It's a useful realization, too, since ways of approximating kernel functions are already well-studied. Google themselves have been publishing in this area for years, e.g. https://research.google/blog/rethinking-attention-with-perfo...
> To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax-attention). We obtain strong accuracy guarantees for this method while preserving linear space and time complexity, which can also be applied to standalone softmax operations.

By FreakLegion 10 months ago
So do we think that some form of this is what they are using internally to get those long context lengths by already using sub-quadratic architectures [1] in their deployed models?
1: "Context Is The Next Frontier by Jacob Buckman, CEO of Manifest AI" (https://youtu.be/wJyl6kBCwmY?si=ruxMWdENjazu3rp6)

By snthpy 10 months ago
For those who don't know the term "kernel smoothing", it just means
```
    ∑ᵢ yᵢ · K(xᵢ, xₒ) ⁄ (∑ⱼ K(xⱼ, xₒ))
```
In regular attention, we let K(xᵢ, xₒ) = exp(<xᵢ, xₒ>).
Note that in Attention we use K(qᵢ, kₒ) where the q (query) and k (key) vectors are not the same.
Unless you define K(xᵢ, xₒ) = exp(<W_q xᵢ, W_k xₒ>) as you do in self-attention.
There are also some attention mechanisms that don't use the normalization term, (∑ⱼ K(xⱼ, xₒ)), but most do.

By thomasahle 10 months ago
> ∑ᵢ yᵢ · K(xᵢ, xₒ) ⁄ (∑ⱼ K(xⱼ, xₒ))
That clarifies things...

By throwup238 10 months ago
It does, for anybody who studied math at the level you need to understand Attention (some linear algebra). Please no low effort comments, ask if you don't have the math background and people will gladly help. This is sum notation, see https://en.m.wikipedia.org/wiki/Summation

By mrks_hy 10 months ago
No, it doesn't really clarify things. I had the best linear algebra grades in my year at my university, and if you don't know anything about kernels, this is not helpful (what are xi and yi in the first place?).

By auggierose 10 months ago
It's all described in the referenced link. No need for everyone to get antsy.
> 0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...

By lambdasquirrel 10 months ago
In kernel methods the kernel is typically given, and things like positional embeddings, layer normalization, causal masking, and so on are missing. Kernel methods did not take off partly due to their computational complexity (quadratic in sample size), and transforms did precisely because they were parallelizable, and thus computationally efficient, compared with the RNNs and LSTMs that came before them.
Reductions of one architecture to another are usually more enlightening from a theoretical perspective than a practical one.

By esafak 10 months ago
Interesting read. But to me, describing attention as "a weighted mean of a continous learned k-v map lookup" is direct and descriptive, while saying "it's just kernel smoothing" is opaque and referential.

By sooheon 10 months ago
Maybe take the best of both worlds: "It's just kernel smoothing, which is a weighted mean of a continuous learned kv-map lookup."
Now you got the definition, and the alternative naming you can use to search for resources.

By ablob 10 months ago
Wow, thanks for referencing that. What a very detailed and long read!

By 3abiton 10 months ago
So do we think that some form of this is what they are using internally to get those long context lengths by already using sub-quadratic architectures [1] in their deployed models?
1: "Context Is The Next Frontier by Jacob Buckman, CEO of Manifest AI" (https://youtu.be/wJyl6kBCwmY?si=ruxMWdENjazu3rp6)

By snthpy 10 months ago
If you are interested in this sort of thing, you might want to take a look at a very simple neural network with two attention heads that runs right in the browser in pure Javascript, you can view source on this implementation:
https://taonexus.com/mini-transformer-in-js.html
Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.

By logicallee 10 months ago
Regarding this statement about semantic space:
> so long as vectors are roughly the same length, the dot product is an indication of how similar they are.
This potential length difference is the reason "Cosine Similarity" is used instead of dot products for concept comparisons. Cosine similarity is like a 'scale independent dot product', which represents a concept of similarity, independent of "signal strength".
However, if two vectors point in the same direction, but one is 'longer' (higher magnitude) than the other, then what that indicates "semantically" is that the longer vector is indeed a "stronger signal" of the same concept. So if "happy" has a vector direction then "very happy" should be longer vector but in the same direction.
Makes me wonder if there's a way to impose a "corrective" force upon model weights evolution during training so that words like "more" prefixed in front of a string can be guaranteed to encode as a vector multiple of said string? Not sure how that would work with back-propagation, but applying certain common sense knowledge about how the semantic space structures "must be" shaped could potentially be the next frontier of LLM development beyond transformers (and by transformers I really mean the attention heads specialization)

By quantadev 10 months ago
This looks very interesting. The easiest way to navigate to the start of this series of articles seems to be https://www.gilesthomas.com/til-deep-dives/page/2
Now if I only could find some time...

By andrehacker 10 months ago
maybe it renders differently on mobile but this was the first entry for me. you can use the nav at the end to continue to the next part
https://www.gilesthomas.com/2024/12/llm-from-scratch-1

By Tokumei-no-hito 10 months ago
Author here: I endorse this comment ;-) That's definitely the route I've optimised for for reading the series.

By gpjt 10 months ago
https://news.ycombinator.com/from?site=gilesthomas.com

By sitkack 10 months ago
Too bad the book seems to be using Python and some external library like tiktokens just from chapter 2, meaning that it'll basically stop working next week or so, like everything Python, making the whole thing much harder to follow in the future.
Meanwhile i learned the basics of machine learning and (mainly) neural networks from a book written in 1997[0] - which i read last year[1]. It barely had any code and that code was written in C, meaning it'd still more or less work (though i didn't had to try it since the book descriptions were fine on their own).
Now, Python was supposedly designed to look kinda like pseudocode, so using it for a book could be fine, but at least it should avoid relying on external libraries that do not come with the language itself - and preferably stick to stuff that have equivalent to other languages too.
[0] https://www.cs.cmu.edu/~tom/mlbook.html
[1] which is why i make this comment (and to address the apparent downvotes): even if i get the book now i might end up reading it in 3-4 years. Stuff not working will be a major obstacle. If the book is good, it might end up been recommended by people 2-3 years from now and some people may end up getting it and/or reading it even later in time. So it is important for the book to be self-contained, at least when it comes to books that try to teach the theory/ideas behind things.

By badsectoracula 10 months ago
Myeah, C and C++ have the advantage that the compilers support compile for old versions of the language. The languages are in much flux partly because of security problems, partly because features are added from other languages. That means that linking to external libraries using the older language version will fail unless you keep the old version around simply because the maintainer of the external library DID upgrade.
Python is not popular in ML because it is a great language but because of the ecosystem: numpy, pandas, pytorch and everything built on those allows you to do the higher level ML coding without having to reinvent efficient matrix operations for a given hardware infrastructure.

By andrehacker 10 months ago
>Python is not popular in ML because it is a great language but because of the ecosystem: numpy, pandas, pytorch and everything built on those allows you to do the higher level ML coding without having to reinvent efficient matrix operations for a given hardware infrastructure.
Ecosystems don't poof into existence. There are reasons people chose to write those libraries, sometimes partly or wholly in other languages for python in the first place.
It's not like python was older than or a more prominent language than say C when those libraries began.

By famouswaffles 10 months ago
(i assume with "The languages are in much flux" you meant python and not c/c++ because these aren't in flux)
Yeah i get why Python is currently used[0] and for a theory-focused book Python would still work to outline the algorithms - worst case you boot up an old version of Python in Docker or a VM, but it'd still require using only what is available out of the box in Python. And depending on how the book is written, it may not even be necessary.
That said there are other alternatives nowadays and when trying to learn the theory you may not need to use the most efficient stuff. Using C, C++, Go, Java, C# or whatever other language with a decent backwards compatibility track record (so that it can work in 5-10 years) should be possible and all of these should have some small (if not necessarily uberefficient) library for the calculations you may want to do that you can distribute alongside the book for those who want to try the code out.
[0] even if i wish people would stick on using it only for the testing/experimentation phase and move to something more robust and future proof for stuff meant to be used by others

By badsectoracula 10 months ago
"The languages are in much flux" you meant python and not c/c++ because these aren't in flux
No I meant C++.
2011 14882:2011[44] C++11
2014 14882:2014[45] C++14
2017 14882:2017[46] C++17
2020 14882:2020[47] C++20
2024 14882:2024[17] C++23
That is 4 major language changes in 10 years.
As a S/W manager in an enterprise context having to coordinate upgrades of multi-million LOC codebases for mandated security compliance, C++ is not the silver bullet in handling the version problem that exists in every eco system.
As said, the compilers/linkers allow you to run in compatibility mode so as long as you don't care about the new features (and the company you work for doesn't) then, yes, C/C++ is easier for managing legacy code.

By andrehacker 10 months ago
These are new features. Many of them are part of the library not the language. Generally speaking what you do is enable the new features in your compiler, you don't need to disable that to compile old code. It's not a problem to work on legacy code and use new features for new code either.

By YZF 10 months ago
I can show you a 2 year old python program that will fail to work on an existing version of python.
Can you do that with a gcc version?

By guappa 10 months ago
AFAIK none of this cause existing code to not work.

By badsectoracula 10 months ago
Scott Myers, author of the extremely popular Effective C++ explainers, gave up on the language because of its churn.
https://www.youtube.com/watch?v=KAWA1DuvCnQ

By AlexCoventry 10 months ago
C++ has issues with becoming more bloated and complicated but it does not have issues with having existing code breaking due to these changes (aside perhaps some very edgy edge cases).

By badsectoracula 10 months ago
This video is eleven years old. Contemporary C++ is very nice language, the only thing that I miss is a standalone REPL.
But, I think, current code editors can be viewed as same thing.

By thesz 10 months ago
> That means that linking to external libraries using the older language version will fail unless you keep the old version around simply because the maintainer of the external library DID upgrade.
This just isn’t true. C ABIs has not seen any change with the updated standards and while C++ doesn’t have a stable ABI boundary you shouldn’t have any problem calling older binary interfaces from new code (or new binary interfaces from old code provided you’re not using some new types that just aren’t available). That’s because the standard library authors themselves do strive to guarantee ABI comparability (or at least libc++ and stdlibc++ - I’m not as confident about MSVC but I have to believe this is generally true there too). Indeed the last ABI breakage in c++ was on Linux in C++11 15 years ago because of changes to std::string.

By vlovich123 10 months ago
not sure if rage bait or serious, but: have you ever heard of conda or virtual environment?

By y42 10 months ago
These do not solve fundamental issues as mentioned elsewhere.
Hell, just a few weeks ago i wanted to try out some tool someone made in Python (available as just the Python code specific to that project off github) - but that was broken because there was some minor compatibility breakage between the version the author used and the current one. The breakage wasn't in the tool itself but in some dependency 3-4 layers down (i.e. some dependency of a dependency of a dependency). Meanwhile the exact version of Python used wasn't available in the distro packages.
Eventually i solved it by downloading the exact Python version source code inside a minimal Debian docker container, compiling it and installing the rest via pip (meaning the only thing i changed was the Python version, not anything else).

By badsectoracula 10 months ago
Neither conda nor venv really solve the underlying problem.
Most ml project authors only supply a "top-level" requirements file, that is, one that only contains the list of dependencies directly required by the project, not a full graph of both direct and indirect dependencies.
Those dependencies are often unversioned. This means pip will fetch their latest versions, which might be incompatible with what the project actually requires, e.g. numpy 2 instead of the numpy 1.x that the project was actually developed with.
Even if the dependencies are versioned correctly, it's likely that one of those dependencies has its own improperly versioned dependency. Maybe the project doesn't depend on numpy directly, but it depends on foo 0.8.14, which depends on bar 3.11.4, which is sloppy and depends on improperly numpy, which will resolve to 2.x nowadays even though it actually needs 1.x.
You can `pip freeze`, but that doesn't always work, as the graph of required dependencies might differ across platforms and Python versions. It also has the problem of not distinguishing between the actual project constraints and those mechanistically added by pip.
Then there's the fact that pip installs dependencies one-by-one instead of looking at the whole list and figuring out what versions will work together. If you need both foo and bar, where foo needs numpy 1.x and the latest version of bar needs >2.0, you will get >2.0 and a broken foo, instead of a working foo an an older working bar that still worked with numpy 1. This is actually one problem that conda solves, at a serious performance cost.
Not to mention the fact that requirements files often fall "out of sync" with what is actually there. It is too easy to follow the suggestion in some exception and just `pip install` a missing package, forgetting to add it to requirements.txt.
Then there's the python version debacle, people often don't specify the version of Python needed, which sometimes just makes the requirements uninstallable (or, even worse, subtly broken).
The solution to all these problems is using uv. It keeps a package lock of the exact set of working versions (which is always resolved for multiple platforms), and which is separate from the version constraints imposed by the project author. It resolves packages all at once without Conda's performance problems, Rust probably helps here. It encourages commands like `uv add`, which install a package and track it as a dependency in a single step. It even tracks Python versions, fetching the exact version needed when necessary.

By miki123211 10 months ago
Those are decent options but you can still run into really ugly issues if you try to go back too far in time. An example I ran into in the last year or two was a Python library that linked against the system OpenSSL. A chain of dependencies ultimately required a super old version of this library and it failed to compile against the current system OpenSSL. Had to use virtualenv inside a Docker container that was based on either Ubuntu 18.04 or 20.04 to get it all to work.

By tonyarkles 10 months ago
Wouldn't this be an issue with C too? Or anything that links against an external library?

By johnmaguire 10 months ago
To a certain extent. The big difference being how difficult it would be to work around or update. If my C code is linking against a library whose API has changed, I can update my code to use the new API so that it builds. In the case I ran into… it wasn’t one of my dependencies that failed to build, it was a dependency’s dependency’s dependency with no clear road out of the mess.
I could have done a whole series of pip install attempts to try to figure out the minimal version bump on my direct dependency that would use a version of the sub-dep that would compile and then adapt my Python code to use that version instead of the version it was pinned to originally.

By tonyarkles 10 months ago
[deleted]

By 10 months ago
Must be my ignorance but everytime I see explainers for LLMs similar to the post, it’s hard to believe that AGI is upon us. It just doesn’t feel that “intelligent” but again might just be my ignorance.

By westoque 10 months ago
It's never going to be AGI, because we're still stuck in the static weights era.
Just because it is theoretically possible to scale your way through sheer brute force alone using a trillion times the compute doesn't mean that you can't come up with a better compute scaling architecture that uses less energy.
It's the same as having a turing machine with one tape vs multiple tapes. In theory it changes nothing, in practice having even the simplest algorithms be quadratic is a huge drag.
The problem with previous AI approaches is that humans wanted to make use of their domain expertise and ended up anthropomorphizing the ML models, which resulted in them being overtaken by people who invested little in domain expertise and more into compute scaling. The quintessential bitter lesson. With the advent of the bitter lesson, people who don't understand anything at all except the concept "bigger is better" arrived, and they think that they can wring out blood from a stone. The problem they run into is that they are trying to get something out of compute scaling that you can't get out of compute scaling.
What they want to do is satisfy a problem definition using an architecture that is designed to solve a completely different problem definition. The AGI compute scaling crowd wants something that is capable of responding and learning through experience, out of something that is inherently designed and punished to not learn through experience. The key aspect "continual learning" does not rely on domain knowledge. It is a compute scaling paradigm, but it's not the same compute scaling paradigm that static weights represent. You can't bet on donkeys in a horse race and expect to win, but since everyone is bringing donkeys to the race it sure looks like you can.
My personal bet is that we will use self referential matrices and other meta learning strategies. The days of hand tuning learning rates to produce pre-baked weights should be over by the end of the decade.

By imtringued 10 months ago
[deleted]

By 10 months ago
Because LLMs successfully emulate a subset of our brain's functions: memory and imagination (the generative/mixing function). What's missing is our brain's ability to validate the generative output against a model of the environment described by memory and output (the real world), which is built on sensory input. In short, we have a concept of true/false, LLMs don't.

By hliyan 10 months ago
LLMs emulate language by following intricate links between tokens. This is not meant to emulate memory or imagination, just transforming a list of tokens into another list of tokens, generating language. And language is a huge part of the intelligence puzzle so it looks smart to people despite being quite mechanical.
A next step could be to create a mind, with a piece that works similar to the paretial lobe to give it a sense of self or temporal existence.

By nurettin 10 months ago
> it looks smart to people despite being quite mechanical
Note that brains themselves are also "quite mechanical", as is any physical system or piece of software. "Looks smart", in the limit, reduces to "is smart".

By dTal 10 months ago
Brains themselves have a lot more mechanisms to cause emergent behavior what with all the adaptive organic layers so I can't really compare the two 1-1.

By nurettin 10 months ago
eh, transformers are universal differentiable layered hash tables. that's incredibly powerful. most logic is just pulling symbols and matching structures with "hash"es.
if intelligence is just reasonable manipulation of logic it's unsurprising that an LLM could be intelligent, what maybe is surprising is that we have ~intelligence without going up a few more orders of magnitude in size, what's possibly more surprising is that training it on the internet got it doing the things it's doing

By throwawaymaths 10 months ago
[flagged]

By almostgotcaught 10 months ago
> I love these word-salad exaltations of LLMs - people don't even know what they're writing half the time but it sure is provocative!
Well FWIW I have done an implementation of Llama LLM (3 8-70b to be specific) in non-python so I do sort of know what I'm talking about.
I'm not the originator of the hash table analogy. I got it from here:
https://www.youtube.com/watch?v=iDulhoQ2pro
Hell, the vectors that are generated are called K/V, so. Yeah it's a hash table.
And the idea that first order logic and other facets of intelligence can just be cleverly arranged lookup tables comes from here:
https://en.wikipedia.org/wiki/Fluid_Concepts_and_Creative_An...

By throwawaymaths 10 months ago
As the author of the original post above, let me say that if that's word salad, it's a Michelin star salad. Just the right mix of lettuce and tomato, and the dressing is spot on :-)
Seriously, though, differentiable hash tables is an awesome way to look at them, I wish I'd heard it before.

By gpjt 10 months ago
Neurons are pretty simple too.
Any arbitrarily complex system must be made of simpler components, recursively down to arbitrary levels of simplicity. If you zoom in enough everything is dumb.

By jlawson 10 months ago
The deeper you break things down, the dumber they seem. But maybe that dumbness is just an illusion of the observer's perspective.
Consciousness isn’t in the neurons themselves—it's in the invisible coordination and tension between them.

By Zorass 10 months ago
Anything is simple if you approximate it to an adimensional point and ignore all the complexities that make it different from that.

By guappa 10 months ago
No, you misunderstood. I am describing [taking part of the whole] not [simplifying the whole] - is that clearer?

By jlawson 10 months ago
Neurons are surprisingly not simple. Vastly more complex than the ultra simplified model in artificial neural networks.

By voidspark 10 months ago
Most of the complexity is incidental to intelligence. It's mostly just the machinery of keeping the cell alive.
Most everything in biology is a clumsy hack accidentally discovered via evolution, and then optimised to death over aeons.
We can sidestep all that mess and extract just the core algorithm that is actually required for intelligence.

By jiggawatts 10 months ago
That is your assumption and it is wrong.
https://grok.com/share/bGVnYWN5_ab498084-58c4-4345-9140-07b5...
Biological Neuron: Processes information through complex, nonlinear integration of thousands of excitatory and inhibitory inputs across dendritic trees, producing spiking outputs with rich temporal patterns. It adapts dynamically via synaptic plasticity, neuromodulation, and structural changes, operating in a probabilistic, energy-efficient manner within oscillatory networks.
Artificial Neuron: Performs simple, linear summation of weighted inputs, applies a static activation function, and produces a single scalar output. It lacks temporal dynamics, local plasticity, or neuromodulation, operating deterministically with high computational cost and fixed connectivity.

By voidspark 10 months ago
This is interesting
https://chatgpt.com/share/68219da9-1e78-8007-b083-8a81bfbea2...
"Dendrites can implement non‑linear sub‑units and even logic‑gate‑like behavior before the soma integrates them, whereas the standard artificial neuron uses a plain weighted sum."
"Neurotransmitter diversity (e.g., glutamate, GABA, dopamine) allows different semantics on each connection. An artificial edge conveys only a signed scalar."

By voidspark 10 months ago
Neither are most functions, but locally, at a point, a linear approximation works just fine in practice.

By laichzeit0 10 months ago
https://news.ycombinator.com/item?id=43959553

By voidspark 10 months ago
There are multiple books about this topic now. What are your takes on the alternatives? Why did you choose this one? Appreciate your thoughts!

By theyinwhy 10 months ago
It is regarded to be "the best" book on the topic by many. I found just like what Giles Thomas wrote that the book focuses on the details and how to write the lower level code without providing the big picture.
I am personally not very interested in that as these details are likely to change rather quickly while the principles of LLMs and transformers will probably remain relevant for many years.
I have been looking for, but failed, to find a good resource that approaches it the way 3blue1brown [1] explains it but then go deeper from there.
The blog series from Giles seem to take the book and add the background to the details.
[1] https://m.youtube.com/watch?v=wjZofJX0v4M

By andrehacker 10 months ago
I'm reading through the book the blog mentions right now and building a small LLM. I'm only on chapter 2, but so far it's helped clarify a lot of things about LLMs and break it down into small steps. Highly recommend Building a large language model from scratch

By darrelld 10 months ago
I do wonder if it is in the book authors interest if some people blog and summarize the whole books content? Or even more interesting: Would it be fine if I let an LLM summarize a book and create such a series of blog posts?

By buster 10 months ago
Author of the post here -- I'm being careful not to do that. My posts are more about filling in the gaps; they're covering the things that aren't mentioned. The book's target audience is, I think, people with a bit more background knowledge about the inner workings of AI than I have, so I'm having to play catch-up a bit.

By gpjt 10 months ago
This is called Fair Use. While you're asking the question, everyone else is doing it.

By metadat 10 months ago
I think there are two layers of the 'why' in machine learning.
When you look at a model architecture it is described as a series of operations that produces the result.
There is a lower level why, which, while being far from easy to show, describes why it is that these algorithms produce the required result. You can show why it's a good idea to use cosine similarity, why cross entropy was chosen to express the measurement. In Transformers you can show that the the Q and K matrices transform the embeddings into spaces that allows different things to be closer, and using that control over the proportion of closeness allows you to make distinctions. This form of why is the explanation usually given in papers. It is possible to methodically show you will get the benefits described from techniques proposed.
The greater Why is much much harder, Harder to identify and harder to prove. the First why can tell you that something works, but it can't really tell you why it works in a way that can inform other techniques.
In the Transformer, the intuition is that the 'Why' is something along the lines of The Q transforms embeddings into an encoding of what information is needed in the embedding to resolve confusion, and that the K transforms embeddings into information to impart. When there's a match between 'What I want to know about' and 'what I know about' the V can be used as 'the things I know' to accumulate the information where it needs to be.
It's easy to see why this is the hard form, Once you get into the higher semantic descriptions of what is happening, it is much harder to prove that this is actually what is happening, or that it gives the benefits you think it might. Maybe Transformers don't work like that. Sometimes semantic relationships appear to be in processes when there is an unobserved quirk of the mathematics that makes the result coincidentally the same.
In a way I think of the maths of it as picking up a many dimentional object in each hand and magically rotating and (linearly) squishing them differently until they look aligned enough to see the relationship I'm looking at and poking those bits towards each other. I can't really think about that and the semantic "what things want to know about" at the same time, even though they are conceptualisations of the same operation.
The advantage of the lower why is that you can show that it works. The advantage of the upper why is that it can enable you to consider other mechanisms that might do the same function. They may be mathematically different but achieve the goal.
To take a much simpler example in computer graphics. There are many ways to draw a circle with simple loops processing mathematically provable descriptions of a circle. The Bressenham Circle drawing algorithm does so with a why that shows why it makes a circle but the "Why do it that way" was informed by a greater understanding of what the task being performed was.

By Lerc 10 months ago
A lot of why’s just don’t make sense to me at low level. It just feels like we need to address the issues in some way, so we make up something and brute force it with gradient descent with large data and enough computational power. It is unknown whether each design choice is a good idea, it will work anyway

By charlieyu1 10 months ago
I personally don't really see the point in giving meaning to the Q, K, V parts. It doesn't actually matter what Q, K, V do, it's the training algorithms' job to assign it a role automatically. It makes more sense to think of it as modeling capacity or representational space.
One of the biggest things people don't understand about machine learning is that there is a lot of information in the model that is only relevant to the training phase. It's similar to having test points for your probes on a production PCB or trace/debug logging that is disabled in production. This means that you could come up with an explanation that makes sense at training time, but is actually completely irrelevant at inference time.
For example, what you really want from attention is the pairing of all vectors in Q with all vectors in K. Why? Not necessarily because you need it for inference. It's so that you don't have to know or predict where the gradient will propagate in advance when designing your architecture. There are a lot of sparse attention variants that only really apply to inference time. They show you that transformers are doing a lot of redundant and unnecessary work, but you can only really know that after you're done training.
There is a pattern in LSTMs and Mamba models that is called gating [0], which in my opinion is a huge misnomer. Gating implies that the gate selectively stops the flow of information. What they really mean by it is element-wise multiplication.
If you look at this from the concept of model capacity, then what additional concepts and things does this multiplication allow you to represent? LSTMs are really good at modelling dynamical systems. Why is that the case? It's actually quite simple. Given a discretized linear dynamical system x_next = Ax + Bu, you run into a problem: You can only model linear systems.
So, assuming we only had matrix multiplications with activation functions, we would be limited to modeling linear dynamical systems. The key problem is that the matrices in the neural network can model A and B of the dynamical system, but they cannot have a time varying A and B. You could add additional parameters to x that contain the time varying parameters of A, but you will not be able to use these parameters to model non-linearity effectively.
Your linearization of a function f(x) might be written as g(x) = f(x_0) + f'(x_0)x = a_0 + a_1 * x. The problem is very apparent, while you can add an additional parameter to modify a_0, you can never modify a_1, since a_1 is baked into the matrix and multiplied with your x. What you want is a function like this h(x) = f(x_0) + f'(x_0)x = (a_0+m_0) + (a_1+m_1) * x, where m is a modifier value in x. In other words, we need the model to represent the multiplication m_1 * x and it turns out that this is exactly what the gating mechanism in LSTM models does.
This might look like a small victory, but it is actually enough of a victory to essentially model most non-linear behavior. You can now model the derivative in the hidden state, which also means you can model the derivative of the derivative, or the derivative of the derivative of the derivative. Of course it's still going to be an approximation, but a pretty good one if you ask me.
[0] https://en.wikipedia.org/wiki/Gating_mechanism

By imtringued 10 months ago
>I personally don't really see the point in giving meaning to the Q, K, V parts. It doesn't actually matter what Q, K, V do, it's the training algorithms' job to assign it a role automatically.
I was under the impression that the names Q K and V were historical more than anything. There is a definite sense of information flowing from the K to the Q because the V going to the next layer Q comes from the same index as the K.
I agree that it's up to the training to assign the role for the components, but there is still value in investigating the roles that are assigned. The higher level insights you can gather can lead to entirely different mechanisms to perform those roles.
That's very much what most model architectures are, efficiency guides. A multi layer perceptron with an input width the size of context_window*token_size would be capable of assigning rolls better than anything else, but at the cost of being both unfeasibly slow and unfeasibly large.
I'm a little surprised that there isn't a tiny model that generates a V on demand when it is accumulated with the attention weights, A little model that takes the Q and K values and the embeddings that they were generated from. That way when there is a partial match between the Q and K causing a decent attention value it can use the information of what parts of Q and K match to decide what V information is appropriate to pass on. It would be slower, and caching seems to be going in the other direction, but it seems like there is information that should be significant there that just isn't being used.

By Lerc 10 months ago
Off topic rant: I hate blog posts which quote the author's earlier posts. They should just reiterate if it is important or use a link if not. Otherwise it feels like they want to fill some space without any extra work. The old posts are not that groundbreaking, I assure you. /rant

By bornfreddy 10 months ago