<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Attention-Encodings on This Might Be Something!</title><link>https://notes.danavery.com/tags/attention-encodings/</link><description>Recent content in Attention-Encodings on This Might Be Something!</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 06 May 2025 12:45:57 -0700</lastBuildDate><atom:link href="https://notes.danavery.com/tags/attention-encodings/index.xml" rel="self" type="application/rss+xml"/><item><title>One Double RoBERTa, with a Side of Strange</title><link>https://notes.danavery.com/posts/2025-05-06-double-roberta/</link><pubDate>Tue, 06 May 2025 12:45:57 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-05-06-double-roberta/</guid><description>Token encoding patterns in roberta-large are harder to spot than in roberta-base. Same architecture, more capacity, very different dynamics.</description></item><item><title>What's Going On at Layer 5?</title><link>https://notes.danavery.com/posts/2025-04-29-layer-5/</link><pubDate>Tue, 29 Apr 2025 13:25:15 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-04-29-layer-5/</guid><description>In the RoBERTa transformer model, there&amp;#39;s a huge jump in token encodings at attention layer 5. Which tokens are affected, and why?</description></item><item><title>More Attention to Attention</title><link>https://notes.danavery.com/posts/2025-04-22-more-attention/</link><pubDate>Tue, 22 Apr 2025 10:51:26 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-04-22-more-attention/</guid><description>How much does a token&amp;#39;s meaning shift across transformer layers? 
I compared three metrics and found that each one shows a different part of the picture.</description></item><item><title>Paying Attention Part 2</title><link>https://notes.danavery.com/posts/2024-11-08-paying-attention-part-2/</link><pubDate>Fri, 08 Nov 2024 11:59:45 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-11-08-paying-attention-part-2/</guid><description>&lt;p&gt;A greatly-revised attention-encodings project is out and available &lt;a href="http://demo.danavery.com/attention-encodings"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized that in the previous incarnation of this project, I was comparing per-head attention outputs to value-encodings over the entire model vocabulary, which was a clever idea. The problem is that I forgot about the residual connection around the attention mechanism, so the project was essentially comparing the residual &amp;ldquo;deltas&amp;rdquo; against the original encodings. That might make some sense if you were just wondering how &amp;ldquo;far&amp;rdquo; an encoding moves at each layer.&lt;/p&gt;</description></item><item><title>Paying Attention to Attention</title><link>https://notes.danavery.com/posts/2024-10-31-paying-attention/</link><pubDate>Thu, 31 Oct 2024 10:15:13 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-10-31-paying-attention/</guid><description>&lt;p&gt;Fresh off the last experiment, I&amp;rsquo;ve decided to go back to an old one and dig a little deeper.&lt;/p&gt;
&lt;p&gt;A while ago I was working on &lt;a href="https://demo.danavery.com/attention-encodings"&gt;an experiment&lt;/a&gt; trying to figure out what attention layers are actually doing in transformers.&lt;/p&gt;
&lt;p&gt;The traditional introduction to transformers goes something like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You turn your text into tokens&lt;/li&gt;
&lt;li&gt;You create embeddings for each of those tokens&lt;/li&gt;
&lt;li&gt;Some additions and projections later, that token encoding goes into an attention layer, at a particular spot in the sequence based on the order of the text. Let&amp;rsquo;s assume our token is at position 3.&lt;/li&gt;
&lt;li&gt;The attention layer compares that token encoding to all the others in the sequence (queries and keys, yes yes), and uses their normalized similarities as weights applied to the original embeddings in the sequence (projected by a value matrix). (This won&amp;rsquo;t make sense if you don&amp;rsquo;t already know how transformers work, for which I apologize.)&lt;/li&gt;
&lt;li&gt;The output of that attention layer at position 3 is the original token, but with information from closely-related tokens mashed into it. So it&amp;rsquo;s a wider, more conceptually complex representation of the original token at position 3.&lt;/li&gt;
&lt;li&gt;Do this 12 times with some standard fully-connected neural networks in between, and you end up with a new sequence where each encoding is a representation of that original token, but with &amp;ldquo;meaning&amp;rdquo; infused based on the overall context of every other token in the sequence.&lt;/li&gt;
&lt;li&gt;Use those fancy high-class output encodings to predict the next token, or classify your sequence, or whatever.&lt;/li&gt;
&lt;/ul&gt;
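&lt;p&gt;The attention step in the list above can be sketched in a few lines of NumPy. This is a toy single-head example with made-up dimensions and random weights, not RoBERTa itself, which uses multiple heads, learned projections, and an output projection:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical toy dimensions: 4 tokens, model width 8
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))    # one encoding per token
W_q = rng.standard_normal((d_model, d_model))  # query projection
W_k = rng.standard_normal((d_model, d_model))  # key projection
W_v = rng.standard_normal((d_model, d_model))  # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v
# how similar is each token to every other token, as a probability row
weights = softmax(Q @ K.T / np.sqrt(d_model))
# each output row is a weighted mix of the value-projected encodings
out = weights @ V

# row 3 of `out` is the encoding at position 3, blended with its context
print(out[3].shape)  # (8,)
```

&lt;p&gt;Each row of &lt;code&gt;out&lt;/code&gt; is the corresponding input row mixed with the rest of the sequence, which is the &amp;ldquo;wider, more conceptually complex representation&amp;rdquo; described above.&lt;/p&gt;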
&lt;p&gt;But that &amp;ldquo;output of the attention layer is another form of the original token&amp;rdquo; has always been interesting to me. What does an encoding like that look like? Can you do something else with it? Does it relate back to the original token in an interesting way?&lt;/p&gt;</description></item></channel></rss>