One Double RoBERTa, with a Side of Strange

[The full demo tool, with the new model selection dropdown, is available for playing with here.] [Previous posts here and here] I added a model switcher to the attention-encoding demo: now you can toggle between roberta-base and roberta-large to compare how token representations evolve layer by layer. The differences:

- roberta-base: 12 layers, hidden size 768, 12 attention heads
- roberta-large: 24 layers, hidden size 1024, 16 attention heads

So let’s try some of the previous roberta-base plots in roberta-large and see if large follows the same patterns. ...
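For anyone who wants to poke at the same comparison outside the demo, here is a minimal sketch using the Hugging Face transformers library to pull per-layer hidden states from both checkpoints. This is my own illustration, not the demo's actual code.

```python
# Sketch: load both checkpoints and grab per-layer hidden states
# via Hugging Face transformers (assumption: not the demo's code).
import torch
from transformers import AutoModel, AutoTokenizer

for name in ["roberta-base", "roberta-large"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)
    inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each
    # (batch, seq_len, hidden_size); index 0 is the embedding layer.
    print(name, len(outputs.hidden_states) - 1, "layers,",
          outputs.hidden_states[-1].shape[-1], "hidden size")
```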

May 6, 2025 · Dan Avery

What's Going On at Layer 5?

[Project demo page] [Previous post] Continuing on from the last post, you might have had the same question I did: What the heck, layer 5? Let’s go back to the relative ranking plot. This chart shows how many original vocabulary token embeddings are closer (by cosine distance) to a token’s current encoding at each layer than the token’s own original embedding is. So if the current token encoding at position 1 is closest to its own original embedding, it has a rank of 1. If 100 other vocab embeddings are closer, the rank is 101. ...
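Concretely, the ranking metric described above might look something like this sketch (the function and tensor names are hypothetical, not the demo's actual code):

```python
# Relative-ranking sketch: for one token, count how many vocabulary
# embeddings sit closer (by cosine) to its layer-L encoding than the
# token's own original embedding does. Names here are assumptions.
import torch
import torch.nn.functional as F

def relative_rank(encoding: torch.Tensor,     # (d,) current encoding
                  vocab_emb: torch.Tensor,    # (V, d) original vocab embeddings
                  token_id: int) -> int:
    # Cosine similarity between the encoding and every vocab embedding.
    sims = F.cosine_similarity(encoding.unsqueeze(0), vocab_emb, dim=-1)
    own_sim = sims[token_id]
    # Rank 1 means nothing beats the token's own original embedding.
    return int((sims > own_sim).sum().item()) + 1
```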

April 29, 2025 · Dan Avery

More Attention to Attention

[Note: the full demo tool is available for fiddling with here.] A few months ago, I built a demo to visualize how token representations evolve inside the RoBERTa transformer model. Like a lot of people, I had the vague impression that transformers gradually contextualize each token layer by layer, refining its meaning as it flows through the network. But the more I looked, the less that held up. Depending on how I measured representational change (cosine similarity, vocabulary ranking, nearest neighbor overlap), I got completely different stories. This post is a short walkthrough of what I found, and how it challenged my assumptions about how and when transformers “make meaning.” ...
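To make those measurement choices concrete, here is a sketch of two of the lenses mentioned above, applied to two layers' encodings of the same token. Function names and signatures are mine, not the demo's API.

```python
# Two of the three measures of representational change described above
# (cosine similarity and nearest-neighbor overlap); illustrative only.
import torch
import torch.nn.functional as F

def cosine_change(a: torch.Tensor, b: torch.Tensor) -> float:
    # 1.0 means the representation didn't rotate at all between layers.
    return F.cosine_similarity(a, b, dim=-1).item()

def neighbor_overlap(a: torch.Tensor, b: torch.Tensor,
                     vocab_emb: torch.Tensor, k: int = 10) -> float:
    # Fraction of shared top-k nearest vocab embeddings (by cosine).
    sims_a = F.cosine_similarity(a.unsqueeze(0), vocab_emb, dim=-1)
    sims_b = F.cosine_similarity(b.unsqueeze(0), vocab_emb, dim=-1)
    top_a = set(sims_a.topk(k).indices.tolist())
    top_b = set(sims_b.topk(k).indices.tolist())
    return len(top_a & top_b) / k
```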

April 22, 2025 · Dan Avery

Paying Attention Part 2

A greatly revised attention-encodings project is out and available here. I realized that in the previous incarnation of this project I was comparing per-head attention outputs to value-encodings over the entire model vocabulary, which I thought was a clever idea. The problem is that I had forgotten about the residual connection around the attention mechanism. So the project was essentially comparing the residual “deltas” against the original encodings, which might make some sense if you were just wondering how “far” an encoding moves at each layer. ...
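The oversight is easy to see in a toy sketch: the attention block produces only a delta, and the residual add is what the rest of the layer actually sees. Shapes and modules below are illustrative, not RoBERTa's real code.

```python
# Toy illustration of the residual connection I had forgotten about.
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ln = nn.LayerNorm(d_model)

x = torch.randn(1, 5, d_model)   # token encodings entering the layer
delta, _ = attn(x, x, x)         # attention output: only the *delta*
out = ln(x + delta)              # the residual add the old project skipped
```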

November 8, 2024 · Dan Avery

Paying Attention to Attention

Fresh off the last experiment, I’ve decided to go back to an old project and dig a little deeper. A while ago I was trying to figure out what attention layers are actually doing in transformers. The traditional introduction to transformers goes something like this:

1. You turn your text into tokens.
2. You create embeddings for each of those tokens.
3. Some additions and projections later, that token encoding goes into an attention layer, at a particular spot in the sequence based on the order of the text.

Let’s assume our token is at position 3. The attention layer compares that token encoding to all the others in the sequence (queries and keys, yes yes), and uses those similarities as weights to apply to the original embeddings in the sequence (projected by a value matrix). (This won’t make sense if you don’t already know how transformers work, for which I apologize.)

The output of that attention layer at position 3 is the original token, but with information from closely related tokens mashed into it. So it’s a wider, more conceptually complex representation of the original token at position 3. Do this 12 times, with some standard fully-connected neural networks in between, and you end up with a new sequence where each encoding is a representation of its original token, but with “meaning” infused based on the overall context of every other token in the sequence. Use those fancy high-class output encodings to predict the next token, or classify your sequence, or whatever.

But that “the output of the attention layer is another form of the original token” idea has always been interesting to me. What does an encoding like that look like? Can you do something else with it? Does it relate back to the original token in an interesting way? ...
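For readers who want the mechanism in code rather than prose, here is a bare-bones sketch of single-head scaled dot-product attention with toy dimensions; nothing here is RoBERTa's actual implementation.

```python
# Single-head attention sketch matching the description above (toy sizes).
import torch
import torch.nn.functional as F

d = 64
seq = torch.randn(5, d)                        # 5 token encodings
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q, K, V = seq @ W_q, seq @ W_k, seq @ W_v      # queries, keys, values
weights = F.softmax(Q @ K.T / d**0.5, dim=-1)  # each token's similarity to all others
out = weights @ V                              # each row: that token, with related tokens mashed in

print(out[3])  # the new, context-infused representation of the token at position 3
```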

October 31, 2024 · Dan Avery