<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Projects on This Might Be Something!</title><link>https://notes.danavery.com/categories/projects/</link><description>Recent content in Projects on This Might Be Something!</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 15 May 2025 12:27:54 -0700</lastBuildDate><atom:link href="https://notes.danavery.com/categories/projects/index.xml" rel="self" type="application/rss+xml"/><item><title>Kernel Shape in a CNN Audio Model</title><link>https://notes.danavery.com/posts/2025-05-15-kernel-shape/</link><pubDate>Thu, 15 May 2025 12:27:54 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-05-15-kernel-shape/</guid><description>&lt;p&gt;(Code on &lt;a href="https://github.com/danavery/kernel_shapes.git"&gt;GitHub&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Audio has a strong temporal component. Unlike an image, audio is a thing that happens in time, not an arrangement of items in a space. And yet many audio classification models treat spectrograms as if they were still images and not events, an artifact of early successes applying visual models to audio datasets.&lt;/p&gt;
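&lt;p&gt;Nothing forces a convolution kernel to be square, though. In PyTorch, for example (a toy sketch; the frequency-by-time spectrogram layout is an assumption):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn as nn

spec = torch.randn(1, 1, 128, 431)               # (batch, ch, freq_bins, time_frames), shapes made up
square = nn.Conv2d(1, 16, kernel_size=(3, 3))    # the usual image-style kernel
temporal = nn.Conv2d(1, 16, kernel_size=(3, 9))  # 3 freq bins tall, 9 time frames wide
print(square(spec).shape, temporal(spec).shape)
&lt;/code&gt;&lt;/pre&gt;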
&lt;p&gt;I took the ESC-50 dataset, created a simple five-layer CNN model, and trained it with various kernel shapes and sizes. My hypothesis: &lt;strong&gt;kernels that extend more in the temporal dimension will have better performance.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>One Double RoBERTa, with a Side of Strange</title><link>https://notes.danavery.com/posts/2025-05-06-double-roberta/</link><pubDate>Tue, 06 May 2025 12:45:57 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-05-06-double-roberta/</guid><description>Token encoding patterns in roberta-large are harder to spot than in roberta-base. Same architecture, more capacity, very different dynamics.</description></item><item><title>What's Going On at Layer 5?</title><link>https://notes.danavery.com/posts/2025-04-29-layer-5/</link><pubDate>Tue, 29 Apr 2025 13:25:15 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-04-29-layer-5/</guid><description>In the RoBERTa transformer model, there&amp;#39;s a huge jump in how much token encodings change at attention layer 5. Which tokens are affected, and why?</description></item><item><title>More Attention to Attention</title><link>https://notes.danavery.com/posts/2025-04-22-more-attention/</link><pubDate>Tue, 22 Apr 2025 10:51:26 -0700</pubDate><guid>https://notes.danavery.com/posts/2025-04-22-more-attention/</guid><description>How much does a token&amp;#39;s meaning shift across transformer layers? I compared three metrics and found that each one shows a different part of the picture.</description></item><item><title>Paying Attention Part 2</title><link>https://notes.danavery.com/posts/2024-11-08-paying-attention-part-2/</link><pubDate>Fri, 08 Nov 2024 11:59:45 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-11-08-paying-attention-part-2/</guid><description>&lt;p&gt;A greatly-revised attention-encodings project is out and available &lt;a href="http://demo.danavery.com/attention-encodings"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I realized in the previous incarnation of this project I was comparing per-head attention outputs to value-encodings over the entire model vocabulary, which was a clever idea. The problem is that I forgot about the residual connection around the attention mechanism. So the project was essentially comparing the residual &amp;ldquo;deltas&amp;rdquo; against the original encodings, which might make some sense if you were just wondering how &amp;ldquo;far&amp;rdquo; an encoding moves at each layer.&lt;/p&gt;</description></item><item><title>Paying Attention to Attention</title><link>https://notes.danavery.com/posts/2024-10-31-paying-attention/</link><pubDate>Thu, 31 Oct 2024 10:15:13 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-10-31-paying-attention/</guid><description>&lt;p&gt;Fresh off the last experiment, I&amp;rsquo;ve decided to go back to an old experiment and dig a little deeper.&lt;/p&gt;
&lt;p&gt;A while ago I was working on &lt;a href="https://demo.danavery.com/attention-encodings"&gt;an experiment&lt;/a&gt; trying to figure out what attention layers are actually doing in transformers.&lt;/p&gt;
&lt;p&gt;The traditional introduction to transformers goes something like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You turn your text into tokens&lt;/li&gt;
&lt;li&gt;You create embeddings for each of those tokens&lt;/li&gt;
&lt;li&gt;Some additions and projections later, that token encoding goes into an attention layer, at a particular spot in the sequence based on the order of the text. Let&amp;rsquo;s assume our token is at position 3.&lt;/li&gt;
&lt;li&gt;The attention layer compares that token encoding to all the others in the sequence (query and keys, yes yes), and uses their normalized similarities as weights to apply to the original embeddings in the sequence (projected by a value matrix); see the sketch after this list. (This won&amp;rsquo;t make sense if you don&amp;rsquo;t already know how transformers work, for which I apologize.)&lt;/li&gt;
&lt;li&gt;The output of that attention layer at position 3 is the original token, but with information from closely-related tokens mashed into it. So it&amp;rsquo;s a wider, more conceptually complex representation of the original token at position 3.&lt;/li&gt;
&lt;li&gt;Do this 12 times with some standard fully-connected neural networks in-between, and you end up with a new sequence where each encoding is a representation of that original token, but with &amp;ldquo;meaning&amp;rdquo; infused based on the overall context of every other token in the sequence.&lt;/li&gt;
&lt;li&gt;Use those fancy high-class output encodings to predict the next token, or classify your sequence, or whatever.&lt;/li&gt;
&lt;/ul&gt;
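&lt;p&gt;Here&amp;rsquo;s a bare-bones sketch of that per-position story: a single attention head with random stand-in weights, no multi-head plumbing, no masking, nothing trained:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                                # encoding size
x = torch.randn(8, d)                 # a sequence of 8 token encodings
Wq, Wk, Wv = torch.randn(3, d, d)     # stand-in query/key/value projections

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5         # how similar each position is to every other
weights = F.softmax(scores, dim=-1)   # similarities normalized into weights
out = weights @ v                     # each row: that token, with related tokens mashed in
print(out[3].shape)                   # the new encoding at position 3
&lt;/code&gt;&lt;/pre&gt;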
&lt;p&gt;But that &amp;ldquo;output of the attention layer is another form of the original token&amp;rdquo; has always been interesting to me. What does an encoding like that look like? Can you do something else with it? Does it relate back to the original token in an interesting way?&lt;/p&gt;</description></item><item><title>Audio Tokens Part 18: The Wrap-Up</title><link>https://notes.danavery.com/posts/2024-10-07-audio-tokens-part-18/</link><pubDate>Mon, 07 Oct 2024 10:41:02 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-10-07-audio-tokens-part-18/</guid><description>&lt;p&gt;&lt;em&gt;Updated 2025-04-22 with a brief intro for context.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This project explored whether short-time audio features (STFT slices) could be clustered into symbolic “tokens” and modeled using sequence architectures like BERT. It didn’t work out the way I’d hoped, but I figured out a lot about where this kind of approach breaks down, and what might be worth trying next. (Also, I spent a few days chasing phantom performance gains thanks to a classic extend() vs append() bug.)&lt;/p&gt;
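&lt;p&gt;For anyone who hasn&amp;rsquo;t been bitten by that one yet, the mixup in miniature (an illustration, not the actual project code):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;batch_preds = [0.2, 0.9]

collected = []
collected.append(batch_preds)  # [[0.2, 0.9]]: the whole batch as one nested element
collected = []
collected.extend(batch_preds)  # [0.2, 0.9]: the individual values, flattened
&lt;/code&gt;&lt;/pre&gt;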
&lt;p&gt;⸻&lt;/p&gt;</description></item><item><title>Audio Tokens Part 17: All Sane, So Far</title><link>https://notes.danavery.com/posts/2024-10-03-audio-tokens-part-17/</link><pubDate>Thu, 03 Oct 2024 18:39:28 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-10-03-audio-tokens-part-17/</guid><description>&lt;p&gt;Here&amp;rsquo;s my checklist from the last post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Look at a few more generated spectrograms.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do they look sane? Continue.&lt;/li&gt;
&lt;li&gt;Do they look insane? Fix the spectrograms!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;They look sane. Moving on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Try the spectrograms with a standard vanilla CNN of the type that is known to work well on spectrograms.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do the results improve significantly? End this round of this project and move on&amp;ndash;it doesn&amp;rsquo;t work as-is.&lt;/li&gt;
&lt;li&gt;Do the results not improve significantly? Keep going.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;They are roughly the same as all the other models. Peak val mAP I can get is 0.03-ish. Continuing.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 16: Sanity Checks for Everyone!</title><link>https://notes.danavery.com/posts/2024-10-03-audio-tokens-part-16/</link><pubDate>Thu, 03 Oct 2024 10:17:07 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-10-03-audio-tokens-part-16/</guid><description>&lt;p&gt;I&amp;rsquo;ve been on a bit of a tear the past few days (right now at commit 3550615). I separated out the Dataset/DataLoader processing into its own classes, moved metrics calculation into its own class, and did a bit of cleanup refactoring, all so I could start sending the raw STFT vectors into models as embeddings and add a new dirt-simple baseline model.&lt;/p&gt;
&lt;p&gt;All of this to try to figure out if the consistently terrible val mAP results that seem to happen on every variation of model and hyperparameters are just because this idea doesn&amp;rsquo;t work as-is, or if there might be another bug in the preprocessing pipeline mucking things up.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 15: Onward or Not?</title><link>https://notes.danavery.com/posts/2024-09-30-audio-tokens-part-15/</link><pubDate>Mon, 30 Sep 2024 10:48:36 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-30-audio-tokens-part-15/</guid><description>&lt;p&gt;Carrying on from the last post, I tried several different combinations of STFT, LSTM, and clustering parameters. Without going into detail about all the things I tried, we can just skip to the results:&lt;/p&gt;</description></item><item><title>Audio Tokens Part 14: Back to Square 0.01</title><link>https://notes.danavery.com/posts/2024-09-27-audio-tokens-part-14/</link><pubDate>Fri, 27 Sep 2024 10:23:00 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-27-audio-tokens-part-14/</guid><description>&lt;p&gt;For reasons that may be obvious from the previous post, I needed to take a day off. Way too much time spent on a Python append/extend mixup for my taste.&lt;/p&gt;
&lt;p&gt;That bug has been in ModelTrainer from the initial commit. So all the metrics from the past few weeks were invalid. Now that that is fixed, the real, unadulterated results from the current setup are really, really, really not good.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 13: The Big Bug</title><link>https://notes.danavery.com/posts/2024-09-25-audio-tokens-part-13/</link><pubDate>Wed, 25 Sep 2024 10:41:33 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-25-audio-tokens-part-13/</guid><description>&lt;p&gt;Batch clustering loops fixed, so large training sets are doable now.&lt;/p&gt;
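&lt;p&gt;The loop itself isn&amp;rsquo;t shown here, but batched K-Means fitting generally looks something like this (a sketch using scikit-learn&amp;rsquo;s MiniBatchKMeans; the batch counts and shapes are stand-ins, not the project&amp;rsquo;s actual values):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
kmeans = MiniBatchKMeans(n_clusters=50, batch_size=1024, random_state=0)

# Fit incrementally, one batch of STFT-slice vectors at a time,
# instead of loading every slice into memory at once.
for _ in range(20):
    batch = rng.standard_normal((1024, 257))  # stand-in for a batch of STFT slices
    kmeans.partial_fit(batch)

tokens = kmeans.predict(rng.standard_normal((8, 257)))  # cluster ids become the tokens
print(tokens)
&lt;/code&gt;&lt;/pre&gt;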
&lt;p&gt;Let&amp;rsquo;s run ~80,000 samples from the training sets, with ~8,000 as validation:&lt;/p&gt;</description></item><item><title>Audio Tokens Part 12: A Minor Bug Squash</title><link>https://notes.danavery.com/posts/2024-09-24-audio-tokens-part-12/</link><pubDate>Tue, 24 Sep 2024 14:25:29 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-24-audio-tokens-part-12/</guid><description>&lt;p&gt;I decided that rather than find out why things were so much worse before that one commit, I&amp;rsquo;d look for data leakage in the newer, better-performing code.&lt;/p&gt;
&lt;p&gt;And I haven&amp;rsquo;t found any. The train/dev split is straightforward, using Python indexing of a shuffled list of YouTube IDs. The centroids are computed only from training data. And the model training keeps the train and val data separate in two different Dataset classes.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 11: The Hunt</title><link>https://notes.danavery.com/posts/2024-09-23-audio-tokens-part-11/</link><pubDate>Mon, 23 Sep 2024 10:56:55 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-23-audio-tokens-part-11/</guid><description>&lt;p&gt;Today I&amp;rsquo;m hunting down which commit moved my validation mAP from 0.15 to 0.41. One of two things happened here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A large bug was fixed, and performance was improved.&lt;/li&gt;
&lt;li&gt;A large bug was introduced, and the improved performance is a lie.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Time to track that down, so I&amp;rsquo;m going back commit by commit to find where the bump happened. Apparently I did so many small housekeeping commits back there that I just missed the big bump in reported performance. Let&amp;rsquo;s go back and find it:&lt;/p&gt;</description></item><item><title>Audio Tokens Part 10: Ruling Out the Obvious, Leaving Only the Obvious</title><link>https://notes.danavery.com/posts/2024-09-20-audio-tokens-part-10/</link><pubDate>Fri, 20 Sep 2024 11:08:49 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-20-audio-tokens-part-10/</guid><description>&lt;p&gt;I spent most of yesterday scouring the code for train/dev data leakage, without much luck. I also fixed up the 1D pre-processing convolution I thought I had gotten working before, so now it actually works. But back to the weird thing about those high metrics.&lt;/p&gt;
&lt;p&gt;I also got most of the code ready for using more of the AudioSet training data, which mostly meant batching things instead of keeping the entire active training dataset in memory.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 9: What the bug?</title><link>https://notes.danavery.com/posts/2024-09-19-audio-tokens-part-9/</link><pubDate>Thu, 19 Sep 2024 12:04:52 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-19-audio-tokens-part-9/</guid><description>&lt;p&gt;I&amp;rsquo;m going to crack open the AudioSet unbalanced_train set here. Only 20,000 training examples may not be enough. The unbalanced train set has around 2 MILLION examples. (The final eval set in AudioSet is only 20,000 items, which seems like not nearly enough, but then again I&amp;rsquo;m no fancy big-city Google researcher.) Which means I may need to stop keeping everything in memory during preprocessing, now doesn&amp;rsquo;t it?&lt;/p&gt;
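&lt;p&gt;The general shape of that fix is streaming slices in batches rather than building one giant array. A sketch, not the project code (the precomputed-spectrogram file format here is an assumption):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import numpy as np

def stft_slice_batches(spec_paths, batch_size=512):
    """Yield STFT slices in fixed-size batches instead of holding every file in memory."""
    buffer = []
    for path in spec_paths:
        spec = np.load(path)   # a precomputed spectrogram, (freq_bins, frames) assumed
        buffer.extend(spec.T)  # one vector per time slice
        while len(buffer) &amp;gt;= batch_size:
            yield np.stack(buffer[:batch_size])
            buffer = buffer[batch_size:]
    if buffer:
        yield np.stack(buffer)  # whatever is left at the end
&lt;/code&gt;&lt;/pre&gt;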
&lt;p&gt;So I rewrote a bunch of stuff yesterday to allow for a larger training set, consolidate the train/dev split code, and clean up the TokenizedSpecDataset class.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 8: Convoluted</title><link>https://notes.danavery.com/posts/2024-09-17-audio-tokens-part-8/</link><pubDate>Tue, 17 Sep 2024 11:03:29 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-17-audio-tokens-part-8/</guid><description>&lt;p&gt;[Update 19 Sep: Funny thing, this actually didn&amp;rsquo;t work out as planned. I forgot to turn tokenization on for these examples. So the below change looks to be a random blip of some sort, not from adding convolution. Going to try to &lt;em&gt;really&lt;/em&gt; add it later. Also, the comment below about running the same data over and over again accidentally was sort of true.]&lt;/p&gt;
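&lt;p&gt;A rough sketch of the change described in the next paragraph, with hypothetical shapes:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn as nn

# Eight 1D kernels of size 3 slide over each STFT time-slice, and their
# outputs are concatenated into the new K-Means input vector.
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

stft_slice = torch.randn(1, 1, 257)  # (batch, channels, freq_bins); 257 bins assumed
features = conv(stft_slice)          # (1, 8, 257): one row per kernel
kmeans_input = features.flatten(1)   # (1, 8 * 257): all kernel outputs, concatenated
print(kmeans_input.shape)
&lt;/code&gt;&lt;/pre&gt;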
&lt;p&gt;Before switching to major, focus-shifting architecture changes here, I figured I&amp;rsquo;d try something simpler first. From the last update, I&amp;rsquo;m taking the first option: run a basic 1D convolution on the STFT time-slices to try to make them a bit more frequency-invariant. Eight kernels, kernel size 3, and just concatting the outputs together to create the new input for the K-Means clustering. The previous input, recall, was just the entire STFT time-slice vector as-is.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 7: Reconsidering</title><link>https://notes.danavery.com/posts/2024-09-16-audio-tokens-part-7/</link><pubDate>Mon, 16 Sep 2024 11:28:56 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-16-audio-tokens-part-7/</guid><description>&lt;p&gt;Given the current (lack of) performance of this model, I&amp;rsquo;m spending some time rethinking some of the basic ideas. For those of you following along at home, you may have been shouting at the screen since Part 1 trying to get me to see one large flaw in my tokenizing setup.&lt;/p&gt;
&lt;p&gt;Every STFT time slice is a set of particular frequencies at a moment in time. Which would probably work fine if the labels were about, for example, recognizing individual dogs. But our labels here are things like &amp;ldquo;dog barking&amp;rdquo; vs. &amp;ldquo;pneumatic drill,&amp;rdquo; not &amp;ldquo;cute little Fluffy barking&amp;rdquo; vs. &amp;ldquo;Butch barking right before he eats someone&amp;rdquo;. Some dogs have high-frequency barks and some have low-frequency barks, and STFT vectors from those two dog barks would be nowhere near each other in the vector space.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 6: Slightly Less Basic</title><link>https://notes.danavery.com/posts/2024-09-12-audio-tokens-part-6/</link><pubDate>Thu, 12 Sep 2024 10:34:54 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-12-audio-tokens-part-6/</guid><description>&lt;p&gt;In our last episode, I had managed to get a dead-simple model to overfit on the sequences when cranking up the number of tokens in the vocabulary. This probably means one of two things (or something in between):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The audio tokens have some useful information in them that can be generalized.&lt;/li&gt;
&lt;li&gt;The audio tokens have no useful information in them, and the overfitting is just because the model is able to memorize the embedding averages when there are many more embeddings involved.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&amp;rsquo;s bet on the first one for now. Keep the spectrogram and token generation as-is, and try a &lt;em&gt;slightly&lt;/em&gt; more complex model. Say hello to SimpleLSTMClassifier.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 5: Back to Basics</title><link>https://notes.danavery.com/posts/2024-09-11-audio-tokens-part-5/</link><pubDate>Wed, 11 Sep 2024 11:35:29 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-11-audio-tokens-part-5/</guid><description>&lt;p&gt;OK, so things aren&amp;rsquo;t working as well as I&amp;rsquo;d hoped out of the box. Time to try a new box. Or a new metaphor.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m going to try a super-simple network just to see if I can get something to fit.&lt;/p&gt;
&lt;p&gt;Enter SimpleTokenClassifier. Take the tokens, create embeddings, average pool them, shove them through a linear layer and see if anything useful comes out.&lt;/p&gt;
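&lt;p&gt;In sketch form (the dimensions are stand-ins, and the 527 output classes assume AudioSet&amp;rsquo;s label set):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-python" data-lang="python"&gt;import torch
import torch.nn as nn

class SimpleTokenClassifier(nn.Module):
    """Embed audio tokens, average-pool over the sequence, one linear layer out."""

    def __init__(self, vocab_size=50, embed_dim=128, num_classes=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):        # token_ids: (batch, seq_len) ints
        emb = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        pooled = emb.mean(dim=1)         # average pool across the sequence
        return self.classifier(pooled)   # (batch, num_classes) logits

logits = SimpleTokenClassifier()(torch.randint(0, 50, (4, 1000)))
print(logits.shape)  # torch.Size([4, 527])
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;With 50 tokens:&lt;/p&gt;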
&lt;p&gt;&lt;img alt="val mAP maxing out at 0.125" loading="lazy" src="https://notes.danavery.com/posts/2024-09-11-audio-tokens-part-5/mild-morning-val_mAP.png"&gt;&lt;/p&gt;</description></item><item><title>Audio Tokens Part 4: More Tokens!</title><link>https://notes.danavery.com/posts/2024-09-05-audio-tokens-part-4/</link><pubDate>Thu, 05 Sep 2024 16:54:43 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-05-audio-tokens-part-4/</guid><description>&lt;p&gt;Well, more types of tokens, anyway. Instead of a limited vocabulary of 50 tokens, let&amp;rsquo;s try something a little more interesting. Like 1000.
[time passes]
Ran it, here&amp;rsquo;s an excerpt:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-log" data-lang="log"&gt;2024-09-05 18:16:23,258 - INFO - Epoch 9
2024-09-05 18:16:23,258 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0824
2024-09-05 18:16:23,258 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0826
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:40&amp;lt;00:00, 2.22it/s, loss=0.0169]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20&amp;lt;00:00, 7.22it/s]
2024-09-05 18:24:33,739 - INFO - Epoch 10
2024-09-05 18:24:33,739 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0810
2024-09-05 18:24:33,739 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0893
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;That&amp;rsquo;s&amp;hellip;terrible. Let&amp;rsquo;s try bumping the learning rate from 5e-5 to 1e-3 just to show that we&amp;rsquo;re serious.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 3: The First Run</title><link>https://notes.danavery.com/posts/2024-09-05-audio-tokens-part-3/</link><pubDate>Thu, 05 Sep 2024 13:49:40 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-05-audio-tokens-part-3/</guid><description>&lt;p&gt;Time to try the first run at training the model. Let&amp;rsquo;s see what happens!&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-log" data-lang="log"&gt;2024-09-05 13:59:48,289 - INFO - Epoch 1
2024-09-05 13:59:48,289 - INFO - Train Loss: 0.0429, Train F1 (macro): 0.4998, Train F1 (micro): 0.9946, Train Hamming Loss: 0.0054, Train mAP: 0.0106
2024-09-05 13:59:48,289 - INFO - Val Loss: 0.0210, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1074
Training: 100%|████████████████████████████████████| 1022/1022 [07:40&amp;lt;00:00, 2.22it/s, loss=0.0196]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20&amp;lt;00:00, 7.16it/s]
2024-09-05 14:07:58,594 - INFO - Epoch 2
2024-09-05 14:07:58,594 - INFO - Train Loss: 0.0208, Train F1 (macro): 0.5210, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.1135
2024-09-05 14:07:58,594 - INFO - Val Loss: 0.0205, Val F1 (macro): 0.5629, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1144
Training: 100%|████████████████████████████████████| 1022/1022 [07:40&amp;lt;00:00, 2.22it/s, loss=0.0153]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20&amp;lt;00:00, 7.17it/s]
[...]
2024-09-05 15:21:42,568 - INFO - Epoch 11
2024-09-05 15:21:42,568 - INFO - Train Loss: 0.0177, Train F1 (macro): 0.5805, Train F1 (micro): 0.9963, Train Hamming Loss: 0.0037, Train mAP: 0.1851
2024-09-05 15:21:42,568 - INFO - Val Loss: 0.0188, Val F1 (macro): 0.5663, Val F1 (micro): 0.9963, Val Hamming Loss: 0.0037, Val mAP: 0.1548
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:41&amp;lt;00:00, 2.21it/s, loss=0.0215]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20&amp;lt;00:00, 7.15it/s]
[...]
2024-09-05 16:27:17,502 - INFO - Epoch 19
2024-09-05 16:27:17,502 - INFO - Train Loss: 0.0116, Train F1 (macro): 0.6765, Train F1 (micro): 0.9968, Train Hamming Loss: 0.0032, Train mAP: 0.4616
2024-09-05 16:27:17,502 - INFO - Val Loss: 0.0208, Val F1 (macro): 0.5698, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1095
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:41&amp;lt;00:00, 2.21it/s, loss=0.0116]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20&amp;lt;00:00, 7.13it/s]
2024-09-05 16:35:29,332 - INFO - Epoch 20
2024-09-05 16:35:29,332 - INFO - Train Loss: 0.0103, Train F1 (macro): 0.7092, Train F1 (micro): 0.9970, Train Hamming Loss: 0.0030, Train mAP: 0.5503
2024-09-05 16:35:29,332 - INFO - Val Loss: 0.0212, Val F1 (macro): 0.5895, Val F1 (micro): 0.9960, Val Hamming Loss: 0.0040, Val mAP: 0.1201
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;So, um. Reading the mAP values, it starts at bad, increases to slightly less bad, then the overfitting kicks in and training mAP gets excellent and validation mAP drops to bad again. Not great.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 2: The Architecture</title><link>https://notes.danavery.com/posts/2024-09-04-audio-tokens-part-2/</link><pubDate>Wed, 04 Sep 2024 11:11:12 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-04-audio-tokens-part-2/</guid><description>&lt;p&gt;Here&amp;rsquo;s how things are set up now, in preparation for the first real training test:&lt;/p&gt;
&lt;p&gt;(I&amp;rsquo;ve skipped over the &amp;ldquo;is this thing basically working&amp;rdquo; phase here, since it wasn&amp;rsquo;t that exciting. Take it as given that each component seems to be working as expected.)&lt;/p&gt;
&lt;h3 id="training-set"&gt;Training Set&lt;/h3&gt;
&lt;p&gt;I&amp;rsquo;m using the AudioSet &amp;ldquo;bal_train&amp;rdquo; set, since it&amp;rsquo;s only around 20,000 files. AudioSet files are 10-second clips from YouTube videos.&lt;/p&gt;
&lt;h3 id="validation-set"&gt;Validation Set&lt;/h3&gt;
&lt;p&gt;I&amp;rsquo;m using a very sophisticated technique in data_splitter.py to pull a validation set out of the bal_train set. If the associated YouTube ID for an audio clip starts with one of the characters [ABCDEF], it&amp;rsquo;s a validation example. Assuming YouTube IDs are random-ish, this isn&amp;rsquo;t totally terrible.&lt;/p&gt;</description></item><item><title>Audio Tokens Part 1: The Task</title><link>https://notes.danavery.com/posts/2024-09-03-audio-tokens-part-1/</link><pubDate>Tue, 03 Sep 2024 21:02:42 -0700</pubDate><guid>https://notes.danavery.com/posts/2024-09-03-audio-tokens-part-1/</guid><description>&lt;p&gt;[7 Oct 2024 update: There&amp;rsquo;s a lot here in this worklog, dead ends and all. If you&amp;rsquo;re looking for the tl;dr/wrap-up, may I suggest
&lt;a href="../2024-10-07-audio-tokens-part-18"&gt;the final (for now) post&lt;/a&gt;?]&lt;/p&gt;
&lt;p&gt;[23 Sep 2024 update: This is a work log largely for my personal use: &amp;ldquo;Remember kids, the only difference between screwing around and science is writing it down&amp;rdquo;. There are mistakes and dead-ends and obvious oversights and serious facepalm moments. My plan is to leave in all the screwups, not try to retcon things later into a &amp;ldquo;I knew this from the beginning&amp;rdquo; sort of thing. As of today, things are still in progress, and there are ten of these entries. I expect there will be more.]&lt;/p&gt;</description></item></channel></rss>