Audio Tokens Part 8: Convoluted

[Update 19 Sep: Funny thing, this actually didn’t work out as planned. I forgot to turn tokenization on for these examples, so the change below looks to be a random blip of some sort, not the result of adding convolution. Going to try to really add it later. Also, the comment below about accidentally running the same data over and over again was sort of true.] Before switching to major, focus-shifting architecture changes, I figured I’d try something simpler first. From the last update, I’m taking the first option: run a basic 1D convolution on the STFT time slices to try to make them a bit more frequency-invariant. Eight kernels, kernel size 3, and just concatenating the results together to create the new input for the K-Means clustering. The previous input, recall, was just the entire STFT time-slice vector as-is. ...
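For concreteness, here is a rough sketch of what that preprocessing step might look like. This is not the project’s actual code, just an illustration using PyTorch’s Conv1d and scikit-learn’s KMeans; the shapes and cluster count are made up.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Hypothetical input: one row per STFT time slice, e.g. 128 frequency bins.
stft_frames = torch.randn(5000, 128)

# Eight 1D kernels of size 3, slid along the frequency axis of each slice.
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

with torch.no_grad():
    x = stft_frames.unsqueeze(1)        # (frames, 1, bins): each slice as a 1-channel signal
    feats = conv(x)                     # (frames, 8, bins)
    feats = feats.flatten(start_dim=1)  # concatenate the eight filtered copies per slice

# Cluster the new feature vectors to build the token vocabulary, same as before.
kmeans = KMeans(n_clusters=1000, n_init=10).fit(feats.numpy())
token_ids = kmeans.labels_              # one token per STFT time slice
```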

September 17, 2024 · Dan Avery

Audio Tokens Part 7: Reconsidering

Given the current (lack of) performance of this model, I’m spending some time rethinking some of the basic ideas. For those of you following along at home, you may have been shouting at the screen since Part 1, trying to get me to see one large flaw in my tokenizing setup. Every STFT time slice is a set of particular frequencies at a moment in time. That would probably work fine if the labels were about, for example, recognizing individual dogs. But our labels here are things like “dog barking” vs. “pneumatic drill”, not “cute little Fluffy barking” vs. “Butch barking right before he eats someone”. Some dogs have high-frequency barks and some have low-frequency barks, and STFT vectors from those two dog barks would be nowhere near each other in the vector space. ...

September 16, 2024 · Dan Avery

Audio Tokens Part 6: Slightly Less Basic

In our last episode, I had managed to get a dead-simple model to overfit on the sequences when cranking up the number of tokens in the vocabulary. This probably means one of two things (or something in between): either the audio tokens have some useful information in them that can be generalized, or they have no useful information at all and the overfitting is just the model memorizing the embedding averages once there are many more embeddings involved. Let’s bet on the first one for now: keep the spectrogram and token generation as-is, and try a slightly more complex model. Say hello to SimpleLSTMClassifier. ...
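The excerpt names the model but not its internals; as a hedged sketch, “embed the tokens, run an LSTM, classify from the final hidden state” might look roughly like this (layer sizes are guesses; 527 is the AudioSet label count):

```python
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    """Sketch only: embed audio tokens, run an LSTM, classify from the last hidden state."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, num_labels=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (num_layers, batch, hidden_dim)
        return self.classifier(h_n[-1])       # multi-label logits, paired with BCEWithLogitsLoss
```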

September 12, 2024 · Dan Avery

Audio Tokens Part 5: Back to Basics

OK, so things aren’t working as well as I’d hoped out of the box. Time to try a new box. Or a new metaphor. I’m going to try a super-simple network just to see if I can get something to fit. Enter SimpleTokenClassifier. Take the tokens, create embeddings, average pool them, shove them through a linear layer and see if anything useful comes out. With 50 tokens: ...
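That description maps to a very small module. A minimal sketch of SimpleTokenClassifier along those lines, with guessed dimensions rather than the actual ones, could be:

```python
import torch.nn as nn

class SimpleTokenClassifier(nn.Module):
    """Sketch only: embed tokens, average-pool over the sequence, one linear layer."""
    def __init__(self, vocab_size=50, embed_dim=64, num_labels=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_labels)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)         # average pool across the token sequence
        return self.classifier(pooled)        # multi-label logits
```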

September 11, 2024 · Dan Avery

Audio Tokens Part 4: More Tokens!

Well, more types of tokens, anyway. Instead of a limited vocabulary of 50 tokens, let’s try something a little more interesting. Like 1000. [time passes] Ran it, here’s an excerpt:

2024-09-05 18:16:23,258 - INFO - Epoch 9
2024-09-05 18:16:23,258 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0824
2024-09-05 18:16:23,258 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0826
Training: 100%|█████████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0169]
Validating: 100%|█████████████████████████████████████████| 146/146 [00:20<00:00, 7.22it/s]
2024-09-05 18:24:33,739 - INFO - Epoch 10
2024-09-05 18:24:33,739 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0810
2024-09-05 18:24:33,739 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0893

That’s…terrible. Let’s try bumping the learning rate from 5e-5 to 1e-3 just to show that we’re serious. ...

September 5, 2024 · Dan Avery

Audio Tokens Part 3: The First Run

Time to try the first run at training the model. Let’s see what happens!

2024-09-05 13:59:48,289 - INFO - Epoch 1
2024-09-05 13:59:48,289 - INFO - Train Loss: 0.0429, Train F1 (macro): 0.4998, Train F1 (micro): 0.9946, Train Hamming Loss: 0.0054, Train mAP: 0.0106
2024-09-05 13:59:48,289 - INFO - Val Loss: 0.0210, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1074
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0196]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00, 7.16it/s]
2024-09-05 14:07:58,594 - INFO - Epoch 2
2024-09-05 14:07:58,594 - INFO - Train Loss: 0.0208, Train F1 (macro): 0.5210, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.1135
2024-09-05 14:07:58,594 - INFO - Val Loss: 0.0205, Val F1 (macro): 0.5629, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1144
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0153]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00, 7.17it/s]
[...]
2024-09-05 15:21:42,568 - INFO - Epoch 11
2024-09-05 15:21:42,568 - INFO - Train Loss: 0.0177, Train F1 (macro): 0.5805, Train F1 (micro): 0.9963, Train Hamming Loss: 0.0037, Train mAP: 0.1851
2024-09-05 15:21:42,568 - INFO - Val Loss: 0.0188, Val F1 (macro): 0.5663, Val F1 (micro): 0.9963, Val Hamming Loss: 0.0037, Val mAP: 0.1548
Training: 100%|████████████████████████████████████| 1022/1022 [07:41<00:00, 2.21it/s, loss=0.0215]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00, 7.15it/s]
[...]
2024-09-05 16:27:17,502 - INFO - Epoch 19
2024-09-05 16:27:17,502 - INFO - Train Loss: 0.0116, Train F1 (macro): 0.6765, Train F1 (micro): 0.9968, Train Hamming Loss: 0.0032, Train mAP: 0.4616
2024-09-05 16:27:17,502 - INFO - Val Loss: 0.0208, Val F1 (macro): 0.5698, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1095
Training: 100%|████████████████████████████████████| 1022/1022 [07:41<00:00, 2.21it/s, loss=0.0116]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00, 7.13it/s]
2024-09-05 16:35:29,332 - INFO - Epoch 20
2024-09-05 16:35:29,332 - INFO - Train Loss: 0.0103, Train F1 (macro): 0.7092, Train F1 (micro): 0.9970, Train Hamming Loss: 0.0030, Train mAP: 0.5503
2024-09-05 16:35:29,332 - INFO - Val Loss: 0.0212, Val F1 (macro): 0.5895, Val F1 (micro): 0.9960, Val Hamming Loss: 0.0040, Val mAP: 0.1201

So, um. Reading the mAP values: it starts out bad, improves to slightly less bad, then overfitting kicks in and training mAP gets excellent while validation mAP drops back to bad. Not great. ...

September 5, 2024 · Dan Avery

Audio Tokens Part 2: The Architecture

Here’s how things are set up now, in preparation for the first real training test. (I’ve skipped over the “is this thing basically working” phase here, since it wasn’t that exciting. Take it as given that each component seems to be working as expected.)

Training Set
I’m using the AudioSet “bal_train” set, since it’s only around 20,000 files. AudioSet files are 10-second clips from YouTube videos.

Validation Set
I’m using a very sophisticated technique in data_splitter.py to pull a validation set out of the bal_train set. If the associated YouTube ID for an audio clip starts with one of the characters [ABCDEF], it’s a validation example. Assuming YouTube IDs are random-ish, this isn’t totally terrible. ...
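The split rule itself is simple enough to sketch; the function names below are illustrative rather than taken from data_splitter.py, and only the [ABCDEF] prefix rule comes from the post.

```python
def is_validation_example(youtube_id: str) -> bool:
    """Send a clip to the validation set if its YouTube ID starts with A-F."""
    return youtube_id.startswith(tuple("ABCDEF"))

def split_clips(youtube_ids):
    """Partition clip IDs into (train, val) lists using the prefix rule."""
    train, val = [], []
    for yid in youtube_ids:
        (val if is_validation_example(yid) else train).append(yid)
    return train, val

# With random-ish IDs this routes roughly 6 of the 64 possible leading
# characters (about 9% of clips) to validation.
train_ids, val_ids = split_clips(["Abc123xyz_0", "qRst456uvw-", "F00barbaz99"])
```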

September 4, 2024 · Dan Avery

Audio Tokens Part 1: The Task

[7 Oct 2024 update: There’s a lot here in this worklog, dead ends and all. If you’re looking for the tl;dr/wrap-up, may I suggest the final (for now) post?] [23 Sep 2024 update: This is a work log largely for my personal use: “Remember kids, the only difference between screwing around and science is writing it down”. There are mistakes and dead ends and obvious oversights and serious facepalm moments. My plan is to leave in all the screwups, not try to retcon things later into an “I knew this from the beginning” sort of thing. As of today, things are still in progress, and there are ten of these entries. I expect there will be more.] ...

September 3, 2024 · Dan Avery