Audio Tokens Part 6: Slightly Less Basic

In our last episode, I managed to get a dead-simple model to overfit on the sequences once I cranked up the number of tokens in the vocabulary. That probably means one of two things (or something in between): either the audio tokens carry useful information that can generalize, or they carry no useful information at all, and the overfitting just means the model can memorize the embedding averages once many more embeddings are involved. Let's bet on the first one for now: keep the spectrogram and token generation as-is, and try a slightly more complex model. Say hello to SimpleLSTMClassifier. ...
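The post excerpt doesn't show the model itself, but a "slightly more complex" step up from embeddings-plus-pooling is presumably an LSTM over the token sequence. A hypothetical sketch (layer sizes, the 1000-token vocabulary from Part 4, and the 527-class AudioSet label set are all assumptions, not the post's actual code):

```python
import torch
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    # Hypothetical reconstruction: same embedding front end as before,
    # but run the token sequence through an LSTM and classify from the
    # final hidden state instead of average pooling.
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):            # tokens: (batch, seq_len) of token IDs
        x = self.embedding(tokens)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)        # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])           # (batch, num_classes) logits

tokens = torch.randint(0, 1000, (4, 250))  # 4 fake clips, 250 tokens each
logits = SimpleLSTMClassifier()(tokens)
```

For multi-label AudioSet targets, the logits would feed `BCEWithLogitsLoss` rather than a softmax.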

September 12, 2024 · Dan Avery

Audio Tokens Part 5: Back to Basics

OK, so things aren’t working as well as I’d hoped out of the box. Time to try a new box. Or a new metaphor. I’m going to try a super-simple network just to see if I can get something to fit. Enter SimpleTokenClassifier. Take the tokens, create embeddings, average pool them, shove them through a linear layer and see if anything useful comes out. With 50 tokens: ...
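The description "take the tokens, create embeddings, average pool them, shove them through a linear layer" maps onto a very small module. A minimal sketch of what SimpleTokenClassifier might look like (embedding width and the 527-class AudioSet label set are my assumptions; the post only specifies the 50-token vocabulary):

```python
import torch
import torch.nn as nn

class SimpleTokenClassifier(nn.Module):
    # Hypothetical sketch: embed each token, mean-pool over the sequence,
    # and project straight to per-class logits.
    def __init__(self, vocab_size=50, embed_dim=64, num_classes=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):        # tokens: (batch, seq_len) of token IDs
        x = self.embedding(tokens)    # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)             # average pool over the sequence
        return self.fc(x)             # (batch, num_classes) logits

tokens = torch.randint(0, 50, (8, 100))  # 8 fake clips, 100 tokens each
logits = SimpleTokenClassifier()(tokens)
```

Because the pooling throws away all ordering, this model can only learn from the *distribution* of tokens in a clip, which is exactly what makes it a useful sanity check.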

September 11, 2024 · Dan Avery

Audio Tokens Part 4: More Tokens!

Well, more types of tokens, anyway. Instead of a limited vocabulary of 50 tokens, let's try something a little more interesting. Like 1000.

[time passes]

Ran it, here's an excerpt:

2024-09-05 18:16:23,258 - INFO - Epoch 9
2024-09-05 18:16:23,258 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0824
2024-09-05 18:16:23,258 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0826
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0169]
Validating: 100%|████████████████████████████████████| 146/146 [00:20<00:00, 7.22it/s]
2024-09-05 18:24:33,739 - INFO - Epoch 10
2024-09-05 18:24:33,739 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0810
2024-09-05 18:24:33,739 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0893

That's…terrible. Let's try bumping the learning rate from 5e-5 to 1e-3 just to show that we're serious. ...
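The learning-rate bump is a one-argument change wherever the optimizer is built. A stand-in sketch (the post doesn't show which optimizer or model is used, so `Adam` and the tiny model here are assumptions):

```python
import torch

# Stand-in model; the real one is the token classifier under test.
model = torch.nn.Linear(8, 2)

# Bump the learning rate from 5e-5 to 1e-3 just to show we're serious.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # was lr=5e-5
```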

September 5, 2024 · Dan Avery

Audio Tokens Part 3: The First Run

Time to try the first run at training the model. Let's see what happens!

2024-09-05 13:59:48,289 - INFO - Epoch 1
2024-09-05 13:59:48,289 - INFO - Train Loss: 0.0429, Train F1 (macro): 0.4998, Train F1 (micro): 0.9946, Train Hamming Loss: 0.0054, Train mAP: 0.0106
2024-09-05 13:59:48,289 - INFO - Val Loss: 0.0210, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1074
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0196]
Validating: 100%|████████████████████████████████████| 146/146 [00:20<00:00, 7.16it/s]
2024-09-05 14:07:58,594 - INFO - Epoch 2
2024-09-05 14:07:58,594 - INFO - Train Loss: 0.0208, Train F1 (macro): 0.5210, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.1135
2024-09-05 14:07:58,594 - INFO - Val Loss: 0.0205, Val F1 (macro): 0.5629, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1144
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00, 2.22it/s, loss=0.0153]
Validating: 100%|████████████████████████████████████| 146/146 [00:20<00:00, 7.17it/s]
[...]
2024-09-05 15:21:42,568 - INFO - Epoch 11
2024-09-05 15:21:42,568 - INFO - Train Loss: 0.0177, Train F1 (macro): 0.5805, Train F1 (micro): 0.9963, Train Hamming Loss: 0.0037, Train mAP: 0.1851
2024-09-05 15:21:42,568 - INFO - Val Loss: 0.0188, Val F1 (macro): 0.5663, Val F1 (micro): 0.9963, Val Hamming Loss: 0.0037, Val mAP: 0.1548
Training: 100%|████████████████████████████████████| 1022/1022 [07:41<00:00, 2.21it/s, loss=0.0215]
Validating: 100%|████████████████████████████████████| 146/146 [00:20<00:00, 7.15it/s]
[...]
2024-09-05 16:27:17,502 - INFO - Epoch 19
2024-09-05 16:27:17,502 - INFO - Train Loss: 0.0116, Train F1 (macro): 0.6765, Train F1 (micro): 0.9968, Train Hamming Loss: 0.0032, Train mAP: 0.4616
2024-09-05 16:27:17,502 - INFO - Val Loss: 0.0208, Val F1 (macro): 0.5698, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1095
Training: 100%|████████████████████████████████████| 1022/1022 [07:41<00:00, 2.21it/s, loss=0.0116]
Validating: 100%|████████████████████████████████████| 146/146 [00:20<00:00, 7.13it/s]
2024-09-05 16:35:29,332 - INFO - Epoch 20
2024-09-05 16:35:29,332 - INFO - Train Loss: 0.0103, Train F1 (macro): 0.7092, Train F1 (micro): 0.9970, Train Hamming Loss: 0.0030, Train mAP: 0.5503
2024-09-05 16:35:29,332 - INFO - Val Loss: 0.0212, Val F1 (macro): 0.5895, Val F1 (micro): 0.9960, Val Hamming Loss: 0.0040, Val mAP: 0.1201

So, um. Reading the mAP values, it starts at bad, increases to slightly less bad, then the overfitting kicks in: training mAP gets excellent while validation mAP drops to bad again. Not great. ...
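For reference when reading those logs: Hamming loss is just the fraction of (example, class) prediction slots that disagree with the labels, which is why it sits near zero whenever most labels are negative. A minimal sketch (a hypothetical helper, not the project's actual metric code):

```python
def hamming_loss(y_true, y_pred):
    """Fraction of per-example, per-class slots where prediction != label.

    y_true and y_pred are lists of equal-length 0/1 label rows.
    """
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total

# Two examples, three classes; two of the six slots are wrong -> 1/3.
loss = hamming_loss([[1, 0, 0], [0, 1, 0]], [[1, 0, 1], [0, 0, 0]])
```

With sparse multi-label targets like AudioSet's, a model that predicts almost everything negative already scores a tiny Hamming loss, which is why mAP is the more honest number in these logs.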

September 5, 2024 · Dan Avery

Audio Tokens Part 2: The Architecture

Here's how things are set up now, in preparation for the first real training test. (I've skipped over the "is this thing basically working" phase here, since it wasn't that exciting. Take it as given that each component seems to be working as expected.)

Training Set

I'm using the AudioSet "bal_train" set, since it's only around 20,000 files. AudioSet files are 10-second clips from YouTube videos.

Validation Set

I'm using a very sophisticated technique in data_splitter.py to pull a validation set out of the bal_train set: if the YouTube ID associated with an audio clip starts with one of the characters [ABCDEF], it's a validation example. Assuming YouTube IDs are random-ish, this isn't totally terrible. ...
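The split rule as described is one line of logic. A sketch of what the check in data_splitter.py presumably boils down to (the function name is mine, not necessarily the file's; note that YouTube IDs are case-sensitive, so this matches uppercase A–F only):

```python
# Clips whose YouTube ID starts with one of these go to validation.
VALIDATION_PREFIXES = set("ABCDEF")

def is_validation_example(youtube_id: str) -> bool:
    """Deterministically route a clip to train or validation by its ID."""
    return youtube_id[0] in VALIDATION_PREFIXES

train_ids = [yid for yid in ["dQw4w9W", "Abc123X", "zz9pza0"]
             if not is_validation_example(yid)]
```

Because YouTube IDs use 64 case-sensitive characters, 6 of 64 first characters means roughly 9% of clips land in validation, and the split is stable across runs with no saved index file.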

September 4, 2024 · Dan Avery

Audio Tokens Part 1: The Task

[7 Oct 2024 update: There's a lot here in this worklog, dead ends and all. If you're looking for the tl;dr/wrap-up, may I suggest the final (for now) post?]

[23 Sep 2024 update: This is a work log largely for my personal use: "Remember kids, the only difference between screwing around and science is writing it down." There are mistakes and dead ends and obvious oversights and serious facepalm moments. My plan is to leave in all the screwups, not try to retcon things later into an "I knew this from the beginning" sort of thing. As of today, things are still in progress, and there are ten of these entries. I expect there will be more.] ...

September 3, 2024 · Dan Avery