Audio Tokens Part 18: The Wrap-Up

Updated 2025-04-22 with a brief intro for context. This project explored whether short-time audio features (STFT slices) could be clustered into symbolic “tokens” and modeled using sequence architectures like BERT. It didn’t work out the way I’d hoped, but I figured out a lot about where this kind of approach breaks down and what might be worth trying next. (Also, I spent a few days chasing phantom performance gains thanks to a classic extend() vs append() bug.) ...
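For anyone landing on this index first, the core idea fits in a short sketch: slice each clip into STFT frames, cluster the frames with k-means, and replace each frame with the ID of its nearest centroid so a sequence model can treat the clip as tokens. This is a minimal illustration only, assuming a hypothetical train_paths list; the parameters and library choices are mine, not the project’s actual pipeline.

```python
# Sketch of the idea: STFT frames -> k-means "tokens". Illustrative only.
import numpy as np
import librosa
from sklearn.cluster import MiniBatchKMeans

def clip_to_frames(path, n_fft=512, hop_length=256):
    """Load a clip and return STFT magnitude frames, one row per time slice."""
    y, _ = librosa.load(path, sr=16000)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return spec.T  # shape: (num_frames, n_fft // 2 + 1)

# Fit the codebook on training clips only, then tokenize everything.
# train_paths is a hypothetical list of training audio file paths.
train_frames = np.concatenate([clip_to_frames(p) for p in train_paths])
codebook = MiniBatchKMeans(n_clusters=1024, random_state=0).fit(train_frames)

def tokenize(path):
    """Map each STFT frame of a clip to its nearest centroid's ID."""
    return codebook.predict(clip_to_frames(path))  # 1-D array of token IDs
```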

October 7, 2024 · Dan Avery

Audio Tokens Part 17: All Sane, So Far

Here’s my checklist from the last post:

- Look at a few more generated spectrograms. Do they look sane? Continue. Do they look insane? Fix the spectrograms! They look sane; moving on.
- Try the spectrograms with a standard vanilla CNN of the type known to work well on spectrograms (sketched below). Do the results improve significantly? End this round of the project and move on; it doesn’t work as-is. Do they not improve significantly? Keep going. The results are roughly the same as all the other models; the best val mAP I can get is around 0.03. Continuing. ...
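For reference, by “standard vanilla CNN” I mean something like a small Conv2d stack over the spectrogram treated as a one-channel image. A minimal sketch of that kind of model; the layer sizes are arbitrary, and num_classes=527 assumes AudioSet’s label set:

```python
# Bare-bones spectrogram CNN baseline. Sketch only; sizes are assumptions.
import torch.nn as nn

class SpecCNN(nn.Module):
    def __init__(self, num_classes=527):  # 527 = AudioSet classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (B, 128, 1, 1)
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.head(self.features(x).flatten(1))  # multi-label logits
```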

October 3, 2024 · Dan Avery

Audio Tokens Part 16: Sanity Checks for Everyone!

I’ve been on a bit of a tear the past few days (right now at commit 3550615). I separated the Dataset/DataLoader processing into its own classes, moved metrics calculation into its own class, and did a bit of cleanup refactoring, all so I could start sending the raw STFT vectors into models as embeddings and add a new dirt-simple baseline model. All of this is to figure out whether the consistently terrible val mAP results, which show up on every variation of model and hyperparameters, mean this idea doesn’t work as-is, or whether another bug in the preprocessing pipeline is mucking things up. ...
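And the dirt-simple baseline can be as little as mean-pooling the raw STFT frames over time with one linear layer on top: if nothing beats this, the problem is upstream of the models. A sketch of that shape of sanity check, not the actual baseline class; the dimensions are assumptions:

```python
# Mean-pool-plus-linear sanity baseline. Sketch only.
import torch.nn as nn

class MeanPoolBaseline(nn.Module):
    def __init__(self, n_freq_bins=257, num_classes=527):
        super().__init__()
        self.head = nn.Linear(n_freq_bins, num_classes)

    def forward(self, x):  # x: (batch, time_frames, freq_bins) raw STFT vectors
        return self.head(x.mean(dim=1))  # average over time, then classify
```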

October 3, 2024 · Dan Avery

Audio Tokens Part 15: Onward or Not?

Carrying on from the last post, I tried several different combinations of STFT, LSTM, and clustering parameters. Without going into detail about all the things I tried, we can just skip to the results: ...

September 30, 2024 · Dan Avery

Audio Tokens Part 14: Back to Square 0.01

For reasons that may be obvious from the previous post, I needed to take a day off. Way too much time spent on a Python append/extend mixup for my taste. That bug had been in ModelTrainer since the initial commit, so all the metrics from the past few weeks were invalid. Now that it’s fixed, the real, unadulterated results from the current setup are really, really, really not good. ...
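For anyone who hasn’t been bitten by it: append() adds its argument as a single nested element, while extend() splices the elements in, so an accumulator built with the wrong one silently feeds the wrong shapes to every metric downstream. A generic illustration, not the actual ModelTrainer code:

```python
# The classic mixup: append() nests, extend() splices.
batch_preds = [0.2, 0.9, 0.1]

all_preds = []
all_preds.append(batch_preds)  # -> [[0.2, 0.9, 0.1]]: one nested list
all_preds.extend(batch_preds)  # -> [[0.2, 0.9, 0.1], 0.2, 0.9, 0.1]

# With append() where extend() was meant, len(all_preds) counts batches
# rather than predictions, and metrics computed over it are quietly wrong.
```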

September 27, 2024 · Dan Avery

Audio Tokens Part 13: The Big Bug

Batch clustering loops are fixed, so large training sets are doable now. Let’s do a run with ~80,000 samples from the training set and ~8,000 for validation: ...

September 25, 2024 · Dan Avery

Audio Tokens Part 12: A Minor Bug Squash

I decided that rather than find out why things were so much worse before that one commit, I’d look for data leakage in the newer, better-performing code. And I haven’t found any. The train/dev split is straightforward, using Python indexing of a shuffled list of YouTube IDs. The centroids are computed only from training data. And the model training keeps the train and val data separate in two different Dataset classes. ...
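In outline, that hygiene looks something like the sketch below; the names (all_youtube_ids, frames_for, fit_kmeans) are hypothetical stand-ins, not names from the repo:

```python
# Leak-free split and clustering, in outline. Illustrative only.
import random

random.seed(42)
ids = list(all_youtube_ids)  # hypothetical: every clip's YouTube ID
random.shuffle(ids)

split = int(0.9 * len(ids))
train_ids, val_ids = ids[:split], ids[split:]  # plain Python slicing

# The codebook (cluster centroids) is fit on training clips only;
# validation clips are mapped onto those frozen centroids afterwards.
codebook = fit_kmeans(frames_for(train_ids))   # hypothetical helpers
val_tokens = codebook.predict(frames_for(val_ids))
```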

September 24, 2024 · Dan Avery

Audio Tokens Part 11: The Hunt

Today I’m hunting down which commit moved my validation mAP from 0.15 to 0.41. One of two things happened: either a large bug was fixed and performance genuinely improved, or a large bug was introduced and the improvement is a lie. Time to track that down, so I’m going back commit by commit to find where the bump happened. Apparently I made so many small housekeeping commits back then that I just missed the big jump in reported performance. Let’s go back and find it: ...
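This is the kind of search git bisect automates when you can script a pass/fail test; since the signal here is a metric rather than a boolean, a plain loop over commits works too. A sketch, assuming a hypothetical evaluate() that trains briefly on the current checkout and returns val mAP:

```python
# Walk the history oldest-to-newest and record val mAP per commit.
# evaluate() is a hypothetical stand-in for a quick train/eval run.
import subprocess

commits = subprocess.run(
    ["git", "rev-list", "--reverse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.split()

for sha in commits:
    subprocess.run(["git", "checkout", sha], check=True, capture_output=True)
    print(sha[:8], evaluate())  # watch for where ~0.15 jumps to ~0.41
```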

September 23, 2024 · Dan Avery

Audio Tokens Part 10: Ruling Out the Obvious, Leaving Only the Obvious

I spent most of yesterday scouring the code for train/dev data leakage, without much luck. I also fixed up the 1D pre-processing convolution I thought I’d gotten working before, so it works now, and got most of the code ready for using more of the AudioSet training data, which mostly meant batching things instead of keeping the entire active training dataset in memory. But back to the weird thing about those high metrics. ...

September 20, 2024 · Dan Avery

Audio Tokens Part 9: What the bug?

I’m going to crack open the AudioSet unbalanced_train set here; only 20,000 training examples may not be enough. The unbalanced train set has around 2 MILLION examples. (The final eval set in AudioSet is only 20,000 items, which seems like not nearly enough, but then again I’m no fancy big-city Google researcher.) That means I may need to stop keeping everything in memory during preprocessing, now doesn’t it? So I rewrote a bunch of stuff yesterday to allow for a larger training set, consolidate the train/dev split code, and clean up the TokenizedSpecDataset class. ...
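Getting out of memory mostly means a Dataset that holds only paths and loads each example on demand. A minimal sketch of that lazy-loading shape; this is my illustration (including the one-.npy-per-clip format), not the actual TokenizedSpecDataset:

```python
# Lazy-loading Dataset: keep only file paths in memory and read each
# tokenized clip from disk on demand. Sketch only; on-disk format assumed.
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyTokenDataset(Dataset):
    def __init__(self, token_paths, labels):
        self.token_paths = token_paths  # list of .npy paths, one per clip
        self.labels = labels            # matching multi-hot label vectors

    def __len__(self):
        return len(self.token_paths)

    def __getitem__(self, idx):
        tokens = np.load(self.token_paths[idx])  # loaded only when requested
        return torch.from_numpy(tokens).long(), self.labels[idx]
```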

September 19, 2024 · Dan Avery