Audio Tokens Part 15: Onward or Not?
Carrying on from the last post, I tried several different combinations of STFT, LSTM, and clustering parameters. Without going into detail about all the things I tried, we can just skip to the results: ...
For reasons that may be obvious from the previous post, I needed to take a day off. Way too much time spent on a Python append/extend mixup for my taste. That bug has been in ModelTrainer from the initial commit. So all the metrics from the past few weeks were invalid. Now that that is fixed, the real, unadulterated results from the current setup are really, really, really not good. ...
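The append/extend mixup is an easy one to make. A minimal illustration of the failure mode (hypothetical variable names, not the actual ModelTrainer code):

```python
# Minimal illustration of an append/extend mixup (hypothetical names,
# not the actual ModelTrainer code).
batch_preds = [[0.1, 0.9], [0.8, 0.2]]  # one batch of per-example predictions

# Bug: append adds the whole batch as ONE nested element...
preds_buggy = []
preds_buggy.append(batch_preds)

# Fix: extend splices the batch's rows in individually.
preds_fixed = []
preds_fixed.extend(batch_preds)

print(len(preds_buggy))  # 1  (one nested batch, wrong shape for metrics)
print(len(preds_fixed))  # 2  (two prediction rows, as intended)
```

Downstream metric code that iterates over the "flat" list silently gets one nested batch instead of per-example rows, which is exactly the kind of bug that corrupts metrics without crashing anything.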
Batch clustering loops fixed, so large training sets are doable now. Let’s run a ~80,000-example sample from the training sets, with ~8,000 held out as validation: ...
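I don’t know exactly how the project’s batch clustering loop is written, but one standard way to cluster a training set too big to fit comfortably in memory is scikit-learn’s MiniBatchKMeans with `partial_fit` per batch. A sketch with stand-in random data (all names hypothetical):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for streaming STFT frames off disk in chunks,
# so the full training set never sits in memory at once.
def frame_batches(n_batches=10, batch_size=256, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        yield rng.standard_normal((batch_size, dim)).astype(np.float32)

kmeans = MiniBatchKMeans(n_clusters=50, random_state=0)
for batch in frame_batches():
    kmeans.partial_fit(batch)  # incremental centroid update per batch

# Token assignment for new frames is just nearest-centroid lookup.
tokens = kmeans.predict(next(frame_batches(n_batches=1)))
print(tokens.shape)  # (256,)
```

The centroids converge more noisily than full-batch K-Means, but memory use is bounded by one batch rather than the whole training set.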
I decided that rather than find out why things were so much worse before that one commit, I’d look for data leakage in the newer, better-performing code. And I haven’t found any. The train/dev split is straightforward, using Python indexing of a shuffled list of YouTube IDs. The centroids are computed only from training data. And the model training keeps the train and val data separate in two different Dataset classes. ...
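The split described above is simple enough to sketch; variable names here are hypothetical, not the project’s actual code:

```python
import random

# Sketch of the split described above: shuffle the YouTube IDs once,
# slice with plain indexing, and derive everything downstream
# (centroids included) from the train side only.
video_ids = [f"yt_{i:05d}" for i in range(1000)]
rng = random.Random(42)  # fixed seed so the split is reproducible
rng.shuffle(video_ids)

split = int(0.9 * len(video_ids))
train_ids, dev_ids = video_ids[:split], video_ids[split:]

# Quick leak check: no ID may appear on both sides.
assert not set(train_ids) & set(dev_ids)
print(len(train_ids), len(dev_ids))  # 900 100
```

Splitting on video IDs rather than on individual clips is the important part: clips from the same video are correlated, so a clip-level split can leak even when the code looks clean.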
Today I’m hunting down which commit moved my validation mAP from 0.15 to 0.41. One of two things happened here: either a large bug was fixed and performance genuinely improved, or a large bug was introduced and the improved performance is a lie. Time to track that down, so I’m going back commit by commit to find where the bump happened. Apparently I did so many small housekeeping commits back there that I just missed the big jump in reported performance. Let’s go back and find it: ...
I spent most of yesterday scouring the code for train/dev data leakage, without much luck. I also fixed up the 1D pre-processing convolution I thought I had gotten working before, so it works now. But back to the weird thing about those high metrics. I also got most of the code ready for using more of the AudioSet training data, which mostly meant batching things instead of keeping the entire active training dataset in memory. ...
I’m going to crack open the AudioSet unbalanced_train set here. Only 20,000 training examples may not be enough. The unbalanced train set has around 2 MILLION examples. (The final eval set in AudioSet is only 20,000 items, which seems like not nearly enough, but then again I’m no fancy big-city Google researcher.) Which means I may need to stop keeping everything in memory during preprocessing, now doesn’t it? So I rewrote a bunch of stuff yesterday to allow for a larger training set, consolidate the train/dev split code, and clean up the TokenizedSpecDataset class. ...
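The core of the batching rewrite is just processing examples in fixed-size chunks rather than materializing the whole set. A generator-based sketch (all names hypothetical, not the actual TokenizedSpecDataset code):

```python
# Sketch of batched preprocessing: read examples lazily and process them
# in fixed-size chunks instead of loading millions into memory at once.
# All names are hypothetical stand-ins for the project's actual code.
def load_example(example_id):
    # Stand-in for reading one clip's features off disk.
    return [float(example_id)] * 4

def batched(ids, batch_size):
    batch = []
    for ex_id in ids:
        batch.append(load_example(ex_id))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # don't drop the final partial batch
        yield batch

batches = list(batched(range(10), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Peak memory is now one batch plus whatever the consumer holds, so the same loop works whether the ID list has 20,000 entries or 2 million.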
[Update 19 Sep: Funny thing, this actually didn’t work out as planned. I forgot to turn tokenization on for these examples. So the change below looks to be a random blip of some sort, not from adding convolution. Going to try to really add it later. Also, the comment below about accidentally running the same data over and over again was sort of true.] Before switching to major, focus-shifting architecture changes here, I figured I’d try something simpler first. From the last update, I’m taking the first option: run a basic 1D convolution on the STFT time slices to try to make them a bit more frequency-invariant. Eight kernels, kernel size 3, and just concatting them together to create the new input for the K-Means clustering. The previous input, recall, was just the entire STFT time-slice vector as-is. ...
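The shape arithmetic for that transform is worth pinning down. A NumPy sketch of the idea, using random kernels as stand-ins (the real setup would presumably learn or hand-pick them):

```python
import numpy as np

# Sketch of the transform above: run 8 small 1D kernels over one STFT
# time slice (along the frequency axis) and concatenate the outputs to
# form the new K-Means input. Random kernels here are stand-ins, not
# the project's actual filters.
rng = np.random.default_rng(0)
n_freq = 128
frame = rng.standard_normal(n_freq)    # one STFT time slice
kernels = rng.standard_normal((8, 3))  # 8 kernels, kernel size 3

# "same" convolution keeps each output at n_freq bins, so the
# concatenated vector is 8 * 128 = 1024 dimensions.
features = np.concatenate([np.convolve(frame, k, mode="same") for k in kernels])
print(features.shape)  # (1024,)
```

Note this makes the clustering input 8x wider than the raw slice; each kernel only mixes 3 adjacent frequency bins, so the invariance gained is local, not global.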
Given the current (lack of) performance of this model, I’m spending some time rethinking some of the basic ideas. For those of you following along at home, you may have been shouting at the screen since Part 1 trying to get me to see one large flaw in my tokenizing setup. Every STFT time slice is a set of particular frequencies at a moment in time. Which would probably work fine if the labels were about, for example, recognizing individual dogs. But our labels here are things like “dog barking” vs. “pneumatic drill” not “cute little Fluffy barking” vs “Butch barking right before he eats someone”. Some dogs have high frequency barks and some have low frequency barks, and STFT vectors from those two dog barks would be nowhere near each other in the vector space. ...
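A toy two-vector example makes the flaw concrete: the "same" bark at two different pitches produces STFT slices that are maximally far apart under plain Euclidean distance, so K-Means will never co-cluster them.

```python
import numpy as np

# Toy illustration of the flaw: identical spectral shape at two
# different pitches gives STFT slices with zero overlap.
n_freq = 64
low_bark = np.zeros(n_freq)
high_bark = np.zeros(n_freq)
low_bark[5:10] = 1.0    # energy concentrated in low-frequency bins
high_bark[40:45] = 1.0  # same shape, shifted to high-frequency bins

# The vectors are orthogonal: the distance is as large as possible
# for these magnitudes, despite the shapes being identical.
print(np.linalg.norm(low_bark - high_bark))  # ~3.162 (= sqrt(10))
```

Any purely bin-wise distance has this problem; getting "dog bark"-level invariance means comparing spectral shape in a pitch-shift-tolerant way, not raw bin values.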
I have about 3,000 web bookmarks I’ve collected over the years. The current collection started somewhere around 2010. I had them in Pinboard for a while, then moved over to Raindrop.io. I treat bookmarks the way most people do: Find an interesting web page somewhere. Say to myself “Wow, that looks neat! Maybe important! I need to read that!”. Bookmark it. Let the bookmark sit unused for several years. So maybe it’s time to go off-script. My hope is, every so often, to pull up a randomly fetched bookmark, finally read whatever it’s pointing to, and post it here with or without some trenchant commentary.