For my next experiment testing Perch 2.0’s effectiveness in identifying humpback vocalizations in raw MBARI hydrophone clips, it’s time to take a step back.
In the first three posts (one, two, three), I was looking at whether Perch embeddings could be used to determine the presence or absence of a humpback vocalization in a particular clip. In one test, I compared high-confidence positive (>=90% score) and high-confidence negative (<=5%) clips, as scored by the older humpback detector model. In a later test, I used lower-confidence clips (10-30% and 70-90% scores).
The results were mixed: using PCA and UMAP dimensionality reduction plots, the high-confidence clips seemed to be easily visually separable into clusters in the 2D embedding projections (with a bonus third cluster of dolphin vocalizations). For the lower-confidence clips, the clusters were much harder to differentiate. I then checked the lower-confidence dataset result with manual nearest-neighbor listening, which also didn’t show clear differentiation between whale and non-whale clips among the neighbors.
But what I didn’t do was test the higher-confidence clips with that same method: manually listening to sample clips’ nearest embedding neighbors. That might be a useful control, but it’s probably not the best next test. Given the obvious PCA/UMAP separation, I’d expect the high-confidence nearest-neighbor results to be pretty unsurprising.
However, it’s still worth performing the same nearest-neighbor check over a stratified sample covering all buckets of detector scores. The point of this testing is to see if Perch embeddings can be used out-of-the-box to determine the presence or absence of a humpback vocalization in a clip; a more varied test set should answer that question better than the previous ones. So I’m going to check that out now:
First clip: a high-confidence positive clip (peak detector score: 0.996).
Now the fifteen nearest neighbors, again by cosine similarity, again pulled from the large sample of ~9000 clips from December 21, 2016, stratified across all score buckets:
Not a huge surprise here. They’re all positive clips.[1] It’s slightly more interesting to see that all of those clips are actually in the >=90% score bucket, so the embeddings here line up with the humpback detector for these easy examples.
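For reference, a cosine-similarity neighbor lookup like the one described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the code behind these results: the function name `cosine_nearest_neighbors` is hypothetical, and the toy array stands in for the real Perch embedding matrix (one row per clip).

```python
import numpy as np


def cosine_nearest_neighbors(embeddings, query_index, k=15):
    """Return (indices, similarities) of the k rows closest to the query row."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    # Exclude the query itself so it is never returned as its own neighbor.
    sims[query_index] = -np.inf
    k = min(k, len(sims) - 1)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Toy stand-in data (made up): 6 "clips" with 3-dimensional embeddings.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
])
neighbors, scores = cosine_nearest_neighbors(toy, query_index=0, k=2)
print(neighbors, scores)
```

With real data, `embeddings` would be the full clip-by-dimension matrix and `k` would be 15 for the first pass or larger for a deeper pull.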
But it would probably be more useful to pull a deeper group of neighbors and find some clips that don’t fall in that >=90% bucket: clips where the embedding neighborhood doesn’t match up quite so nicely with the detector model.
I pulled the nearest 150 embeddings. Only 13 of them are outside the >=90% bucket: 11 in the 70-90% bucket, one in the 50-70% bucket, and one in the 10-30% (!) bucket.
Here are two 70-90% examples, followed by the 50-70% clip, then the 10-30% clip.
The last one is faint, but they’re all humpback vocalizations. So for this high-confidence positive query, the Perch neighborhood stays very humpback-heavy even when looking past the top fifteen neighbors. The detector scores vary a bit in the deeper neighbors, but the manually audible content still seems consistently humpback, even for the few clips that aren’t in the detector’s highest-confidence bucket.
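The bucket tally for a deeper group of neighbors can be sketched as follows. This is an illustrative sketch only: the names `score_bucket` and `tally_buckets` are hypothetical, the 30-50% bucket and the handling of scores that fall exactly on a boundary are assumptions (the posts only name the other buckets), and the example scores are made up rather than taken from the real neighbors.

```python
from collections import Counter

# Upper edges of each bucket below the top one. The 30-50% bucket and the
# inclusive/exclusive boundary handling are assumptions for illustration.
BUCKET_EDGES = [0.05, 0.10, 0.30, 0.50, 0.70, 0.90]
BUCKET_LABELS = ["<=5%", "5-10%", "10-30%", "30-50%", "50-70%", "70-90%", ">=90%"]


def score_bucket(score):
    """Map a peak detector score in [0, 1] to its bucket label."""
    # The top bucket is inclusive at 0.90, matching the ">=90%" label.
    if score >= 0.90:
        return BUCKET_LABELS[-1]
    for edge, label in zip(BUCKET_EDGES, BUCKET_LABELS):
        if score <= edge:
            return label
    raise ValueError(f"score out of range: {score!r}")


def tally_buckets(neighbor_scores):
    """Count how many neighbors fall into each detector-score bucket."""
    return Counter(score_bucket(s) for s in neighbor_scores)


# Made-up detector scores for five hypothetical neighbors.
example_scores = [0.99, 0.95, 0.85, 0.62, 0.21]
tally = tally_buckets(example_scores)
for label in BUCKET_LABELS:
    print(f"{label:>7}: {tally.get(label, 0)}")
```

Filtering the neighbor list to labels other than `">=90%"` gives the short list of clips worth listening to by hand.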
Second clip: a high-confidence negative clip (peak detector score: 0.013).
Now the fifteen nearest neighbors:
(If you listened to all of those, my apologies. You just listened to 75 seconds of noise.)
Two interesting things here:
First, I’ve added the detector-assigned buckets by hand to those clip names, and there’s a lot more variety in what the detector assigned to these neighbors than there was with the positive example’s neighbors. It would appear that the detector is assigning more humpback probability to these seemingly all-noise clips than their Perch embedding neighborhood might suggest. Perhaps Perch is grouping these clips by their overall noise profile similarities, while the detector is assigning a wider variety of buckets when it doesn’t have much signal to work with.
Second, on close listening, two of the clips (2911 and 1823) are actually very faint positives. In Perch’s case, their embeddings are quite near our negative sample clip, and in the detector’s case, they got scores in the 10-30% and 5-10% buckets. So it seems as if neither approach is reliably separating those faint positives from the noise-only clips.
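One possible way to put a number on "more variety" in the detector buckets of a neighborhood is the Shannon entropy of the neighbors' bucket labels. This is a suggestion rather than something used for the results above: the function name `bucket_entropy` is hypothetical, and both label lists are made up for illustration.

```python
import math
from collections import Counter


def bucket_entropy(labels):
    """Shannon entropy in bits of a list of bucket labels.

    0.0 means every neighbor shares one bucket; larger values mean the
    neighbors are spread across more buckets.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Made-up neighborhoods: one where all fifteen neighbors share a bucket,
# and one spread evenly over three buckets.
uniform = [">=90%"] * 15
mixed = ["<=5%"] * 5 + ["5-10%"] * 5 + ["10-30%"] * 5

print(bucket_entropy(uniform))
print(bucket_entropy(mixed))
```

A neighborhood confined to a single bucket scores zero, so comparing this value for a positive query and a negative query would give a rough measure of how much more the detector's buckets vary in one neighborhood than the other.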
That makes the presence/absence question look harder than the sound-similarity question: Perch can group similar acoustic scenes, but faint vocalizations may not be enough to move a clip out of a noisy neighborhood.
For the next post, I’ll run the same nearest-neighbor checks with lower-confidence positive and negative query clips.
[1] And very similar. In listening to these, I occasionally had to check filenames to make sure I wasn’t actually listening to the same clip over and over.