(Previously: 1, 2, 3, 4, 5, 6, 7, 8)
It’s time to wrap up this investigation for now. I’ve been looking into how Google’s Perch 2.0 model handles raw clips from the MBARI Monterey Bay hydrophone. I’ve been checking how Perch’s generated embeddings might be used to distinguish between: 1) the presence or absence of a humpback vocalization; and 2) different types of humpback vocalizations. I’ve been using the earlier Google humpback detector model for direct comparison in (1) and indirect comparison in (2).
These experiments had three primary axes:
- the patterns under consideration: presence/absence or type of vocalization
- the type of clips used: low-score humpback detector clips, high-score humpback detector clips, or a stratified sample of scores
- the method for examining the clips: dimensionality-reduction visualization, or manual listening
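One plausible way to draw the stratified sample mentioned above is to bin clips by detector score and pick a few from each bin. The sketch below assumes scores in [0, 1] and equal-width bins. The function name, bin count, and per-bin count are illustrative choices, not the settings used in these experiments.

```python
import random
from collections import defaultdict


def stratified_sample(scores, n_bins=5, per_bin=2, seed=0):
    """Pick up to `per_bin` clip indices from each equal-width score bin.

    Assumes every score lies in [0, 1] (hypothetical detector output range).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # Clamp so a score of exactly 1.0 lands in the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        bins[b].append(idx)

    sample = []
    for b in sorted(bins):
        members = bins[b]
        # Take everything if the bin holds fewer than `per_bin` clips.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

Sampling per bin keeps sparse score ranges represented instead of letting the most common scores dominate the listening set.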
Back to the matrix in part 5, with later results added:
Present or Not
| | PCA/UMAP | Nearest-neighbor listening |
|---|---|---|
| High-confidence clips | Visually yes | mostly yes |
| Lower-confidence clips | Not clearly | Not really |
| Stratified sample | – | mixed yes |
Types of Vocalizations [1]
| Question | Method | Result |
|---|---|---|
| Do “bloop” clips retrieve other bloop-like clips? | Nearest-neighbor listening | apparently yes |
| Do lower-confidence / noisier vocalizations retrieve similar sounds? | Nearest-neighbor listening | sort of, but scene/noise similarity matters |
| Do vocalization types form visible PCA/UMAP clusters? | PCA/UMAP | not tested |
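The nearest-neighbor listening in both tables amounts to ranking clips by how close their embeddings are to a query clip, then auditioning the top hits. Here is a minimal sketch, assuming cosine similarity as the metric (the metric is not stated here) and embeddings held as plain lists of floats. `nearest_neighbors` is a hypothetical helper, not part of Perch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeddings, query_idx, k=5):
    """Return (index, similarity) pairs for the k clips most similar to the query."""
    query = embeddings[query_idx]
    scored = [
        (i, cosine_similarity(query, emb))
        for i, emb in enumerate(embeddings)
        if i != query_idx  # skip the query clip itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With a large clip collection, a vectorized or approximate nearest-neighbor search would be the practical choice. The brute-force loop just makes the ranking explicit.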
The most interesting observations:
The Perch embeddings seemed to encode the entire acoustic scene, not just the vocalizations. Lower-signal clips sat much closer to other lower-signal clips than to higher-signal clips, even when the same apparent vocalization was present.
Detector scores varied much more in negative/noisy neighborhoods than in obviously positive neighborhoods. That suggests the detector and Perch are responding to different things: the detector outputs a humpback-presence score possibly regardless of the type of vocalization, while Perch groups clips by overall acoustic similarity.
For one query clip that I had manually classified as negative, and that the detector also scored as a high-confidence negative, two of the nearest neighbors turned out to contain faint humpback vocalizations. The detector scored those neighbors too low, and the Perch embeddings still placed them very close to a seemingly negative query clip. So relying on either method to correctly classify those clips as positive could be problematic.
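The second observation above (detector scores varying more in negative/noisy neighborhoods) can be put into numbers by measuring the spread of detector scores among each query clip's nearest neighbors. This is a rough sketch, assuming cosine similarity and a population standard deviation as the spread measure. `neighborhood_score_spread` and its parameters are hypothetical, and this is not necessarily how the comparison was made in these experiments.

```python
import math
import statistics


def neighborhood_score_spread(embeddings, detector_scores, query_idx, k=3):
    """Population std. dev. of detector scores among the query's k nearest neighbors."""

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.hypot(*a) * math.hypot(*b)  # Euclidean norms (Python 3.8+)
        return dot / norm if norm else 0.0

    query = embeddings[query_idx]
    # Rank every other clip by embedding similarity to the query.
    others = [i for i in range(len(embeddings)) if i != query_idx]
    others.sort(key=lambda i: cos(query, embeddings[i]), reverse=True)

    neighbor_scores = [detector_scores[i] for i in others[:k]]
    return statistics.pstdev(neighbor_scores)
```

A small spread means the detector agrees with itself across an embedding neighborhood. A large spread flags neighborhoods where the two models disagree, which is where manual listening is most informative.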
Next Steps
The ulterior motive of all this testing was actually to see if Perch 2.0 would be a good model to add as a built-in pretrained model for AudioLoop, my in-development tool for creating labeled audio datasets from large unlabeled ones. The current starting models are intentionally quite simple, but it’s time to add a pretrained model to the mix. Given these Perch embedding results on untouched MBARI hydrophone data, and especially given that Perch wasn’t even trained with marine data, it seems like an obvious addition to AudioLoop’s general-audio labeling and dataset-exploration toolbelt.
[1] The full set of test combinations didn’t seem as useful here given the results of the stratified-sampling experiments, so I’ve collapsed this table accordingly.