Perch and MBARI Clips: Wrap-Up

(Previously: 1, 2, 3, 4, 5, 6, 7, 8)

It’s time to wrap up this investigation for now. I’ve been looking into how Google’s Perch 2.0 model handles raw clips from the MBARI Monterey Bay hydrophone, and specifically whether the embeddings it generates can be used to distinguish between: 1) the presence or absence of a humpback vocalization; and 2) different types of humpback vocalizations. Google’s earlier humpback detector model served as a direct comparison for (1) and an indirect comparison for (2).

These experiments had three primary axes:

the patterns under consideration: presence/absence or type of vocalization
the type of clips used: low-score humpback detector clips, high-score humpback detector clips, or a stratified sample of scores
the method for examining the clips: dimensionality-reduction visualization, or manual listening
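
As a rough illustration of the "stratified sample of scores" clip type, here is a minimal sketch of stratified sampling by detector score. It assumes scores in [0, 1] and equal-width bins; the function name, bin count, and per-bin sample size are hypothetical and are not the settings used in these experiments.

```python
import random
from collections import defaultdict


def stratified_sample(scores, n_bins=5, per_bin=2, seed=0):
    """Pick up to `per_bin` clip indices from each equal-width score bin.

    `scores` is a list of detector scores in [0, 1] (hypothetical input).
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, score in enumerate(scores):
        # Clamp so a score of exactly 1.0 lands in the top bin.
        b = min(int(score * n_bins), n_bins - 1)
        bins[b].append(idx)

    picked = []
    for b in sorted(bins):
        members = bins[b]
        # Sparse bins contribute everything they have.
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked
```

Sampling per bin rather than uniformly keeps the rare mid-range scores from being swamped by the much larger pools of clearly low or clearly high scores.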

Back to the matrix in part 5, with later results added:

Present or Not

Clip type	PCA/UMAP	Nearest-neighbor listening
High-confidence clips	Visually yes	Mostly yes
Lower-confidence clips	Not clearly	Not really
Stratified sample	–	Mixed yes

Types of Vocalizations¹

Question	Method	Result
Do “bloop” clips retrieve other bloop-like clips?	Nearest-neighbor listening	Apparently yes
Do lower-confidence / noisier vocalizations retrieve similar sounds?	Nearest-neighbor listening	Sort of, but scene/noise similarity matters
Do vocalization types form visible PCA/UMAP clusters?	PCA/UMAP	Not tested
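
The nearest-neighbor listening rows in both tables depend on ranking clips by embedding similarity to a query clip. A minimal sketch of that retrieval step follows, assuming the embeddings are already loaded as one row per clip and that cosine similarity is the metric; the function name is hypothetical, and the metric is an assumption rather than a confirmed detail of these experiments.

```python
import numpy as np


def nearest_neighbors(embeddings, query_index, k=5):
    """Return (indices, similarities) of the k clips closest to the query.

    Closeness is cosine similarity, which is an assumed choice of metric.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query clip itself

    order = np.argsort(-sims)[:k]  # highest similarity first
    return order.tolist(), sims[order].tolist()
```

The returned indices are what would be queued up for listening, in order from most to least similar.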

The most interesting observations:

The Perch embeddings seemed to represent the entire acoustic scene, not just the vocalizations. Lower-signal clips sat much closer to other lower-signal clips than to higher-signal clips, even when the same apparent vocalization was present.
Detector scores varied much more in negative/noisy neighborhoods than in obviously positive neighborhoods. That suggests the detector and Perch are responding to different things: the detector outputs a humpback-presence score possibly regardless of the type of vocalization, while Perch groups clips by overall acoustic similarity.
For one query clip that I had manually classified as negative, and that the detector also scored as a high-confidence negative, two of the nearest neighbors turned out to contain faint humpback vocalizations. The detector scored those two neighbors too low, and the Perch embeddings still placed them very close to a seemingly negative query clip. So relying on either method alone to correctly classify those clips as positive could be problematic.
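
One way to put a number on the second observation is to summarize the detector scores within each query's neighborhood and compare the spread. This is a minimal sketch, not the analysis used in these posts: the function name is hypothetical, and the scores in the usage lines are made-up toy values.

```python
from statistics import mean, pstdev


def neighborhood_score_spread(neighbor_indices, detector_scores):
    """Summarize the detector scores of one query clip's nearest neighbors."""
    scores = [detector_scores[i] for i in neighbor_indices]
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std dev over the neighborhood
        "min": min(scores),
        "max": max(scores),
    }


# Toy values only, chosen to mimic the pattern described above.
toy_scores = [0.95, 0.97, 0.96, 0.05, 0.60, 0.30]
positive_hood = neighborhood_score_spread([0, 1, 2], toy_scores)
noisy_hood = neighborhood_score_spread([3, 4, 5], toy_scores)
```

A neighborhood with a large standard deviation is one where Perch considers the clips similar but the detector disagrees with itself, which makes those clips good candidates for manual listening.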

Next Steps

The ulterior motive of all this testing was actually to see if Perch 2.0 would be a good model to add as a built-in pretrained model for AudioLoop, my in-development tool for creating labeled audio datasets from large unlabeled ones. The current starting models are intentionally quite simple, but it’s time to add a pretrained model to the mix. Given these Perch embedding results on untouched MBARI hydrophone data, and especially given that Perch wasn’t even trained with marine data, it seems like an obvious addition to AudioLoop’s general-audio labeling and dataset-exploration toolbelt.

¹ The full set of test combinations didn’t seem as useful here given the results of the stratified-sampling experiments, so I’ve collapsed this table accordingly.
