perch-mbari on This Might Be Something!

Perch and MBARI Clips: Wrap-Up

Wed, 27 May 2026 10:26:34 -0700

(Previously: 1, 2, 3, 4, 5, 6, 7, 8)

It’s time to wrap up this investigation for now. I’ve been looking into how Google’s Perch 2.0 model handles raw clips from the MBARI Monterey Bay hydrophone. I’ve been checking how Perch’s generated embeddings might be used to distinguish between: 1) the presence or absence of a humpback vocalization; and 2) different types of humpback vocalizations. I’ve been using the earlier Google humpback detector model for direct comparison in (1) and indirect comparison in (2).

These experiments had three primary axes:

the patterns under consideration: presence/absence or type of vocalization
the type of clips used: low-score humpback detector clips, high-score humpback detector clips, or a stratified sample of scores
the method for examining the clips: dimensionality-reduction visualization, or manual listening

Back to the matrix in part 5, with later results added:

Present or Not

	PCA/UMAP	Nearest-neighbor listening
High-confidence clips	Visually yes	mostly yes
Lower-confidence clips	Not clearly	Not really
Stratified sample	–	mixed yes

Types of Vocalizations¹

Question	Method	Result
Do “bloop” clips retrieve other bloop-like clips?	Nearest-neighbor listening	apparently yes
Do lower-confidence / noisier vocalizations retrieve similar sounds?	Nearest-neighbor listening	sort of, but scene/noise similarity matters
Do vocalization types form visible PCA/UMAP clusters?	PCA/UMAP	not tested

The most interesting observations:

The Perch embeddings seemed to work on the entire scene, not just the vocalizations. They suggest significantly closer relationships between lower-signal clips and other lower-signal clips than between lower-signal and higher-signal clips, even when the same apparent vocalization is present.
Detector scores varied much more in negative/noisy neighborhoods than in obviously positive neighborhoods. That suggests the detector and Perch are responding to different things: the detector outputs a humpback-presence score possibly regardless of the type of vocalization, while Perch groups clips by overall acoustic similarity.
For a query clip that both my listening and the detector (with high confidence) classified as negative, two of the nearest neighbors turned out to contain faint humpback vocalizations. The detector model scored those examples low, and the Perch embeddings still placed them very close to a seemingly negative query clip. So relying on either method alone to correctly classify those clips as positive could be problematic.

Next Steps

The ulterior motive of all this testing was actually to see if Perch 2.0 would be a good model to add as a built-in pretrained model for AudioLoop, my in-development tool for creating labeled audio datasets from large unlabeled ones. The current starting models are intentionally quite simple, but it’s time to add a pretrained model to the mix. Given these Perch embedding results on untouched MBARI hydrophone data, and especially given that Perch was reportedly trained with almost no marine data, it seems like an obvious addition to AudioLoop’s general-audio labeling and dataset-exploration toolbelt.

The full set of test combinations didn’t seem as useful here given the results of the stratified-sampling experiments, so I’ve collapsed this table accordingly. ↩︎

Perch and MBARI Clips: Low Confidence Neighbors

Fri, 22 May 2026 14:22:39 -0700

Time to continue looking into whether Perch 2.0’s embeddings make it easy to distinguish between humpback-present and humpback-absent clips in raw MBARI hydrophone recordings.

Last time, I looked at two clips with high-confidence scores from Google’s humpback detector model. Then I compared their Perch-computed embeddings to their nearest neighbors using a ~9000-clip sample stratified across detector-score buckets.

For the high-confidence humpback-positive clip, the Perch embedding’s 150 nearest neighbors and the detector scores seemed to be in line: the neighbors were almost all in the detector’s highest-confidence bucket.

For the high-confidence humpback-negative clip, the 15 nearest neighbors had detector scores that varied much more: from the <=5% bucket all the way up to the 50-70% bucket. In addition, on manual listening it was noteworthy that two of those clips were actually faint positives that had low detector-model scores while their embeddings were still in the very close neighborhood of the known negative test clip. This suggested that classifying faint positives using either the detector model or Perch embeddings could be difficult.

It’s time to try the same experiment, but with lower-confidence positive and negative clips (70-90% and 10-30% detector model scores) and see if their embedding neighborhoods have the same characteristics.

First up, a lower-confidence positive clip (peak detector score: 0.802)

7273

Sounds like a humpback, but definitely not as high-signal as the high-confidence clips, as expected.

Now the 15 nearest neighbors, with cosine similarity and detector-score bucket:

6819(0.93)(70-90%)

7094(0.93)(70-90%)

7297(0.92)(70-90%)

3163(0.92)(10-30%)

8805(0.92)(>=90%)

6964(0.92)(70-90%)

8895(0.92)(>=90%)

7283(0.92)(70-90%)

8920(0.92)(>=90%)

6910(0.92)(70-90%)

6679(0.92)(70-90%)

5543(0.92)(50-70%)

6721(0.92)(70-90%)

7127(0.92)(70-90%)

7110(0.91)(70-90%)

There’s more variety in the detector-score buckets than we had with the high-confidence clip. Ten of the fifteen neighbors are also in the 70–90% bucket, three are in the >=90% bucket, one is in 50–70%, and one is in 10–30%. Since those clips also sound humpback-positive, Perch is still retrieving humpback-positive neighbors for a humpback-positive query clip. The tougher question is whether Perch is grouping these embeddings by humpback presence alone, or by the larger acoustic scene.
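The bucket tally above can be reproduced with a few lines of Python. This is only a sketch: the tuples are copied by hand from the neighbor list in this post, and the variable names are made up for illustration rather than taken from any actual tooling.

```python
from collections import Counter

# (clip id, cosine similarity, detector bucket) for the 15 neighbors of clip
# 7273, copied by hand from the list above.
neighbors = [
    (6819, 0.93, "70-90%"), (7094, 0.93, "70-90%"), (7297, 0.92, "70-90%"),
    (3163, 0.92, "10-30%"), (8805, 0.92, ">=90%"), (6964, 0.92, "70-90%"),
    (8895, 0.92, ">=90%"), (7283, 0.92, "70-90%"), (8920, 0.92, ">=90%"),
    (6910, 0.92, "70-90%"), (6679, 0.92, "70-90%"), (5543, 0.92, "50-70%"),
    (6721, 0.92, "70-90%"), (7127, 0.92, "70-90%"), (7110, 0.91, "70-90%"),
]

# Count how many neighbors fall into each detector-score bucket.
bucket_counts = Counter(bucket for _, _, bucket in neighbors)

for bucket, count in bucket_counts.most_common():
    print(f"{bucket}: {count}")
```

The same tally works for any neighbor list, as long as each clip carries its detector bucket along with it.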

Also of note is that the 10-30% score clip seems pretty clearly to be a positive. That looks like a miss by the humpback detector. But it shows up in the Perch neighborhood with all the other positive clips.

Moving to a lower-confidence negative clip (peak detector score: 0.200)

3376

This is an interesting clip because it has some sort of faint unidentifiable sound that might be a humpback.

And the 15 nearest neighbors:

1892(0.96)(5-10%)

4111(0.96)(30-50%)

2860(0.96)(10-30%)

896(0.96)(<=5%)

3524(0.96)(10-30%)

963(0.96)(<=5%)

910(0.96)(<=5%)

2080(0.96)(5-10%)

974(0.96)(<=5%)

908(0.96)(<=5%)

982(0.96)(<=5%)

933(0.96)(<=5%)

3222(0.96)(10-30%)

956(0.96)(<=5%)

3535(0.96)(10-30%)

Note that all of the clips are very close together in the embedding space. This is a dense neighborhood. They appear to my ears to be mostly noise, with the occasional faint, evocative, yet unidentifiable sound thrown in. And the humpback detector is pretty confident that many of them are negative: for many of these clips, it’s more confident about the lack of a humpback than it is for the original clip.

For lower-confidence negatives, Perch seems to be putting this clip in a dense neighborhood of similar noise-like clips. But all of those clips seem fair to put into the negative category–the neighborhood placement generally agrees with the low scores of the detector.

It’s about time to wrap up this set of experiments. In the next post I’ll do just that.

Perch and MBARI Clips: Noisy Neighbors

Thu, 21 May 2026 12:19:14 -0700

For my next experiment testing Perch 2.0’s effectiveness in identifying humpback vocalizations in raw MBARI hydrophone clips, it’s time to take a step back.

In the first three posts (one, two, three), I was looking at whether Perch embeddings could be used to determine the presence or absence of a humpback vocalization in a particular clip. In one test, I checked high-confidence-positive(>=90% score) vs. high-confidence-negative(<=5%) clips as scored by the older humpback detector model. In a later test, I used lower-confidence clips (10-30% and 70-90% scores).

The results were mixed: using PCA and UMAP dimensionality reduction plots, the high-confidence clips seemed to be easily visually separable into clusters in the 2D embedding projections (with a bonus third cluster of dolphin vocalizations). For the lower-confidence clips, the clusters were much harder to differentiate. I then checked the lower-confidence dataset result with manual nearest-neighbor listening, which also didn’t show clear differentiation between whale and non-whale clips among the neighbors.

But what I didn’t do is test the higher-confidence clips with that same method: manual listening to sample clips’ nearest embedding neighbors. That might be a useful control, but it’s probably not the best next test. Given the obvious PCA/UMAP separation, I’d expect the high-confidence nearest-neighbor results to be pretty unsurprising.

However, it’s still worth performing the same nearest-neighbor check over a stratified sample covering all buckets of detector scores. The point of this testing is to see if Perch embeddings can be used out-of-the-box to determine the presence or absence of a humpback vocalization in a clip; a more varied test set should answer that question better than the previous ones. So I’m going to check that out now:

First clip: a high-confidence positive clip (peak detector score: 0.996).

7925

Now the fifteen nearest neighbors, again by cosine similarity, again pulled from the large sample of ~9000 clips stratified across all score buckets from December 21, 2016:

8206(0.87)

8069(0.86)

8041(0.85)

8305(0.85)

7907(0.85)

7738(0.84)

8142(0.84)

8357(0.84)

7866(0.82)

8016(0.82)

8113(0.82)

7984(0.81)

7940(0.81)

7868(0.81)

8098(0.81)

Not a huge surprise here. They’re all positive clips.¹ It’s slightly more interesting to see that all of those clips are actually in the >=90% score bucket, so the embeddings here line up with the humpback detector for these easy examples.

But it would probably be more useful to pull a deeper group of neighbors and find some clips that don’t fall in that >=90% bucket–clips where the embedding neighborhood doesn’t match up quite so nicely with the detector model.

I pulled the nearest 150 embeddings. Only 13 of them are outside the >=90% bucket: 11 in the 70-90% bucket, one in the 50-70% bucket, and one in the 10-30%(!) bucket.

Here are two 70-90% examples, followed by the 50-70% clip, then the 10-30% clip.

6388(0.72)

6774(0.69)

4964(0.65)

2631(0.66)

The last one is faint, but they’re all humpback vocalizations. So for this high-confidence positive query, the Perch neighborhood stays very humpback-heavy even when looking past the top fifteen neighbors. The detector scores vary a bit in the deeper neighbors, but the manually audible content still seems consistently humpback, even for the few clips that aren’t in the detector’s highest-confidence bucket.

Second clip: a high-confidence negative clip (peak detector score: 0.013)

203

Now the fifteen nearest neighbors:

4408(0.98)(30-50%)

233(0.98)(<=5%)

518(0.98)(<=5%)

500(0.97)(<=5%)

2911(0.97)(10-30%)

3007(0.97)(10-30%)

1823(0.97)(5-10%)

255(0.97)(<=5%)

1861(0.97)(5-10%)

838(0.97)(<=5%)

240(0.97)(<=5%)

5555(0.97)(50-70%)

833(0.97)(<=5%)

316(0.97)(<=5%)

4639(0.97)(30-50%)

(If you listened to all of those, my apologies. You just listened to 75 seconds of noise.)

Two interesting things here:

I’ve added the detector-assigned buckets by hand to those clip names. There’s a lot more variety in what the detector assigned to these neighbors than there was with the positive example’s neighbors. It would appear that the detector is assigning more humpback probability to these seemingly all-noise clips than their Perch embedding neighborhood might suggest. Perhaps Perch is grouping these clips by their overall noise profile similarities, while the detector is assigning a wider variety of buckets when it doesn’t have much signal to work with.
On close listening, two of the clips (2911 and 1823) are actually very faint positives. In Perch’s case, the embeddings are quite near our negative sample clip, and in the detector’s case, they got scores in the 10-30% and 5-10% buckets. So it seems as if neither approach is reliably separating those faint positives from the noise-only clips.

That makes the presence/absence question look harder than the sound-similarity question: Perch can group similar acoustic scenes, but faint vocalizations may not be enough to move a clip out of a noisy neighborhood.
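One rough way to put a number on "more variety" is to compare how many detector buckets each neighborhood touches. The sketch below uses the bucket labels from the two neighbor lists in this post; the ordering of `BUCKET_ORDER` and the helper name `bucket_spread` are assumptions made for illustration.

```python
# Detector buckets from lowest to highest score, as labeled in this post.
BUCKET_ORDER = ["<=5%", "5-10%", "10-30%", "30-50%", "50-70%", "70-90%", ">=90%"]


def bucket_spread(buckets):
    """Return (number of distinct buckets, steps between lowest and highest)."""
    ranks = [BUCKET_ORDER.index(b) for b in buckets]
    return len(set(ranks)), max(ranks) - min(ranks)


# Top-15 neighbors of the positive query clip (7925): all in the top bucket.
positive_neighbors = [">=90%"] * 15

# Top-15 neighbors of the negative query clip (203), in the order listed above.
negative_neighbors = [
    "30-50%", "<=5%", "<=5%", "<=5%", "10-30%",
    "10-30%", "5-10%", "<=5%", "5-10%", "<=5%",
    "<=5%", "50-70%", "<=5%", "<=5%", "30-50%",
]

print("positive:", bucket_spread(positive_neighbors))
print("negative:", bucket_spread(negative_neighbors))
```

This only describes bucket labels, not the underlying scores, so it's a coarse summary at best.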

For the next post, I’ll try these with lower-confidence positive and negative clips.

And very similar. In listening to these, I occasionally had to check filenames to make sure I wasn’t actually listening to the same clip over and over. ↩︎

Perch and MBARI Clips: Signal and Noise

Wed, 20 May 2026 14:22:29 -0700

In a previous post, I was checking to see if Perch 2.0 places similar humpback vocalizations near each other in its embedding space. “Bloop” sounds seemed to show up reliably in the 15 closest neighbors to a strong bloop, so it’s time to see if the embeddings perform similarly for another type of sound.

Here’s a fairly different sound. It has some tonal similarities to the bloop but the timbre is very different. It’s a sort of…tonal growl?

7532

It doesn’t seem to be a terribly common humpback vocalization, which means this could be a particularly good test of whether related vocalizations show up near each other in the embedding space. I’ll match the bloop test and check the closest fifteen:

7754(0.73)

7618(0.73)

7531(0.71)

7715(0.71)

8145(0.70)

7826(0.70)

7823(0.70)

7829(0.69)

7653(0.68)

8183(0.68)

7631(0.68)

7565(0.68)

7720(0.68)

7725(0.68)

8146(0.68)

It’s worth noting that only one of these (7531) is close in time to the original: it’s the previous five seconds, actually, which makes it more than likely it’s the same whale making the same sound. It’s also interesting that (and this is not obvious from the data shown here) all of these clips are in the >90% humpback-presence confidence bucket from the humpback detector model.

Overall, those are pretty similar. Some are tonally similar, some are more similar in timbre, and many of them are similar in both.

Let’s see how good the matches are for a less-definitive clip (at least according to the humpback detector model). Here’s a descending humpback vocalization that’s much weaker compared to the noise. The detector model put this clip in the 70-90% positive bucket:

7255

And the closest fifteen:

7012(0.88)

7153(0.87)

7125(0.87)

7346(0.87)

7269(0.87)

5665(0.86)

7172(0.86)

4479(0.86)

5846(0.86)

7351(0.86)

6769(0.86)

4241(0.86)

5743(0.86)

5744(0.86)

4452(0.86)

Now this is interesting.

The neighbors all sound like similar whale vocalizations, but the neighbors also all share the weak-signal/high-noise profile of the original. Not a clear-sounding vocalization among them. The embeddings seem to be capturing relationships between the entire scenes, not just the calls themselves.

Also, none of the fifteen neighbors is in the >90% detector bucket.¹ Eight of them are in the 70-90% bucket (7012, 7153, 7125, 7346, 7269, 7172, 7351, 6769), four are in the 50-70% bucket (5665, 5846, 5743, 5744), and three of them are in the 30-50% bucket (4479, 4241, 4452). So while the humpback detector scored those last three clips below a 50% probability of containing a humpback vocalization, the Perch embeddings seem to reflect both the presence of a humpback vocalization and the specific type of vocalization.

It seems notable that the neighbors are all matches for the particular type of humpback vocalization and prominent background noise. No high-signal clips in the group.

So Perch may be grouping these clips by vocalization type, but also by overall sound profile.

Again, that information isn’t present in the data shown here, so you’re not simply missing it. ↩︎

Perch and MBARI clips: A Quick Reset

Tue, 19 May 2026 10:20:59 -0700

Four posts into this (one, two, three, four), it’s probably time to back up for a moment and recap.

Questions

I’ve been looking at two different questions:

Given raw MBARI hydrophone clips, can Perch 2.0:

Distinguish between clips where a humpback vocalization is present vs not?
Distinguish between different types of humpback vocalizations?

Data

Using three different groups of clips:¹

High-confidence detector-score clips (as estimated by the Google humpback detector model with scores of >=90% and <=5%)
Lower-confidence detector-score clips (scores of 70–90% and 10–30%)
A large (~9000) stratified sample across detector-score buckets

Assessment

And checking the results in two ways:

2D dimensionality reduction plots
Manual listening to embedding nearest neighbors

Which ends up being twelve different potential experiments:

Present or Not

	PCA/UMAP	Nearest-neighbor listening
High-confidence clips	Visually yes	later?
Lower-confidence clips	Not clearly	Not really
Stratified sample	later?	next?

Types of vocalizations

	PCA/UMAP	Nearest-neighbor listening
High-confidence clips	–	–
Lower-confidence clips	–	–
Stratified sample	later?	One tentative yes (bloops only so far)

This is all still quite exploratory, but the matrix will keep me from treating different tests as if they answered the same question.

I don’t think I need to fill in all twelve boxes just for completeness, but the “next?” box looks like the best next step and I’ll keep the boxes marked “later?” in mind.

All clips are from the MBARI December 21, 2016 16 kHz full-day audio file. ↩︎

Perch and MBARI Clips: Bloops

Mon, 18 May 2026 16:40:30 -0700

In the last post, it started to look like Perch 2.0 doesn’t reveal much whale/non-whale structure in its embeddings, at least for MBARI hydrophone clips that are less definitively humpback.

But a quick listen to some of the positive clips raises an important point: humpbacks make a lot of different types of sounds.¹ It’s possible that the embeddings are tied more to the qualities of the individual calls than to the presence or absence of a call at all.

So it would make sense to do some more by-ear listening to some nearest-neighbor groups and see if they share any common audible characteristics.

I’m still using the older humpback-specific detector model to classify clips into various confidence-level buckets (0-10% for strongly no-whale, 90-100% for strongly whale, and so on).

For this listening I’ll take some obviously (both to the detector and me) humpback-containing audio clips with different kinds of humpback sounds, and compare them to their nearest neighbors in the Perch embeddings.

For the last tests, I used 200 clips as the mini-corpus for producing similarity scores, split evenly between likely-positive and likely-negative. For this one, I’m going to use significantly more, since I’m looking for finer-grained distinctions. I’ve pulled out ~9000 clips from 12/21/16, spread fairly evenly across buckets.
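For reference, here is one way a sample spread across buckets could be drawn. It is a sketch under assumptions, not the actual selection code: `stratified_sample`, `per_bucket`, and the `clip_buckets` mapping are all hypothetical names, and the real sample may have been chosen differently.

```python
import random
from collections import defaultdict


def stratified_sample(clip_buckets, per_bucket, seed=0):
    """Pick up to per_bucket clip ids from each detector-score bucket.

    clip_buckets maps a clip id to its bucket label.
    """
    by_bucket = defaultdict(list)
    for clip_id, bucket in clip_buckets.items():
        by_bucket[bucket].append(clip_id)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for bucket, clip_ids in sorted(by_bucket.items()):
        # Small buckets contribute everything they have.
        chosen = rng.sample(clip_ids, min(per_bucket, len(clip_ids)))
        sample[bucket] = sorted(chosen)
    return sample
```

Capping each bucket at the same size keeps the very full low-score and high-score buckets from swamping the sparse middle ones.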

A quick digression: Let’s look at the actual clip distribution across buckets.

That’s pretty U-shaped. The detector for this day is often quite confident of the presence or absence of a whale sound.

Here’s a pretty distinct clip to start with. It’s a sort of bloop:²

7695

And the five closest clip embeddings by cosine similarity:

7697(0.95)

7693(0.93)

7696(0.88)

7759(0.88)

6107(0.86)

Critically here, the first three of the five are within 60 seconds of the original clip,³ which doesn’t make this a terribly useful comparison as-is. Those clips could very well be the same whale making the same sounds over that particular minute.
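A cosine nearest-neighbor lookup that can also skip temporally adjacent clips might look something like the sketch below. The function name, the `start_times` argument (clip start times in seconds), and the 60-second gap are assumptions for illustration; it also assumes no embedding is an all-zero vector.

```python
import numpy as np


def nearest_neighbors(embeddings, query_index, k=15,
                      start_times=None, min_gap_seconds=0.0):
    """Return [(index, cosine_similarity), ...] for the k closest clips.

    If start_times is given, clips starting within min_gap_seconds of the
    query clip are skipped, so temporally adjacent clips cannot fill the list.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)  # most similar first

    results = []
    for idx in order:
        if idx == query_index:
            continue  # the query is trivially its own best match
        if start_times is not None:
            gap = abs(start_times[idx] - start_times[query_index])
            if gap <= min_gap_seconds:
                continue
        results.append((int(idx), float(sims[idx])))
        if len(results) == k:
            break
    return results
```

Calling it with `min_gap_seconds=60` would drop the back-to-back clips and leave room for matches from other parts of the day.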

So let’s get the next ten closest embeddings:

7939(0.84)

6108(0.84)

8271(0.83)

7694(0.83)

8200(0.82)

7803(0.82)

6239(0.80)

7695(0.79)

7881(0.79)

8091(0.79)

That’s a more varied set of clips, from different times of the day.

And every single one of them has a bloop or two.

Perch embeddings seem to be mapping distinct bloop clips close by in the embedding space.

It doesn’t appear that the field has really settled on a taxonomy for whale vocalizations yet. I’ll just try to describe them as I go. ↩︎
It’s still notable that in many years of 24/7 MBARI hydrophone recordings, very few clips have signals as clear as those from December 21, 2016. It’s quite the day: a bit of an open-hydrophone night at the cafe tonight. ↩︎
The first clue is the clip numbers are almost in sequence. The second, more definitive clue is that the filenames have the times in them, although the current audio embedding mechanism I’m using here doesn’t display them, so you’d have to look at the HTML source to know that. ↩︎

Perch and MBARI Clips: Listening to the Neighbors

Fri, 15 May 2026 14:40:42 -0700

In the last post, I was checking whether Perch embeddings of clips with less-confident humpback detector scores formed easy-to-differentiate PCA clusters. While there was some obvious grouping, the embeddings didn’t appear to provide clear visual clusters of whale and non-whale clips.

But maybe we just needed more data. That last plot used just 20 clip embeddings. So let’s plot 100 likely-positive and 100 likely-negative ones instead.

Here’s the PCA plot:

That’s showing much less obvious separation than I expected. Before trying some manual nearest-neighbor testing, I’ll drop the embeddings into a UMAP plot (cosine similarity, n_neighbors=15) to see if it can find any obvious structure:

Well, yes. And no. It’s definitely showing a bit more organization in the embedding space, but there are some interesting overlaps.

Simple low-dimensional projections may not be the best tool for assessing the ability of Perch embeddings to distinguish between whale and non-whale. Let’s listen to examples from a few apparent groupings and their nearest neighbors in embedding space using cosine similarity.

We’ll start with something that looks to be in a fairly well-defined positive group, in the lower center of that UMAP plot. Here’s clip 130, which the humpback detector’s model put in the higher-confidence bucket:

130

That’s a positive, if a bit noisy. Now let’s compute the nearest neighbors of clip 130, which turn out to be 123, 110, 12, 6, and 5. It’s notable that those last three are nowhere near 130 in the UMAP plot, and only reasonably close by in the PCA plot. The implication here is that we’re losing a significant amount of information in the 2D projections.

It’s also worth noting that the first two closest neighbors are in the humpback detector’s higher-confidence bucket, but the last three are in the lower-confidence one.
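The information loss in a 2D projection can be estimated by checking how many of each clip's true nearest neighbors are still its nearest neighbors on the plot. This is a rough sketch: it assumes cosine similarity in the full embedding space and plain Euclidean distance between plotted points, and all function names here are hypothetical.

```python
import numpy as np


def cosine_top_k(vectors, k):
    """Indices of each row's k nearest rows by cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a clip is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]


def euclidean_top_k(points, k):
    """Indices of each row's k nearest rows by straight-line distance."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :k]


def neighbor_overlap(embeddings, points_2d, k=5):
    """Mean fraction of full-space neighbors that stay neighbors on the plot."""
    emb = np.asarray(embeddings, dtype=float)
    pts = np.asarray(points_2d, dtype=float)
    high = cosine_top_k(emb, k)
    low = euclidean_top_k(pts, k)
    kept = [len(set(h.tolist()) & set(l.tolist())) / k for h, l in zip(high, low)]
    return float(np.mean(kept))
```

A value near 1.0 would mean the plot preserves local neighborhoods well; a low value would back up the impression that clips like 12 and 6 only look far from 130 because of the projection.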

Here are the clips:

123(.74)

Easy positive.

110(.69)

Fairly easy positive.

12(.66)

There might be something in the last half-second of that clip, but I’m inclined to lean negative.

6(.65)

Definitely something in the last half-second of that one. Leaning positive.

15(.64)

This one sounds to me like a negative.

In this case, the closest neighbors to a noisy positive clip seem fairly evenly split between positives and negatives, without any immediately obvious whale-specific features.

Moving on to one in an apparent strong non-whale cluster in the UMAP plot: clip 73, at the top of the plot.

A definite negative, with a continuous, low, pulsing sound (a ship engine?)

Moving on to its nearest neighbors–these are all low-confidence clips from the detector: 72, 75, 69, 57, 30.

72(.96)

Very similar to clip 73.

75(.95)

Same as the last one.

69(.94)

Same overall sound profile as the last two, but there might be a positive in seconds 1 and 2.

57(.94)

If there’s something there, it’s not completely breaking through the noise.

30(.94)

Again, it’s not hard to hear something there, but overall I’d call this a negative.

A few interesting points here:

The cosine similarities of these embeddings are quite high.
The first three neighbor clips, along with the central clip, are all within the same 10-minute time period. Not unexpectedly, they share some obvious sound features.

At least for these raw, noisy, mid-confidence MBARI clips, in these tests, the Perch embeddings don’t seem to be organizing these clips into obvious whale/non-whale structures. They may instead be indexing on other meaningful acoustic structure.

Perch and MBARI Clips: Beyond the Easy Cases

Thu, 14 May 2026 13:25:42 -0700

Continuing on from the last post about Perch 2.0 embeddings from MBARI raw hydrophone data, it seems important to note that the PCA plot in that last post only looked at the extremes–the clips where the humpback detector model expressed strong positive confidence (>90%) or extreme negative confidence (<5%).

So let’s look at what happens when we take slightly less obvious examples. Let’s compare 70-90% vs 10-30% confidence clips. We’ll use the same day, December 21, 2016, because it has a lot of high-confidence examples.¹

The clips are embedded below.

Here’s the PCA plot of 10 detector-tagged likely whale and 10 likely not-whale:

This is a bit more interesting. Either a) the humpback detector is getting some classifications wrong, b) Perch embeddings can’t easily distinguish these less obvious clips, or c) the PCA axes of maximum variance don’t correspond well to what we’re looking for here.

Let’s start with the first, and manually verify the humpback detector’s classification:

10-30% likely humpback (“correct” here means not-humpback):

1. correct. Clearly dolphin.
2. correct. More dolphin.
3. correct. Background noise
4. incorrect. Sounds like a humpback at 0:02
5. correct. Indistinguishable noise
6. correct. Again, indistinguishable, although I think I might hear a humpback off in the distance
7. tossup. The last second of this clip sounds humpback-like.
8. incorrect. There are a couple faint humpback bloops there
9. correct. Noise
10. correct. There’s a tone, but it doesn’t sound whale-like

70-90% likely humpback (“correct” here means humpback):

11. correct. That’s a whale or two
12. correct. Humpback
13. correct. This is a close one, but the last second qualifies it
14. correct. Humpback
15. correct. Humpback
16. correct. Humpback roar
17. correct. Faint, but there
18. correct. Fainter, but there
19. correct. Humpback
20. correct. Humpback call in the first half-second
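Tallying the two lists above is simple arithmetic, sketched here with the verdicts copied by hand. Counting a tossup as half a point is an assumption about how to score it.

```python
# Manual verdicts from the two lists above, in clip order.
manual_checks = {
    "10-30%": ["correct", "correct", "correct", "incorrect", "correct",
               "correct", "tossup", "incorrect", "correct", "correct"],
    "70-90%": ["correct"] * 10,
}

# Assumed scoring: a tossup earns half a point.
POINTS = {"correct": 1.0, "tossup": 0.5, "incorrect": 0.0}

score = sum(POINTS[v] for verdicts in manual_checks.values() for v in verdicts)
total = sum(len(verdicts) for verdicts in manual_checks.values())

print(f"{score}/{total}")
```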

Let’s call it 17.5/20. Not as clean a result as with the >90% and <5% clips in the last post. But if we correct the plot with the manual label corrections:

Based on this new plot, and with knowledge of the ground truth as determined by my listening², we can see a few different groups:

The dolphins (1,2)
The definitive whale clips (11,12,14,15,16,19)
The less-obvious whale clips (4,8,13,17,18,20)³
The not-whale clips (3,5,6,7,9,10)

But the third and fourth groups (the less-obvious whale clips and the not-whale clips) really aren’t easily distinguishable in this representation of the Perch embeddings without ground-truth knowledge.

It seems that once we correct for the humpback detector model’s incorrect classifications, the clusters for less-obvious positive and negative examples don’t separate cleanly in the embedding space, but they certainly show tendencies.

At this point, I can either try a larger dataset and see if clearly-detectable clusters resolve, or a different visualization besides PCA.

If the embedding clusters aren’t apparently separable with any other methods, we might have to consider that the older, more specific humpback-detector model separates the positive and negative classes more cleanly than a simple Perch embedding visualization can.

More attempts to see what Perch is capable of in the next post.

Clips

The MBARI example humpback detector notebook uses this date, I would assume, because there are a lot of fantastically strong humpback calls that day. Most days that I’ve listened to don’t have anything near the strength of that day’s clips. Running the humpback detector model on June 21, 2016, for example, doesn’t find any 90% confidence clips (and only a few >70% ones). December 21 is simply full of them. ↩︎
Which, to be fair, is less an assessment of ground truth and more an assessment of my particular whale-call sensing capabilities. If you disagree about some of my decisions, let me know! ↩︎
(Although I’d say that 16 and 19 are not actually that close-call.) ↩︎

Creating MBARI clip embeddings with Perch 2.0

Wed, 13 May 2026 14:05:44 -0700

I’ve been fascinated with the MBARI hydrophone recording archive for a while now. It’s from a hydrophone installation in Monterey Bay that’s been recording almost 24/7 since 2015. It’s partly what inspired me to make AudioLoop–all that data deserves to be made into some useful labeled datasets.

I’ve been curious to see how well Perch 2.0 does at extracting useful embeddings from these recordings. A Google DeepMind paper claims that “despite having almost no marine training data,” Perch 2.0 performs well at marine species classification.

I thought I’d try that with some raw¹ MBARI recordings. I used an older Google/NOAA humpback detector model and pulled out 5-second clips from a particular MBARI 24-hour audio file (December 21, 2016 in this case), then bucketed them into various humpback-detection confidence levels based on the model’s reported confidence level.
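A minimal sketch of that bucketing step, using the bucket labels that show up elsewhere in this series (<=5%, 5-10%, 10-30%, and so on). Which side of each edge is inclusive is an assumption, and `score_to_bucket` is a hypothetical helper, not part of the detector model.

```python
# Upper edges for the lower buckets; the inclusive side of each edge is an
# assumption made for this sketch.
BUCKETS = [
    (0.05, "<=5%"),
    (0.10, "5-10%"),
    (0.30, "10-30%"),
    (0.50, "30-50%"),
    (0.70, "50-70%"),
]


def score_to_bucket(score: float) -> str:
    """Map a peak detector score in [0, 1] to a confidence-bucket label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.90:
        return ">=90%"
    for upper, label in BUCKETS:
        if score <= upper:
            return label
    return "70-90%"
```

For example, `score_to_bucket(0.996)` lands in the top bucket and `score_to_bucket(0.013)` in the bottom one.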

Then I fed some clips into Perch 2.0 to create embeddings. For the first round, I stuck to the >90% (likely humpback) and <5% (likely not-humpback) buckets as a first-pass sanity check. I then took ten clips from each of those buckets, sent them through Perch, and made a PCA plot of the Perch embeddings.

And we get this:

Which shows that, even given MBARI clips with a high noise floor, Perch seems to separate whale and not-whale pretty well². You can hear the clips below.³

What’s more interesting is that the non-humpback clips seem to be split into two clusters, and that they aren’t just PCA phantom clusters that fall apart on closer inspection. Clips 3-8 are only background noise (the ocean is noisy), but the slightly-separated group of clips (1, 2, 9, 10) all have dolphin vocalizations! Clips 1 and 2 have prominent vocalizations, 9 and 10 much quieter ones.

So just using raw, unpreprocessed, noisy clips from the MBARI hydrophone, Perch 2.0 appears to separate noise-only, humpback, and dolphin sounds into their own embedding clusters.
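For anyone who wants to try the same kind of plot, a 2D PCA projection needs nothing more than NumPy. This is a sketch, not the plotting code behind the figure above: `pca_2d` is a hypothetical helper, and the random array below is only a stand-in for real Perch embeddings (its width of 64 is arbitrary).

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Project row-vector embeddings onto their first two principal components.

    Returns (points, explained): points has shape (n_clips, 2) and explained
    holds the fraction of total variance captured by each of the two axes.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    points = centered @ vt[:2].T
    variance = singular_values ** 2
    explained = variance[:2] / variance.sum()
    return points, explained


# Stand-in data: 20 random vectors in place of real clip embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 64))
points, explained = pca_2d(fake_embeddings)
print(points.shape, explained)
```

The `explained` values are worth a glance before reading too much into a plot: if the two axes capture only a small share of the variance, apparent clusters (or the lack of them) may not reflect the full embedding space.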

Clips

The only preprocessing applied was removing the DC offset in the originals. ↩︎
It’s important to note here that this plot is technically not checking Perch against ground-truth whale call presence, but against labels inferred from the results of the previous humpback detector model. In this very limited set of clips, you can verify the ground truth yourself by listening to them. But at the very least Perch’s embeddings are expressing some of the same inherent audio features as the humpback-specific model. ↩︎
These clips have been scaled up to a reasonable listening level. ↩︎
