In a previous post, I was checking to see if Perch 2.0 places similar humpback vocalizations near each other in its embedding space. “Bloop” sounds seemed to show up reliably in the 15 closest neighbors to a strong bloop, so it’s time to see if the embeddings perform similarly for another type of sound.

Here’s a fairly different sound. It has some tonal similarities to the bloop but the timbre is very different. It’s a sort of…tonal growl?

7532

It doesn’t seem to be a terribly common humpback vocalization, which means this could be a particularly good test of whether related vocalizations show up near each other in the embedding space. I’ll match the bloop test and check the closest fifteen:

7754(0.73)
7618(0.73)
7531(0.71)
7715(0.71)
8145(0.70)
7826(0.70)
7823(0.70)
7829(0.69)
7653(0.68)
8183(0.68)
7631(0.68)
7565(0.68)
7720(0.68)
7725(0.68)
8146(0.68)

It’s worth noting that only one of these (7531) is close in time to the original: it’s the previous five seconds, actually, which makes it more than likely it’s the same whale making the same sound. It’s also interesting that (and this is not obvious from the data shown here) all of these clips are in the >90% humpback-presence confidence bucket from the humpback detector model.

Overall, those neighbors sound pretty similar to the original. Some are tonally similar, some are more similar in timbre, and many of them are similar in both.
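For anyone who wants to try a lookup like this, here's a minimal sketch of a closest-fifteen search. It assumes the scores in parentheses above are cosine similarities between embedding vectors, which is a guess about the metric rather than something established here. The function name `nearest_neighbors` and the random `demo` array are placeholders, not real Perch output.

```python
import numpy as np


def nearest_neighbors(embeddings, query_index, k=15):
    """Return (clip_index, cosine_similarity) pairs for the k clips closest to the query."""
    # Scale every row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Similarity of every clip to the query clip.
    sims = unit @ unit[query_index]

    # Exclude the query itself so it doesn't show up as its own best match.
    sims[query_index] = -np.inf

    # Indices of the k highest similarities, best first.
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]


# Placeholder data: 100 random "clips" with 32-dimensional embeddings (assumption).
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 32))
neighbors = nearest_neighbors(demo, query_index=42, k=15)
```

With real embeddings, the row index would correspond to the clip number, and the returned pairs would read like the list above.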


Let’s see how good the matches are for a less-definitive clip (at least according to the humpback detector model). Here’s a descending humpback vocalization that’s much weaker relative to the noise. The detector model put this clip in the 70-90% positive bucket:

7255

And the closest fifteen:

7012(0.88)
7153(0.87)
7125(0.87)
7346(0.87)
7269(0.87)
5665(0.86)
7172(0.86)
4479(0.86)
5846(0.86)
7351(0.86)
6769(0.86)
4241(0.86)
5743(0.86)
5744(0.86)
4452(0.86)

Now this is interesting.

The neighbors all sound like similar whale vocalizations, but the neighbors also all share the weak-signal/high-noise profile of the original. Not a clear-sounding vocalization among them. The embeddings seem to be capturing relationships between the entire scenes, not just the calls themselves.

Also, none of the fifteen neighbors is in the >90% detector bucket.¹ Eight of them are in the 70-90% bucket (7012, 7153, 7125, 7346, 7269, 7172, 7351, 6769), four are in the 50-70% bucket (5665, 5846, 5743, 5744), and three of them are in the 30-50% bucket (4479, 4241, 4452). So while the humpback detector scored those last three clips below a 50% probability of containing a humpback vocalization, the Perch embeddings seem to reflect both the presence of a humpback vocalization and the specific type of vocalization.
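The tally above is easy to reproduce. In this small sketch, the bucket assignments are copied from the paragraph above, while the `detector_bucket` helper, its handling of edge values, and the `<30%` label are assumptions about how a detector probability might map onto these buckets.

```python
from collections import Counter


def detector_bucket(p):
    """Map a detector probability (0 to 1) to a bucket label. Edge handling is an assumption."""
    if p > 0.9:
        return ">90%"
    if p >= 0.7:
        return "70-90%"
    if p >= 0.5:
        return "50-70%"
    if p >= 0.3:
        return "30-50%"
    return "<30%"  # hypothetical label for everything below the listed buckets


# Bucket of each of the fifteen neighbors, as listed in the text.
neighbor_buckets = {
    7012: "70-90%", 7153: "70-90%", 7125: "70-90%", 7346: "70-90%",
    7269: "70-90%", 7172: "70-90%", 7351: "70-90%", 6769: "70-90%",
    5665: "50-70%", 5846: "50-70%", 5743: "50-70%", 5744: "50-70%",
    4479: "30-50%", 4241: "30-50%", 4452: "30-50%",
}

# Count neighbors per bucket.
tally = Counter(neighbor_buckets.values())

# Neighbors the detector scored below 50%.
below_half = sum(1 for b in neighbor_buckets.values() if b == "30-50%")

print(tally)
print(below_half)
```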

So Perch may be grouping these clips not only by vocalization type but also by overall sound profile: every neighbor matches both the particular type of humpback vocalization and the prominent background noise, with no high-signal clips in the group.


  1. Again, that information isn’t present in the data shown here, so you’re not simply missing it. ↩︎