Day in and day out, we hear stats about the ever-declining marine population, with headlines going "The number of salmon in this area has dropped to a few thousand..." and so on and so forth. But have you ever wondered how we actually get these numbers? Or is it just some random figure thrown out of pure speculation? Maybe we even have some of Nemo's friends working for us, keeping track of all the marine population! Who knows?
Image source: https://d23.com/
Well, being a Disney fan, I couldn't love that idea more, but that is NOT what happens. We don't want to disturb our aquatic mates by asking them to assemble and give a count at our beck and call, right? So we do something called Underwater Acoustic Surveys. Essentially, these surveys use sound, or echoes, of various frequencies which bounce off of our aqua buddies and give a rough idea of their numbers. Yeah, you guessed it, it's closer to SONAR than it is to spying.
But hold on, the ocean is not just packed with these mighty, or in most cases tiny, creatures, right? It has a plethora of other things: plants, underwater rocks, the waste that we dump into it every year, wreckage of various kinds and what not. So how do we go about distinguishing Nemo from Noise? Well, first off, the distinction between the citizens of aquaworld and the structures down there is pretty clear, because the two reflect sound in starkly different manners. What is tougher is telling one organism from another, or even one organism from the bubbles it produces while moving. For instance, it's quite difficult to tell apart jellyfish from juvenile salmon on an echogram.
Now, traditionally, to identify these species on an echogram, we have been looking at the "geometry" of their aggregations, their position in the water column, or the energy associated with the signal. And safe to say, these methods have given us decent results, but what they lack is the ability to distinguish bands of noise from the actual targets. Also note that they identify "aggregations" and not individual organisms, so they fail to give us precise information about individuals. So in this piece, we'll be examining a technique that was proposed in this paper as a way of analyzing echograms for this specific purpose.
Breaking down the Jargon for you...
If you accidentally clicked on the link above, you would've been led to a page that says "Detecting Underwater Discrete Scatterers in Echograms with Deep Learning-Based Semantic Segmentation", which is quite a mouthful. Dear oh dear, the curse of academia! Anyways, bear with me and it'll be demystified in no time.
Alright, so echograms are the "sketches" you obtain when sound waves of various frequencies are scattered by the target (here, marine organisms). Now, as examined above, traditional methods only detect the presence of aggregations like schools of fish or cloud-like clusters of zooplankton, but here the researchers are aiming at "discrete scatterers", or individual organisms. And they propose to do this using a technique that is already quite popular (hell, self-driving cars are using it): a technique for detecting and identifying all the entities present in an image, i.e. "Deep Learning-Based Semantic Segmentation". The picture below gives a glimpse of what it does.
Image credits: Source
But keep in mind that what we receive are single-frequency reflected waves from the target object, and to apply semantic segmentation, we need the data in some visual form. In the dataset used in the paper, the scattered signal strength ranged from -125 dB to 0 dB.....
On a tangent: it was the first time I'd seen negative intensity values for sound, so let me break that down. Since the intensity is measured in relative terms here, -125 dB essentially means that if the transmitted signal had an intensity of x dB, the reflected signal's intensity was x - 125 dB.
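For a sense of scale: in decibels, an intensity ratio is 10^(dB/10), so -125 dB means the echo carries about 10^(-12.5), roughly 3 × 10^-13, times the intensity of the transmitted signal. In other words, an almost vanishingly faint whisper of the original ping.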
.....hence these backscattering values are converted to RGB (Red, Green, Blue) integers using specific color maps.
Backscattering visualized. Image credits: source
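To make that conversion concrete, here's a rough sketch of how backscatter values in dB could be turned into an RGB image. The fake data, the -125 to 0 dB range from above, and the 'viridis' colormap are just stand-ins; the paper used its own specific color maps.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fake echogram: rows = depth bins, columns = pings, values = backscatter in dB.
# (Random data here; a real survey would give you a matrix of measured values.)
sv_db = np.random.uniform(-125.0, 0.0, size=(256, 512))

# Normalize the dB values from [-125, 0] to [0, 1] so a colormap can be applied.
sv_norm = (sv_db - (-125.0)) / 125.0

# Map each normalized value to an RGB triplet (0-255) using a colormap.
cmap = plt.get_cmap("viridis")
rgb = (cmap(sv_norm)[..., :3] * 255).astype(np.uint8)

print(rgb.shape)  # (256, 512, 3) -> an image we can feed to a segmentation model
```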
Now, the various labels that we see in the echograms are crucial, as they will be fed to our deep learning model. And bear in mind that these annotations were done either manually or with the help of commercial software (e.g. Echoview), while taking the following biological cues into consideration:
The target organisms can be small, with a relatively weak ability to reflect the signals.
They can appear at any depth in the water column.
The way they reflect the signal can be very similar to air bubbles, which can typically be seen forming cloud-like structures at the surface or above a school.
Schools of herring and salmon present vertically elongated shapes and have relatively stronger scattering effects.
The residual echo from other scatterers can be seen as horizontal bands of noise.
Now let's dive a little deeper into this process of annotating our echograms here.
So the RGB images are converted to grayscale, and then, in order to better segment the patterns, we "threshold" the image with a manually chosen value. To understand "thresholding", have a look at this image:
Image credits: source
So what has happened here is that we picked a cutoff value, say 0.29, and then whatever pixels were below that value in the image were classified as black, while the rest were classified as white. That is what thresholding means.
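In code, thresholding really is that simple. Here's a tiny sketch with NumPy, using the 0.29 cutoff from the example above on a made-up grayscale echogram:

```python
import numpy as np

gray = np.random.rand(256, 512)   # stand-in for the grayscale echogram, values in [0, 1]
threshold = 0.29                  # manually chosen cutoff

# Pixels below the threshold become black (0), everything else white (1).
binary = (gray >= threshold).astype(np.uint8)
```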
Okay, now imagine that you were looking to label jellyfish and you knew that they fall within a certain size range. Of all the segments we obtained, we keep only those that fall in that size range and discard the rest (there's a small code sketch of this a little further down).
The final labels are obtained by removing any noise that remains, and also by entirely cropping the top portion of the echogram to remove the water-air interface.
Image credits: source
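Continuing from the thresholded mask above, the size filtering and the cropping can be sketched like this. Again, just an illustration: the size limits, the crop depth and the scipy-based approach are my own stand-ins, not the paper's actual values or code.

```python
import numpy as np
from scipy import ndimage

# 'binary' is the thresholded echogram mask from the previous snippet.
binary = (np.random.rand(256, 512) >= 0.29).astype(np.uint8)

# Label each connected blob of white pixels.
labels, n_blobs = ndimage.label(binary)

# Keep only blobs whose pixel area falls within the expected size range
# for the target organism (e.g. jellyfish); these limits are made up.
min_area, max_area = 20, 400
areas = ndimage.sum(binary, labels, index=np.arange(1, n_blobs + 1))
keep = [i + 1 for i, a in enumerate(areas) if min_area <= a <= max_area]
mask = np.isin(labels, keep).astype(np.uint8)

# Crop off the top rows to discard the water-air interface (depth is arbitrary here).
mask[:10, :] = 0
```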
Into the Semantics of it all
Alright, so we have our data ready. Now, the paper that this article is based on tried out several semantic segmentation models on the dataset, with some additional tricks. Based on the image we looked at earlier about semantic segmentation, one can more or less get an idea of what these models do: they make pixel-level classifications to create meaningful segments within an image. These segments can represent anything: cars, trees, pedestrians, or, in our case, marine organisms.
Although diving too deep here would mean being crushed under the weight of an ocean of useless jargon, the whole process can be summed up in three simple steps (with a rough code sketch right after them):
The image is taken, and a series of "operations" (mostly convolutions) is performed to reduce the spatial information while increasing the feature information, i.e. the specifics about the shape and size of the target organisms.
Now, handling too many features is a headache, so we trim them down.
And then, once the features and insights are "learned", the model tries to reconstruct the data by combining the bits of spatial and feature information that were dropped earlier.
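If you are curious what such a model looks like in code, here is a heavily stripped-down encoder-decoder written in PyTorch. It is only a toy illustration of the three steps above (shrink the image, compress the features, rebuild a full-resolution segmentation map), not one of the architectures actually benchmarked in the paper:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder: image in, per-pixel class scores out."""
    def __init__(self, n_classes=3):
        super().__init__()
        # Step 1: convolutions + pooling shrink the image while growing feature channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Step 2: a bottleneck keeps only the most useful features.
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # Step 3: upsampling reconstructs a full-resolution map, one score per class per pixel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

model = TinySegNet(n_classes=3)              # e.g. background / jellyfish / fish school
scores = model(torch.randn(1, 3, 128, 128))  # fake RGB echogram patch
prediction = scores.argmax(dim=1)            # class label for every pixel
print(prediction.shape)                      # torch.Size([1, 128, 128])
```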
And voilà! We have a segmented version of the input image. Now, there is one pressing question here that needs to be answered: if the model can learn whatever signals are present in the echogram, won't that mean it would also learn the underlying noise that remained in the data, and then go on to produce erroneous predictions? Well, we have news for you, dear reader:
These models extract patterns in the underlying data, that's the essence of them 'learning'! And since noise is random and doesn't really have a set pattern associated with it, it is not learned.
Also, when targeting a specific organism (say, jellyfish), the presence of other organisms (like salmon) is rarely an issue, because their numbers are very low in the dataset, and hence they are not confused with the jellyfish patterns.
Judgement Day
Finally, it all comes down to actually verifying how our models performed. But we need a metric for that, right? A set rule which we can use to assess their capabilities. For this scenario, the authors have chosen Precision and Recall. But don't you slam the proverbial door on me yet; let us get a bird's eye view of what they represent.
Precision quite literally means: out of all the specks that the model labelled as Jellyfish, how many actually were Jellyfish. So basically:

Precision = (How many I correctly labelled Jellyfish) / (How many I correctly labelled Jellyfish + How many I incorrectly labelled Jellyfish)
Recall, on the other hand, means: out of all the specks that actually were Jellyfish, how many did the model label correctly. So:

Recall = (How many I correctly labelled Jellyfish) / (How many I correctly labelled Jellyfish + How many I incorrectly said were not Jellyfish)
Let that sink in, it takes a 'lil effort, but that's all there is to it. So what do WE want for OUR models, you ask? Well, if you are looking for the ability to very accurately say that something is a Jellyfish or a salmon or an air bubble, you are looking for high precision.
While if you want the ability to get hold of as many Jellyfish as possible, even if a salmon or a fish school sneaks into the bunch, you're looking for high recall.
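If you'd like to see those two numbers fall out of actual pixels, here's a minimal sketch that computes precision and recall from a predicted mask and a ground-truth mask. It assumes a simple binary jellyfish vs. not-jellyfish setup with made-up data, not the paper's actual evaluation code.

```python
import numpy as np

# 1 = "this pixel is a jellyfish", 0 = "it is not"; random stand-ins here.
truth = np.random.randint(0, 2, size=(256, 512))
predicted = np.random.randint(0, 2, size=(256, 512))

true_positives = np.sum((predicted == 1) & (truth == 1))   # correctly labelled jellyfish
false_positives = np.sum((predicted == 1) & (truth == 0))  # incorrectly labelled jellyfish
false_negatives = np.sum((predicted == 0) & (truth == 1))  # jellyfish the model missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```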
And the best thing about the study is that they found both. Now, keep in mind that precision and recall usually trade off against each other; pushing one up tends to drag the other down. What I mean here is that, since the researchers tried out various models, some of them had a high enough precision and some of them had a decent recall to boast of.
In the end, the models significantly outperformed the annotations, and they mark a meaningful step towards improving both the research value of echogram data and the speed of annotating it, so that biologists can focus on the patterns that matter more, and make sure that Nemo doesn't need rescuing from a dentist's office aquarium. AGAIN!!