In our previous post (which was wayyyy back, thanks to lazy me), we talked about how many options there are when it comes to picking the kind of model you'd want for your task. So naturally, noobs like you and me find ourselves in the dark.
But while working on a recent project I was faced with yet another dilemma: what kind of activation function should I choose for my model? Whoa whoa, hold up, we don't know what activation functions are, right?
Hmm... well, imagine it this way: offering candies and flowers every time your partner gets grumpy doesn't always work. Many situations call for good communication, which is more often than not awkward and twisted. Ahh... the word twisted... so basically the point of this is to tell you that neural networks, like other machine learning models, are meant to capture insights from real-world data, and those relationships, as we just saw, aren't LINEAR all the time.
So so so?? Simple machine learning models perform these "magical" mathematical calculations (nah! more like "middle school" math) that look something like this (an actual equation, written in fancy notation):

output = w1·x1 + w2·x2 + ... + wn·xn + b
Yep... you guessed it, they are a lot like equations of lines, hence the term LINEAR. So if we only stacked layers of these linear equations, our model would just spit out one big equation of a line and utterly fail to capture any real-world insight/relationship in our data. Hence we need to introduce a lil' twist in their outputs, and this can be done through activation functions! These essentially help models handle complex data better.
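Don't just take my word for it. Here's a tiny NumPy sketch (the layer sizes and weights are made up purely for illustration) showing that stacking two linear layers with no activation in between collapses into one big linear equation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up "linear layers": 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Stacking the two layers without any activation function...
stacked = W2 @ (W1 @ x + b1) + b2

# ...is exactly the same as a single linear layer:
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
single = W_combined @ x + b_combined

print(np.allclose(stacked, single))  # True: stacking bought us nothing extra
```

Squeezing an activation function between the two layers is what breaks this collapse and lets the network bend and twist to fit the data.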
So I present to you some of the activation functions that I came across and had to choose from.
Sigmoid
Now, if by any chance I have made you curious enough to delve deeper into this topic, and you go look up the sigmoid function, you would find something that looks like.....
THIS!!!

σ(x) = 1 / (1 + e^(-x))
Which is very intimidating, right? I mean, first of all there is this spooky Greek symbol notation, then you have e to the power negative x, and that too in the denominator. And why would someone use this as an activation function?
Okay okay, there's one picture that can answer all your questions, and that is the graph of the sigmoid function. Let's have a look.
Image Credits: TowardsDataScience
So it's easy to see that no matter what the value of x is for the sigmoid function, the output always stays between 0 and 1. And this does a magical thing:
It keeps the mean of your outputs small, hovering around 0.5 instead of drifting off to huge values, which in turn speeds up learning.
Okay, tough to grasp? Think about it this way: since the sigmoid function squashes the outputs to lie between 0 and 1, the data is far less spread out, and it's easier for a machine learning model to get a hold of values in such a small range. So there's the sigmoid function.
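To make that concrete, here's a minimal sketch of the sigmoid function in NumPy (the input values are arbitrary, just to show the squashing):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))
# approx [0.00005  0.27  0.5  0.73  0.99995]  -> always between 0 and 1
```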
Next up, we have something that some of us do not like, or maybe even downright hate...
tanh
So we have the hyperbolic tangent in the picture now. Don't worry if you don't have a good grasp of hyperbolic trigonometry here, heck, even regular trigonometry isn't required. And again, let me first scare you with the mathematical formula, then I'll take you to the beautiful part of it.

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

I totally get it. Even when I saw the notation for the first time, I was in utter disbelief as to HOW on earth this would serve the role of an "activation function"!!
But again, the answer lies in the graph of it all...
Now, this looks a lot like the sigmoid graph above, right? But notice the output range: it lies not between 0 and 1 but between -1 and 1. Hmm... now HOW does THAT help??
Okay, since the tanh activation function squashes the output between -1 and 1, the mean of the activations is going to be even closer to 0. Hence the data is now even easier for the model to learn from. Soooo, tanh is usually a better activation function than sigmoid.
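A quick NumPy sketch (same arbitrary inputs as before) to see the zero-centred range for yourself:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

tanh_out = np.tanh(x)                     # squashed into (-1, 1), symmetric around 0
sigmoid_out = 1.0 / (1.0 + np.exp(-x))    # squashed into (0, 1)

print(tanh_out.mean())     # ~0.0  -> zero-centred
print(sigmoid_out.mean())  # ~0.5  -> pulled away from 0
```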
So one key takeaway from the above details on activation functions is:
If the activation function scales the data in such a way that it ends up with a mean closer to 0, it is usually better and helps speed up the learning.
But there's one domain where the sigmoid activation function can be the preferred choice.
Suppose you want to predict the probability of a particular image being that of a cat; it wouldn't make sense to interpret negative probabilities. So just use the sigmoid function at the output to map the result into the range 0-1, and whatever images are mapped to 0.5 or above can be taken as those of cats.
Other than that, in all the other layers and the other cases, tanh usually works better.
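As a toy illustration (the raw scores below are made-up numbers, not from a real model), this is roughly how that 0.5 cut-off could look:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up raw scores that a final layer might output for 4 images
raw_scores = np.array([-2.3, 0.1, 1.7, -0.4])

probs = sigmoid(raw_scores)   # every score becomes a value in (0, 1)
is_cat = probs >= 0.5         # threshold at 0.5

print(probs)    # approx [0.09, 0.52, 0.85, 0.40]
print(is_cat)   # [False  True  True False]
```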
Now let's do some justice to the title of the blog: "Not So Leaky"...
If you pay close attention, you will find that in the graphs of both the tanh and the sigmoid function, the curve gets flatter and flatter as we move towards very high or very low values. That essentially drives the slope to zero, which hurts learning once more.
How exactly?? Okay, hear me out:
So our machine learning models try to capture the features/insights in our data via parameters called weights, which are fine-tuned in such a way as to best capture those insights. But the fine-tuning depends on how the output changes with every change in the weights. So the flattening of the graph, i.e. the slope going to zero, means the output has pretty much stopped changing, which stalls our fine-tuning. And there's no guarantee that the weights have reached their optimal values at that point, so when the slope hits 0 we might just get stuck at that moment in training.
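Here's a tiny sketch (x values picked just for illustration) of how the sigmoid's slope shrivels away as x grows, which is exactly the "stuck" scenario described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(x):
    # Standard identity: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_slope(x))
# 0.0   0.25
# 2.0   ~0.105
# 5.0   ~0.0066
# 10.0  ~0.000045  <- the slope has practically vanished, so weight updates stall
```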
So what's the solution to this? Well, trust me, you might not believe it just because it is so simple. The answer to the above dilemma is:
ReLU
ReLU (cute, right? I know), or Rectified Linear Unit, for the win!! Now if you ignore the fancy name they have given to the function, you will realise that it is actually doing a really simple job: if x is a positive number it spits out x, otherwise it spits out 0.
That's it!
Now again, to understand how this helps us solve the learning problem, or the "vanishing gradient" problem as they call it, we must have a look at the function's graph:
So how is it solving the problem, you ask? See, the slope of this activation function is 1 and does not vanish as long as your values are positive, and in most neural networks there are enough positive outputs to at least keep the learning from stagnating. And note the magic here: all of this just because the derivative of that thingy above is not 0 for a whole region of inputs.
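A minimal sketch of ReLU and its slope (inputs made up for illustration):

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through unchanged, clamp negatives to 0."""
    return np.maximum(0.0, x)

def relu_slope(x):
    # Slope is 1 for positive inputs, 0 for negative inputs
    return (x > 0).astype(float)

x = np.array([-5.0, -1.0, 0.5, 3.0, 10.0])
print(relu(x))        # [ 0.   0.   0.5  3.  10. ]
print(relu_slope(x))  # [0. 0. 1. 1. 1.]  <- no vanishing slope on the positive side
```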
Still not satisfied?
Hmm... you might say: a neat escape, but what if my negative values are also crucial and might actually influence the learning? We'd be missing out on a lot, right?
Well, they have solved this problem too!! By tweaking the function just a bit so that the derivative (the slope of the graph) for negative values isn't 0... and that is...
Leaky ReLU
Now this slight tweak might not make much sense at first; hell, I was confused too. But as you know, we like visual depictions, so here's the graph. Take some time to grasp why it is the way it is...
So now, even for negative values of x, the derivative doesn't disappear completely, hence the learning isn't completely dead for negative outputs. This also highlights the ultimate fact that learning, when done regularly and in small steps, can take us a long way :)
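And a matching sketch of Leaky ReLU (the 0.01 "leak" below is a commonly used value, but it's just a small constant you can tune):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope alpha."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_slope(x, alpha=0.01):
    # Slope is 1 for positive inputs and alpha (not 0!) for negative inputs
    return np.where(x > 0, 1.0, alpha)

x = np.array([-5.0, -1.0, 0.5, 3.0])
print(leaky_relu(x))        # [-0.05 -0.01  0.5   3.  ]
print(leaky_relu_slope(x))  # [0.01 0.01 1.   1.  ]  <- the "leak" keeps gradients alive
```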
P.S. - For those who want to delve deeper into the world of Machine Learning, I would highly recommend Andrew Ng's course on Coursera.