Deep learning with neural networks is something I started Googling a while ago, but I could never find a good enough article that started with the foundations. If, like me, you kept finding articles online that compared NNs with brain synapses and the like, you may have found that they just added to the confusion. It wasn't until I recently completed the Deep Learning course by Andrew Ng on Coursera that things made a whole lot more sense. If you have the time and money for it, I highly recommend it.
In this post, I want to go over the beginnings of what I learned, which helped me gain at least an intuition for the major concepts in deep learning. That intuition greatly helped me follow meetups and other more in-depth articles online. I also have a slide deck that I've been using to explain these concepts to others, and I will be using key slides here. Let's focus on understanding these common things:
- What a neural network looks like using a small 2 layer NN
- Forward and Backward propagation
- Activation functions
- Cost function optimization using Gradient Descent
A 2 Layer NN
Here's a typical diagram all over the web of a 2 layer NN for logistic regression:
It's only 2 layers because we don't count the input layer. yhat is the predicted output. In logistic regression, we're going to get a value between 0 and 1.
The hidden layer is called a Dense layer or Fully Connected layer because every node in the previous layer connects to each one of the hidden layer's nodes. The hidden nodes are the hidden features that have an influence on the final outcome of the prediction.
How do we determine the number of nodes that should belong in the hidden layer, and how many layers there should be? These are called hyperparameters. They're parameters we just have to provide using knowledge, intuition and past experience. We keep adjusting them until we get a good result. More on that later.
So what does the NN do when it's "trained"? Each pass from one layer to the next does a computation with a matrix of weights. The NN's job is to find the values of the weights that give us an accurate prediction of yhat. The first step is to multiply the weights by the inputs and optionally add a bias value. You can think of the bias as shifting a straight line on a 2D plot up and down. Sometimes NNs are trained without biases.
We then apply an activation function to that value. More on that later also. First let's look at what that notation looks like:
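The slide isn't reproduced here, so here is a rough sketch of that notation for a single node. The symbols w, b, z, a and σ are my assumption, following the convention commonly used for this kind of network: z is the linear step and a is the value after the activation function.

```latex
z = w^{T} x + b, \qquad a = \sigma(z)
```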
In order to train this network, we need many training examples so that it can keep adjusting the weights until its accuracy is high enough. Once the accuracy is high enough, we can give it a brand new set of inputs to get the predicted value (yhat).
To efficiently do so, we can vectorize the inputs, weights and biases and then perform the computation that way:
- For matrix W (weights) in layer [l], its dimensions will be (# nodes in current layer, # nodes in the previous layer).
- For matrix b (biases), the dimensions would be (# nodes in current layer, 1).
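To make those dimensions concrete, here's a minimal NumPy sketch. The layer sizes (3 inputs, 4 hidden nodes, 1 output) and the names `n_x`, `n_h`, `n_y` are made up for illustration, and the small random initialization is just one common choice.

```python
import numpy as np

# Hypothetical layer sizes: 3 input features, 4 hidden nodes, 1 output node.
n_x, n_h, n_y = 3, 4, 1

rng = np.random.default_rng(0)

# Layer 1 (hidden): (# nodes in current layer, # nodes in previous layer)
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))

# Layer 2 (output): same rule, the previous layer is now the hidden layer
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

print(W1.shape, b1.shape)  # (4, 3) (4, 1)
print(W2.shape, b2.shape)  # (1, 4) (1, 1)
```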
Below is how we go through m training examples to calculate a prediction. The brackets (i) denote the i-th training example in the set.
This is how it typically works: a linear function, followed by a non-linear activation function, gives the value of each node in a layer. Don't worry about the sigma function in the activation functions, because we'll talk about that next.
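As a sketch of that linear-then-activation pattern over all m examples at once, here's what vectorized forward propagation could look like in NumPy. I'm assuming tanh for the hidden layer and sigmoid for the output; the function and variable names are mine, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """X has shape (n_x, m): one column per training example."""
    Z1 = W1 @ X + b1   # linear step, hidden layer
    A1 = np.tanh(Z1)   # non-linear activation (tanh is an assumption)
    Z2 = W2 @ A1 + b2  # linear step, output layer
    A2 = sigmoid(Z2)   # yhat: one value between 0 and 1 per example
    return A2

# Made-up data: 5 examples with 3 features each.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

yhat = forward(X, W1, b1, W2, b2)
print(yhat.shape)  # (1, 5)
```

Stacking the examples as columns of X is what lets one matrix multiplication replace a loop over the m examples.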
We apply a non-linear function each time because not everything in the world is a simple if x then y model. If all activation functions were linear, the layers would collapse into a single linear function of the inputs, so we might as well compute that directly instead of having multiple hidden layers.
The only time we might want a linear function as the activation function is in the output layer, when we want the final output to be an unbounded real value, as in a regression problem.
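Here's a small check of that claim, as a sketch with made-up random weights: two layers with a linear (identity) activation give exactly the same output as one layer with combined weights and bias.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with a linear (identity) activation...
two_layers = W2 @ (W1 @ X + b1) + b2

# ...are the same as one layer with combined weights and bias.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ X + b_combined

print(np.allclose(two_layers, one_layer))  # True
```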
In the case of logistic regression, we want our final output to be between 0 and 1, which is exactly what a sigmoid function gives us. We can then make a call: if yhat is greater than a threshold, e.g. 0.5, we say that's a 1 (e.g. cat), otherwise a 0 (not a cat) if we were classifying cats.
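A minimal sketch of making that call, with made-up input values and the 0.5 threshold from the example above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # made-up outputs of the linear step
yhat = sigmoid(z)                          # squashed into the range 0 to 1

threshold = 0.5
labels = (yhat > threshold).astype(int)    # 1 = cat, 0 = not a cat
print(labels)  # [0 0 0 1 1]
```

Note that sigmoid(0) is exactly 0.5, which is not greater than the threshold, so the middle value is classified as 0.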
Here are some other commonly used activation functions:
- sigmoid gives us a value between approx 0 and 1
- tanh is a rescaled and shifted version of sigmoid which gives us a value between approx. -1 and 1 and goes through the origin
- ReLU (Rectified Linear Unit) outputs max(0, z) and has been found to really help speed up learning, in part because its gradient does not flatten out for large positive inputs the way it does for sigmoid and tanh
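The three activation functions above can be sketched in NumPy as follows. The last line checks the relationship between tanh and sigmoid, tanh(z) = 2 * sigmoid(2z) - 1, which is a standard identity rather than something from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values between 0 and 1, with sigmoid(0) = 0.5
print(tanh(z))     # values between -1 and 1, with tanh(0) = 0
print(relu(z))     # [0. 0. 2.]

# tanh is sigmoid stretched to (-1, 1) and centered on the origin.
print(np.allclose(tanh(z), 2 * sigmoid(2 * z) - 1))  # True
```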
Cost Function - Logistic Regression
Let's take a look at the formulas and dissect them a little bit:
First, the Loss function is calculated each time we iterate through one training example. It's a measure of the "inconsistency between our predicted value (yhat) and the [actual value in our training] set (y)" (loss function). It's really composed of two parts:
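The slide with the formulas isn't reproduced here, so as a sketch, these are the standard logistic regression (cross-entropy) loss and cost, written to match the description that follows: two halves inside the loss with a negative in front, and a cost that averages the loss over the m training examples. The symbols for the loss and J are my assumption.

```latex
\mathcal{L}(\hat{y}, y) = -\bigl( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \bigr)

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)
```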
Scenario: y value is 1
Regardless of what we predicted in yhat, when y = 1, we only care about the first half of the equation. If you look at the second half and substitute y = 1, you get 0 * something. 0 multiplied by anything will always be 0, so we can ignore the second half.
Now, after running the sigmoid computation, we should get a value between 0 and 1. If we predicted close to 1, the loss is close to 0, because log(1) = 0. If we guessed right, there should be no loss. Makes sense. If we predicted close to 0 instead, the log of a small positive number is a large negative number, so we multiply it by the negative in front to get a large positive loss.
I think you get the idea, as similar logic applies to the other half of the equation when y = 0.
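To see both scenarios with actual numbers, here's a small sketch of the loss for a single example. The predictions 0.99 and 0.01 are made up for illustration.

```python
import numpy as np

def loss(yhat, y):
    """Logistic regression loss for one training example."""
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

print(loss(0.99, 1))  # good guess for y = 1: loss is close to 0 (about 0.01)
print(loss(0.01, 1))  # bad guess for y = 1: loss is large (about 4.6)
print(loss(0.01, 0))  # good guess for y = 0: loss is close to 0 again
```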
The cost function is the average of the Loss function over the entire training set.
The big sum sign basically tells us to add all the losses together over m training examples and then multiply by 1/m. Multiplying by 1/m is the same as dividing by m. So it's just a mean calculation.
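In code, that mean calculation could look like this sketch, using four made-up predictions and labels:

```python
import numpy as np

def cost(yhat, y):
    """Average of the per-example losses over m training examples."""
    m = y.shape[1]
    losses = -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
    return np.sum(losses) / m

# Made-up labels and predictions for m = 4 examples.
y = np.array([[1, 0, 1, 0]])
yhat = np.array([[0.9, 0.2, 0.6, 0.4]])

print(cost(yhat, y))
```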
What the NN wants to do is optimize its weights and biases so that this Cost function is as small as possible.
Forward and Backward Propagation
Forward Propagation is what we've been mostly focused on so far where we go through the network with a linear function followed by an activation function until we get a prediction.
The network actually learns through backward propagation, where we calculate the derivatives of the Cost with respect to the weights and biases so that we know how much to update them. Let's visually take a look at what that looks like step by step in the 2 layer NN so far:
In modern frameworks such as TensorFlow, we don't have to worry about the calculus for backward prop.
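If you're curious what the frameworks are doing for us, here's a hand-written sketch of backward propagation for the 2 layer NN in NumPy. It assumes a tanh hidden layer and a sigmoid output with the logistic regression cost; the `dW1`-style names follow a common convention and are not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(A2, Y):
    m = Y.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

def backward(X, Y, params):
    """Derivatives of the cost with respect to each weight and bias."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(X, params)
    dZ2 = A2 - Y                                  # sigmoid + logistic loss
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# Made-up data: 6 examples, 3 features, random 0/1 labels.
rng = np.random.default_rng(3)
X = rng.standard_normal((3, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)
params = [rng.standard_normal((4, 3)), np.zeros((4, 1)),
          rng.standard_normal((1, 4)), np.zeros((1, 1))]

dW1, db1, dW2, db2 = backward(X, Y, params)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)
```

Each gradient has the same shape as the parameter it belongs to, which is what makes the update step a simple element-wise subtraction.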
Backward Propagation with Gradient Descent
Gradient descent is a way of finding the minimum of a function. Because we're trying to optimize for the Cost to be as low as possible, we want to find the minimum Cost. We can see this and the formula for updating the weights and biases in the below slide:
Here we have another hyperparameter, alpha. This is the learning rate, and it's not uncommon to see it set to 0.001. The learning rate controls the size of each step we take as we try to reach the minimum. If the learning rate is too low, the NN takes a long time to reach the minimum. If it's too high, the steps taken (yellow arrow) might be too large, so we overshoot the minimum and may never reach it. With too large a learning rate, the network actually starts bouncing between low and high accuracy.
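To get a feel for the learning rate, here's a toy sketch that has nothing to do with a real network: it minimizes the made-up cost J(w) = w**2 (minimum at w = 0) with the usual update w = w - alpha * dJ/dw, using three different values of alpha.

```python
def gradient_descent(alpha, steps=50, w=5.0):
    """Minimize the toy cost J(w) = w**2, whose derivative is 2*w."""
    for _ in range(steps):
        dw = 2 * w          # derivative of the cost at the current w
        w = w - alpha * dw  # gradient descent update
    return w

print(gradient_descent(alpha=0.001))  # too low: still far from the minimum at 0
print(gradient_descent(alpha=0.1))    # reasonable: very close to 0
print(gradient_descent(alpha=1.1))    # too high: overshoots every step and blows up
```

With alpha=1.1 each step jumps past the minimum to the other side and lands further away than it started, which is the bouncing behavior described above.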
Once we've updated the weights and biases, we run through the network again and again until the accuracy is high enough. This is truly where the "learning" happens.
Running through the entire training set of m examples once is an epoch. The number of epochs we train the NN for is also a hyperparameter.
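Putting the pieces together, here's a sketch of a full training loop for the 2 layer NN in plain NumPy, where each pass of the loop is one epoch. Everything specific here is an assumption for illustration: the toy dataset, the tanh hidden layer, 4 hidden nodes, alpha = 0.5 and 500 epochs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_h=4, alpha=0.5, epochs=500, seed=0):
    """Batch gradient descent on a 2 layer NN (tanh hidden, sigmoid output)."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * 0.1
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((1, n_h)) * 0.1
    b2 = np.zeros((1, 1))
    costs = []
    for _ in range(epochs):  # one epoch = one pass over all m examples
        # Forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Cost (clipped so that log never sees exactly 0)
        A2c = np.clip(A2, 1e-12, 1 - 1e-12)
        costs.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))
        # Backward propagation
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # Gradient descent update
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2
    return (W1, b1, W2, b2), costs

# Toy data (made up): the label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(42)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)

(W1, b1, W2, b2), costs = train(X, Y)
yhat = sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2)
accuracy = float(np.mean((yhat > 0.5) == (Y == 1)))
print(costs[0], costs[-1], accuracy)  # cost at the first and last epoch, then accuracy
```

The number of hidden nodes, alpha and the number of epochs are all arguments to `train` because they are the hyperparameters we would keep adjusting.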
Congratulations on reading this far! Now you should have a sense of what a neural network consists of, some intuition for how learning works through forward and backward propagation, and an understanding of what an activation function really is and what the common ones are.
That's it for theory. Hopefully this gives you an intuition and some of the common terms to use for further research. At some point, I'll put up some code in TensorFlow on this exact network to demonstrate the concepts in action.
Please comment if you liked where this is going and if you would like to see more.