Stochastic Gradient Descent

Yuchenhua
Nov 21, 2020


Gradient Descent is the most important method we use to find the local minimum of a differentiable function. But it runs into trouble when we deal with a huge dataset, because gradient descent takes all the samples in the training set and performs a single parameter update per iteration to minimize the loss function. This is inefficient: to update the parameters once, we need to run through the whole dataset. Here comes the idea of Stochastic Gradient Descent, which behaves differently. In this post we talk about how stochastic gradient descent works and what the difference is between a normal gradient descent and a stochastic gradient descent algorithm.

Gradient Descent

Before getting into Stochastic Gradient Descent, let us take a look at basic gradient descent.

What is the objective of Gradient Descent? In plain terms, a gradient is the slope or slant of a surface. So gradient descent literally means descending a slope to reach the lowest point on that surface. In a two-dimensional graph, take a parabola as an example:

a parabolic function with two dimensions (x,y)

The lowest point on the parabola occurs at x = 1. The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum; "y" here is interpreted as the objective function.
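For a concrete instance, take $y = (x-1)^2$, one parabola whose minimum sits at $x = 1$. Setting the derivative to zero confirms the answer that gradient descent finds by repeatedly stepping downhill:

$\frac{dy}{dx} = 2(x - 1) = 0 \quad \Rightarrow \quad x = 1$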

In the standard gradient descent, the basic update rule is

$w_{j,i} \leftarrow w_{j,i} - \eta \frac{\partial L}{\partial w_{j,i}}$

where $w_{j,i}$ represents the weights, which are also the parameters we are tuning when we optimize neural networks, $L$ is the loss function we are trying to minimize, and $\eta$ is the learning rate. The term $\Delta w_{j,i} = -\eta \frac{\partial L}{\partial w_{j,i}}$ decides how we change each weight to decrease the loss.

It is evident that on every step the entire training set is used, which is why this variant is also called batch gradient descent. Such a calculation is expensive on a very large dataset, and it also has to store intermediate values in memory on each iteration. A graphical interpretation of gradient descent is as follows:

Illustration of gradient descent on a series of level sets (from Wikipedia)

To better illustrate how gradient descent finds a local minimum, we take the example of finding the local minimum of the function $f(x)=x^3-2x^2+2$. The code is shown below:
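A sketch of what that code might look like (the plotting range is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x: x**3 - 2*x**2 + 2   # the function we want to minimize
df = lambda x: 3*x**2 - 4*x       # its derivative

xs = np.linspace(-0.5, 2.5, 200)
plt.plot(xs, f(xs))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```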

From the plot above we can see the local minimum is around $x = 4/3 \approx 1.33$, but assume we do not know this, and choose a starting point of $x_0 = 2$.

By collecting all the points we pass through on the way to the local minimum, we can plot the trajectory gradient descent followed:
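A minimal descent loop for this function, reusing `f` and `df` from the sketch above; the step size and stopping tolerance are assumptions:

```python
x_old, x_new = 0.0, 2.0   # start at x_0 = 2
alpha = 0.01              # step size (assumed)
precision = 1e-6
trajectory = [x_new]
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - alpha * df(x_old)   # gradient descent update
    trajectory.append(x_new)
print(x_new)  # converges near x = 4/3

trajectory = np.array(trajectory)
plt.plot(xs, f(xs))
plt.plot(trajectory, f(trajectory), "ro--", markersize=3)
plt.show()
```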

Stochastic Gradient Descent

The word 'stochastic' refers to a system or process that involves random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole dataset for each iteration. In Gradient Descent, the term "batch" denotes the total number of samples from the dataset used to calculate the gradient for each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Using the whole dataset is really useful for getting to the minimum in a less noisy and less random manner, but the problem arises when our dataset gets big. Suppose you have a million samples in your dataset; if you use a typical Gradient Descent optimization technique, you have to use all one million samples to complete a single iteration, and this has to be done for every iteration until the minimum is reached. It therefore becomes computationally very expensive to perform.

This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample, i.e., a batch size of one, to perform each iteration. The dataset is randomly shuffled, and a sample is selected for each iteration.

In stochastic gradient descent, the algorithm uses a single (or a few) training examples to update the parameters. So unlike batch gradient descent, rather than waiting to sum up the gradient terms for all the training data points before making one change in the parameters, SGD computes the gradient term for just one data point and immediately starts making progress on the parameters. Thus, in SGD, the parameters change much faster. Because SGD is computationally less demanding than GD (we don't have to run through the whole training set to update the weights), it is a good choice for dealing with huge datasets.

The algorithm can be described as follows, where $m$ is the number of training samples and $\alpha$ is the learning rate:

For i in range(m):

$\theta_j = \theta_j -\alpha(\hat{y}^i -y^i) x_j^i$

In SGD, since only one sample from the dataset is chosen at random for each iteration, the path the algorithm takes to reach the minimum is usually noisier than that of the typical Gradient Descent algorithm. But that does not matter much: as long as we reach the minimum, the path taken does not matter, and SGD gets there with significantly shorter training time. For comparison, the path taken by Batch Gradient Descent is:

Path taken by gradient descent

Path taken by Stochastic Gradient Descent,

Path taken by Stochastic gradient descent

One thing to notice is that, because SGD is generally noisier than Gradient Descent, it usually takes a higher number of iterations to reach the minimum due to the randomness in its descent. Even so, it is still computationally much less expensive than Gradient Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.

Algorithm in Python

Pseudocode for SGD in Python is as follows.
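A minimal sketch of that pseudocode, assuming a fixed step size `alpha` and a modest number of `epochs`:

```python
import numpy as np

def sgd(X, y, theta, alpha=0.01, epochs=10):
    # alpha (step size) and epochs are assumed defaults
    m = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(m):        # shuffle each pass
            error = X[i] @ theta - y[i]           # error on one sample
            theta = theta - alpha * error * X[i]  # update from that sample
    return theta
```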

In neural networks, this cycle of computing the loss and adjusting the parameters based on its gradient is called back-propagation.

Why do we need Stochastic Gradient Descent?

Let's take a look at a real example using gradient descent first. Consider a simple linear regression where we want to see how temperature affects the noises made by crickets. We have a dataset of cricket chirp rates at various temperatures. First we load the dataset and plot it.
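A sketch of the loading-and-plotting step; the file name and column layout are assumptions, so point it at wherever your copy of the chirp data lives:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file with columns: temperature, chirps per second
data = np.loadtxt("chirps.csv", delimiter=",", skiprows=1)
x, y = data[:, 0], data[:, 1]

plt.scatter(x, y)
plt.xlabel("Temperature")
plt.ylabel("Chirps per second")
plt.show()
```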

The goal is to find the best line $h_{\theta}(x) = \theta_0 + \theta_1 x$ that fits our data points. The function we are trying to minimize here is:

$J(\theta_0,\theta_1) = \frac{1}{2m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)^2$

In this case, the gradient will be defined in two dimensions:

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$

Below, we set up our functions for h, J, and the gradient:
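One way those helpers might look in NumPy, mirroring the formulas above (this continues from the loading snippet):

```python
def h(theta0, theta1, x):
    # hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    # cost with the 1/(2m) convention from above
    m = len(x)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

def gradient(theta0, theta1, x, y):
    # partial derivatives of J with respect to theta0 and theta1
    m = len(x)
    err = h(theta0, theta1, x) - y
    return np.sum(err) / m, np.sum(err * x) / m
```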

Then we load the data into the x and y variables, and run our gradient descent algorithm:
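A minimal driver loop; the step size and iteration count are assumptions and may need tuning, since unscaled data makes the intercept converge slowly:

```python
theta0, theta1 = 0.0, 0.0
alpha = 1e-4                 # assumed step size
for _ in range(200_000):     # assumed iteration count
    g0, g1 = gradient(theta0, theta1, x, y)
    theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
print(theta0, theta1)
```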

For comparison, we get the actual values of the coefficients as below:
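One way to obtain reference values is NumPy's least-squares fit:

```python
slope, intercept = np.polyfit(x, y, 1)   # exact least-squares coefficients
print(intercept, slope)
```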

Also, the best-fit line is drawn below:
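Drawing the line with the fitted parameters:

```python
plt.scatter(x, y)
plt.plot(x, h(theta0, theta1, x), color="red", label="gradient descent fit")
plt.legend()
plt.show()
```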

We can see that the values we got by applying the gradient descent algorithm are pretty close to the actual ones.

But here comes the real problem: the dataset we manipulated is a small one with just 15 points. We need to calculate the gradient in every step of our algorithm; in this example that is not a big deal, but imagine a dataset containing a million data points, where this method would be far less efficient. Here Stochastic Gradient Descent comes to the rescue!

As in the example above, batch gradient descent must consider every example in the entire training set on every step. In Stochastic Gradient Descent, we update our values after looking at each item in the training set, so we start making progress right away. Recall the regression example above: we calculate the gradient for each of the two theta values as follows,

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$

where $h_\theta(x) = \theta_0 + \theta_1 x$.

Then we follow the algorithm below (where $\alpha$ is a non-adapting step size):

1. Choose initial guess $x_0$.

2. For k = 0, 1, 2, … do

$s_k = -\nabla f(x_k)$

$x_{k+1} = x_k +\alpha s_k$

end for.

When the sample data is small, as above, calculating the gradient is not very costly. But for large datasets this is not the case. So instead, we consider the stochastic gradient descent algorithm for simple linear regression, as follows, where m is the size of the dataset:

1. Randomly shuffle the data set

2. For k = 0, 1, 2, … do

For i = 1 to m do

$\begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} = \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} - \alpha \begin{pmatrix} 2(h_{\theta}(x_i) - y_i) \\ 2x_i (h_{\theta}(x_i) - y_i) \end{pmatrix}$

end for

end for.

Typically, with stochastic gradient descent, you will run through the entire data set 1 to 10 times (see value for k in line 2 of the pseudocode above), depending on how fast the data is converging and how large the data set is.

Unlike gradient descent, stochastic gradient descent will tend to oscillate near a minimum value rather than continuously getting closer. It may never actually converge to the minimum. One way around this is to slowly decrease the step size as the algorithm runs. However, this is less common than using a fixed step size.

Let's look at another example that illustrates the use of stochastic gradient descent for linear regression. In the example below, we'll create a set of 500,000 points scattered around a line, for values of x between 0 and 100:
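One way to generate such data; the true coefficients (taken to be intercept 17 and slope 2, matching the results quoted below) and the noise scale are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500_000
x = rng.uniform(0, 100, m)
y = 17 + 2 * x + rng.normal(0, 10, m)   # points scattered around y = 17 + 2x
```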

The first step is to randomly shuffle the dataset; then we set up the h function and the cost function:
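A sketch of that setup, continuing from the generated data above:

```python
order = rng.permutation(m)       # randomly shuffle the dataset
x, y = x[order], y[order]

def h(theta, x_i):
    # hypothesis for one point: theta[0] + theta[1] * x_i
    return theta[0] + theta[1] * x_i

def cost(theta, x_i, y_i):
    # squared-error cost for a single sample
    return 0.5 * (h(theta, x_i) - y_i) ** 2
```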

Then we run our stochastic gradient descent algorithm. To see its progress, we take a cost measurement at every step; every 10,000 steps, we average the cost from the last 10,000 steps and append that to our cost_list variable. We run through the entire list 10 times here:
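A sketch of that loop; the step size is an assumption (too large a value diverges here because x ranges up to 100), and a pure-Python loop over five million updates is slow, so treat this as illustration rather than production code:

```python
theta = np.zeros(2)
alpha = 1e-4                    # assumed step size
cost_list, window = [], []
for _ in range(10):             # 10 passes over the shuffled data
    for i in range(m):
        error = h(theta, x[i]) - y[i]
        theta[0] -= alpha * error          # gradient of the one-sample cost
        theta[1] -= alpha * error * x[i]
        window.append(0.5 * error ** 2)
        if len(window) == 10_000:          # average cost every 10,000 steps
            cost_list.append(sum(window) / len(window))
            window = []
print(theta)
```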

Clearly, the resulting values are close to the true values of 17 and 2.

Finally, we plot the cost versus the number of iterations. The cost goes down quickly at first, but starts to level off as we go through more iterations.
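One way to draw that plot from the cost_list collected above:

```python
import matplotlib.pyplot as plt

steps = np.arange(1, len(cost_list) + 1) * 10_000
plt.plot(steps, cost_list)
plt.xlabel("Iteration")
plt.ylabel("Average cost over the last 10,000 steps")
plt.show()
```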

The difference between gradient descent and stochastic gradient descent

Let's jump out of the math and look at an example of how batch gradient descent and stochastic gradient descent update differently on a real dataset. Suppose our neural network has to predict the probability that a person will buy a house, based on their salary, spending, and savings; the actual output "Buy House" is y.

So in normal gradient descent, we take all of our rows and plug them into the neural network. After plugging them in, we calculate the cost function with the help of the formula $\text{cost} = \frac{1}{2}(y - \hat{y})^2$. Based on the cost function, we adjust the weights. This is the process that happens when we apply batch gradient descent: we take the whole batch from our sample.

In stochastic gradient descent, by contrast, we take the rows one by one. Take a row, run the neural network, and adjust the weights based on the cost function. Then move to the second row, run the neural network, and update the weights based on the cost function. This process repeats for all the other rows. So in the stochastic version, we are adjusting the weights after every single row rather than doing everything together and then adjusting the weights.

Why do we need to randomly shuffle our dataset?

We need to randomly shuffle the training data because ordered data can introduce bias into the gradient estimates. To see a clearer picture of this, let's start with a loss function $L(\theta, x)$, where $\theta$ is the parameter vector and $x_1, x_2, \dots, x_N$ are the data points. We are trying to minimize the loss over the training data:

$J(\theta) = \sum_{x \in \text{training set}} L(\theta, x)$

If we do not shuffle the data, and the training data $\{x_1, x_2, \dots, x_N\}$ is fed to stochastic gradient descent in that fixed order, then the gradient at any particular step is the gradient of a fixed function:

$g_1 = \nabla_\theta L(\theta, x_1), \quad g_2 = \nabla_\theta L(\theta, x_2), \quad \dots, \quad g_N = \nabla_\theta L(\theta, x_N)$

Each of these $g_j$ is a biased estimate of $\nabla_\theta J(\theta)$. Why is that? Take $E[g_2]$ as an example; we have

$E[g_2] = g_2 = \nabla_\theta L(\theta, x_2)$

This holds because $g_2$ is constant; there is nothing random about it. If we know our dataset and do not randomize it, then $g_2$, like all the other $g$'s, is known and fixed: its value is its own expectation, with no remaining uncertainty. Worse, these values are not the true value of the gradient we want to compute. This is what makes the estimates biased, which is generally not preferred.

Now look at what happens if we do randomize the data before training. None of the gradients $g_i$ is determined anymore. Each of them has the expected value

$E[g_i] = \frac{1}{N} \sum_{j=1}^{N} \nabla_\theta L(\theta, x_j) = \frac{1}{N} \nabla_\theta J(\theta),$

and this is the mean gradient of the whole training data. If N is large, the law of large numbers says that the running average of the sampled gradients approaches this theoretical mean, i.e. the true gradient we want to estimate. And in any case, the empirical gradient we use to update our parameters now has an expected value equal to the true gradient (up to the constant factor 1/N).

In conclusion, randomly permuting the training data gives an unbiased estimate of the true gradient, leading to better updates that enhance the training process.

Highlighting the differences between Gradient Descent and SGD

Batch gradient descent needs to collect lots of feedback before making adjustments, but needs to do fewer adjustments.

Stochastic gradient descent makes many small adjustments, but spends less time collecting feedback in between.

Batch gradient descent is preferable if the full population is small; stochastic gradient descent is preferable if the full population is very large.

Batch gradient descent methods can be made parallel if you have access to more hardware, as you can collect all the feedback for a batch in parallel.

Stochastic gradient descent does not readily lend itself to parallelization, as you need the feedback from one iteration to proceed with the next iteration.

In conclusion, basic gradient descent is more suitable for a small dataset, where its cost per update is affordable. When dealing with a large dataset, SGD is the better idea: even though there is some oscillation during the updating process, it reaches the neighborhood of the local minimum faster. As the path taken does not matter in the optimization process, SGD is preferred even though it takes more adjustments along the way to the optimal point.
