Tradeoff batch size vs. number of iterations to train a neural network
When training a neural network, what difference does it make to set:
- batch size to $a$ and number of iterations to $b$
- vs. batch size to $c$ and number of iterations to $d$
where $ ab = cd $?
To put it another way: assuming we train the neural network on the same number of training examples, how should we set the optimal batch size and number of iterations? (Here batch size * number of iterations = number of training examples shown to the neural network, with the same training example potentially shown several times.)
I am aware that the higher the batch size, the more memory space one needs, and it often makes computations faster. But in terms of performance of the trained network, what difference does it make?
From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :
The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.
The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.
Also, some good insights from Ian Goodfellow answering "Why not use the whole training set to compute the gradient?" on Quora:
The size of the learning rate is limited mostly by factors like how curved the cost function is. You can think of gradient descent as making a linear approximation to the cost function, then moving downhill along that approximate cost. If the cost function is highly non-linear (highly curved) then the approximation will not be very good for very far, so only small step sizes are safe. You can read more about this in Chapter 4 of the deep learning textbook, on numerical computation: http://www.deeplearningbook.org/contents/numerical.html
When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html
Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
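To make the diminishing-returns point concrete, here is a small sketch (my own illustration, not from the quote) that estimates how the noise in a minibatch gradient shrinks with the batch size m, using synthetic per-sample "gradients":

```python
import numpy as np

# Draw made-up per-sample gradient values; their mean plays the role of
# the "true" gradient and their spread the per-sample noise.
rng = np.random.default_rng(0)
per_sample_grads = rng.normal(loc=1.0, scale=5.0, size=100_000)

# The standard error of a minibatch mean shrinks like 1/sqrt(m):
# quadrupling the batch size only halves the noise.
for m in (1, 4, 16, 64, 256):
    means = per_sample_grads[: (100_000 // m) * m].reshape(-1, m).mean(axis=1)
    print(f"batch size {m:4d}: gradient-estimate std ~ {means.std():.3f}")
```

Under these assumed numbers, going from m=1 to m=4 roughly halves the noise, matching the O(sqrt(m)) reduction Goodfellow describes, while costing 4x the computation.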
Since batch_size only divides the training data set into batches, would it make sense to rearrange the dataset (non-temporal) to have uniform variance across all batches? Doing so might reduce the need for batch-size optimization, which is only good for finding faster convergence. If so, how would it be done? I was thinking it may not provide a flatter minimum. Would appreciate detailed guidance.
I assume you're talking about reducing the batch size in a mini batch stochastic gradient descent algorithm and comparing that to larger batch sizes requiring fewer iterations.
Andrew Ng provides a good discussion of this, with some visuals, in his online Coursera class on ML and neural networks. So the rest of this post is mostly a regurgitation of his teachings from that class.
Let's take the two extremes. On one side, each gradient descent step uses the entire dataset: you're computing the gradients for every sample. In this case you know exactly the best direction towards a local minimum. You don't waste time going the wrong direction. So in terms of the number of gradient descent steps, you'll get there in the fewest.
Of course computing the gradient over the entire dataset is expensive. So now we go to the other extreme: a batch size of just 1 sample. In this case the gradient of that sample may take you in completely the wrong direction. But hey, the cost of computing the one gradient was quite trivial. As you take steps with regard to just one sample you "wander" around a bit, but on average you head towards an equally reasonable local minimum as in full-batch gradient descent.
This might be a moment to point out that I have seen some literature suggesting that perhaps this bouncing around that 1-sample stochastic gradient descent does might help you bounce out of a local minimum that full-batch mode wouldn't avoid, but that's debatable. Some other good answers here address this question more directly than I have.
In terms of computational power, while the single-sample stochastic GD process takes many many more iterations, you end up getting there for less cost than the full batch mode, "typically." This is how Andrew Ng puts it.
Now let's find the middle ground you asked about. We might realize that modern BLAS libraries make computing vector math quite efficient, so computing 10 or 100 samples at once, presuming you've vectorized your code properly, will be barely more work than computing 1 sample (you gain memory call efficiencies as well as computational tricks built into most efficient math libraries). And averaging over a batch of 10, 100, 1000 samples is going to produce a gradient that is a more reasonable approximation of the true, full batch-mode gradient. So our steps are now more accurate, meaning we need fewer of them to converge, and at a cost that is only marginally higher than single-sample GD.
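As a toy illustration of that vectorized middle ground (my own sketch using a made-up least-squares model, not Andrew Ng's code), the averaged gradient over a whole minibatch is a single matrix multiply:

```python
import numpy as np

def minibatch_gradient(w, X, y):
    # X: (m, d) minibatch of inputs, y: (m,) targets, w: (d,) weights.
    # Mean-squared-error gradient over the whole batch in one BLAS call:
    # computing 100 samples at once costs barely more than 1.
    residual = X @ w - y                  # (m,)
    return 2.0 * X.T @ residual / len(y)  # (d,) averaged gradient

# Plain minibatch gradient descent on a synthetic problem.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * minibatch_gradient(w, X, y)
```

The per-step cost is dominated by the two matrix products, which efficient math libraries handle far better than a Python loop over single samples.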
Optimizing the exact size of the mini-batch you should use is generally left to trial and error. Run some tests on a sample of the dataset with numbers ranging from say tens to a few thousand and see which converges fastest, then go with that. Batch sizes in those ranges seem quite common across the literature. And if your data truly is IID, then the central limit theorem on variation of random processes would also suggest that those ranges are a reasonable approximation of the full gradient.
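That trial-and-error procedure could be sketched like this, where `train_and_eval` is a hypothetical helper (not a real API) that trains briefly on a sample of the dataset and returns the validation error after a fixed budget:

```python
def pick_batch_size(train_and_eval, candidates=(32, 64, 128, 256, 1024)):
    # Run the same short training budget once per candidate batch size
    # and keep whichever converged furthest (lowest validation error).
    results = {b: train_and_eval(batch_size=b) for b in candidates}
    return min(results, key=results.get)
```

The candidate range of tens to a few thousand mirrors what the answer describes as common in the literature.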
Deciding exactly when to stop iterating is typically done by monitoring your generalization error on a held-out validation set and choosing the point at which the validation error is at its lowest. Training for too many iterations will eventually lead to overfitting, at which point your error on the validation set will start to climb. When you see this happening, back up and stop at the optimal point.
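A hedged sketch of that stopping rule (the function names and the `patience` window are my own choices, not from the answer): keep the parameters from the epoch with the lowest validation error, and stop once the error has not improved for several epochs.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    # train_step() runs one epoch of training; validate() returns the
    # current validation error. Stop after `patience` epochs without
    # improvement and report the best epoch seen.
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()
        err = validate()
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
            # in a real setup, checkpoint the weights here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation error has been climbing: overfitting
    return best_epoch, best_err
```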
TL;DR: Too large a mini-batch size usually leads to a lower accuracy!
For those interested, here's an explanation.
There are two notions of speed:
- Computational speed
- Speed of convergence of an algorithm
Computational speed is simply the speed of performing numerical calculations in hardware. As you said, it is usually higher with a larger mini-batch size. That's because linear algebra libraries use vectorization for vector and matrix operations to speed them up, at the expense of using more memory. Gains can be significant up to a point. From my experience, there is a point after which there are only marginal gains in speed, if any. The point depends on the data set, the hardware, and the library used for numerical computations under the hood.
But, let's not forget that there is also the other notion of speed, which tells us how quickly our algorithm converges.
Firstly, what does it mean for our algorithm to converge? Well, it's up to us to define and decide when we are satisfied with an accuracy, or an error, that we get, calculated on the validation set. We can either define it in advance and wait for the algorithm to come to that point, or we can monitor the training process and decide to stop it when the validation error starts to rise significantly (the model starts to overfit the data set). We really shouldn't stop it right away, the first moment the error starts to rise, if we work with mini batches, because we use Stochastic Gradient Descent, SGD. In case of (full batch) Gradient Descent, after each epoch, the algorithm will settle in a minimum, be it a local or the global one. SGD never really settles in a minimum. It keeps oscillating around it. It could go on indefinitely, but it doesn't matter much, because it's close to it anyway, so the chosen values of parameters are okay, and lead to an error not far away from the one found at the minimum.
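A toy numerical illustration of that contrast (entirely my own made-up least-squares example, with a fixed constant learning rate): full-batch gradient descent settles essentially at the minimum, while batch-size-1 SGD keeps oscillating around it.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 1)), rng.normal(size=64)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # exact minimizer

def grad(w, Xb, yb):
    # mean-squared-error gradient over the (mini)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd = np.zeros(1)
for _ in range(200):                   # full batch: settles at the minimum
    w_gd -= 0.1 * grad(w_gd, X, y)

w_sgd = np.zeros(1)
for i in range(200):                   # batch of 1: keeps wandering nearby
    j = i % 64
    w_sgd -= 0.1 * grad(w_sgd, X[j:j + 1], y[j:j + 1])

print("GD distance from minimum: ", float(abs(w_gd - w_star)[0]))
print("SGD distance from minimum:", float(abs(w_sgd - w_star)[0]))
```

The SGD iterate stays close to the minimum without ever settling in it, which is exactly why the parameter values it yields are still perfectly usable.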
Now, after all that theory, there's a "catch" that we need to pay attention to. When using a smaller batch size, calculation of the error has more noise than when we use a larger batch size. One would say, well, that's bad, isn't it? The thing is, that noise can help the algorithm jump out of a bad local minimum and have more chance of finding either a better local minimum, or hopefully the global minimum.
Thus, if we can find a better solution more quickly by using a smaller batch size instead of a larger one, helped by that "unwanted" noise, we can trade off the total time it takes for our algorithm to find a satisfactory solution against a higher accuracy.
What I want to say is, for a given accuracy (or error), smaller batch size may lead to a shorter total training time, not longer, as many believe.
Or, if we decide to keep the same training time as before, we might get a slightly higher accuracy with a smaller batch size, and we most probably will, especially if we have chosen our learning rate appropriately.
If you have time, check out this paper: Systematic evaluation of CNN advances on the ImageNet. In particular, see "3.7. Batch size and learning rate" and Figure 8. You will see that large mini-batch sizes lead to worse accuracy, even when the learning rate is tuned with a heuristic.
In general, a batch size of 32 is a good starting point, and you should also try 64, 128, and 256. Other values (lower or higher) may be fine for some data sets, but the given range is generally the best to start experimenting with. Though below 32 it might get too slow, because of significantly lower computational speed from not exploiting vectorization to the full extent. If you get an "out of memory" error, you should try reducing the mini-batch size anyway.
So, it's not simply about using the largest possible mini-batch size that fits into memory.
To conclude, and to answer your question: a smaller mini-batch size (not too small) usually leads not only to a smaller number of iterations of the training algorithm than a large batch size does, but also to a higher accuracy overall, i.e., a neural network that performs better, in the same amount of training time or less.
Don't forget that the higher noise can help it jump out of a bad local minimum, rather than leaving it stuck in it.
I'm adding another answer to this question to reference a new (2018) ICLR conference paper from Google which almost directly addresses this question.
Title: Don't Decay the Learning Rate, Increase the Batch Size
The abstract from the above paper is copied here:
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ. Finally, one can increase the momentum coefficient m and scale B∝1/(1−m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes.
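If I read the recipe right, wherever a schedule would decay the learning rate by some factor, one instead grows the batch size by that same factor. A hypothetical sketch of repurposing an existing learning-rate schedule this way (the numbers are made up, not from the paper):

```python
def repurpose_schedule(lr_schedule, base_batch, max_batch):
    # Hold the learning rate fixed at lr_schedule[0] and, at each stage
    # where the schedule would have decayed it, scale the batch size up
    # by the same factor instead, capped by memory / dataset size.
    base_lr = lr_schedule[0]
    batch_schedule = []
    for lr in lr_schedule:
        scale = base_lr / lr  # decay factor at this stage
        batch_schedule.append(min(round(base_batch * scale), max_batch))
    return batch_schedule

print(repurpose_schedule([0.1, 0.01, 0.001], base_batch=128, max_batch=8192))
# → [128, 1280, 8192]
```

The cap matters in practice: once the batch can grow no further, the paper's approach would fall back to decaying the learning rate as usual.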
A larger memory requirement seems a bad trade-off for simply avoiding decreasing a value. Also, IMHO, a memory footprint that grows during training makes for a less, not more, scalable algorithm.
More memory is not necessarily required for larger batch sizes. One common method is to accumulate the gradients over several normal-sized batches and then perform a single weight update using the average (or sum) of those gradients. This effectively simulates a larger batch size. Example: https://stackoverflow.com/questions/46772685/how-to-accumulate-gradients-in-tensorflow
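A minimal framework-agnostic sketch of this accumulation trick (the `grad_fn` callback and all names are hypothetical, not from the linked TensorFlow example):

```python
import numpy as np

def accumulated_step(w, grad_fn, small_batches, lr=0.1):
    # Average the gradients of several normal-sized batches, then apply
    # one weight update; with equal-sized batches this matches a single
    # step over one batch that is len(small_batches) times larger.
    accum = np.zeros_like(w)
    for X, y in small_batches:
        accum += grad_fn(w, X, y)   # only one small batch in memory at a time
    accum /= len(small_batches)
    return w - lr * accum
```

The memory cost is one small batch plus one gradient-sized buffer, regardless of the effective batch size being simulated.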
I'll share some empirical experience here. I ran an experiment with batch size 4 and batch size 4096. With size 4096 the network performs 1024x fewer weight updates. So my intuition is that larger batches do fewer and coarser search steps for the optimal solution, and so by construction will be less likely to converge on the optimal solution.