Why normalize images by subtracting dataset's image mean, instead of the current image mean in deep learning?
There are some variations on how to normalize the images but most seem to use these two methods:
- Subtract the mean per channel calculated over all images (e.g. VGG_ILSVRC_16_layers)
- Subtract the mean per pixel/channel calculated over all images (e.g. CNN_S, also see Caffe's reference network)
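To make the distinction concrete, here is a minimal numpy sketch of the two methods (the array shapes and values are hypothetical, just for illustration):

```python
import numpy as np

# Hypothetical dataset: N images, H x W spatial size, 3 channels, values in [0, 255]
images = np.random.rand(100, 32, 32, 3) * 255

# Method 1: per-channel mean over all images -> one scalar per channel (VGG-style)
channel_mean = images.mean(axis=(0, 1, 2))   # shape (3,)
normalized_1 = images - channel_mean

# Method 2: per-pixel/channel mean over all images -> one H x W x 3 "mean image"
# (CNN_S / Caffe reference network style)
pixel_mean = images.mean(axis=0)             # shape (32, 32, 3)
normalized_2 = images - pixel_mean
```

Note that both use statistics computed over the whole dataset, not over the individual image being normalized.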
The natural approach would in my mind be to normalize each image. An image taken in broad daylight will cause more neurons to fire than a night-time image, and while that may inform us of the time of day, we usually care more about interesting features such as edges.
Pierre Sermanet refers in 3.3.3 to local contrast normalization, which would be per-image based, but I haven't come across this in any of the examples/tutorials that I've seen. I've also seen an interesting Quora question and Xiu-Shen Wei's post, but they don't seem to support the two approaches above.
What exactly am I missing? Is this a color normalization issue, or is there a paper that actually explains why so many use this approach?
I don't know the answer, but have you tried each of the methods? Is there any difference in performance?
@user112758 - implementing them is a little painful (especially for the by-pixel) and my experience is that normalizing per image works fine but my data is not that representative. I'll try to experiment with the normalization but I'm curious to hear the motivation behind these (in my mind) strange normalization procedures.
Subtracting the dataset mean serves to "center" the data. Additionally, you would ideally also divide by the stddev of that feature or pixel if you want to normalize each feature value to a z-score.
The reason we do both of those things is that in the process of training our network, we're going to be multiplying (weights) and adding to (biases) these initial inputs in order to cause activations, which we then backpropagate with the gradients to train the model.
We'd like in this process for each feature to have a similar range so that our gradients don't go out of control (and that we only need one global learning rate multiplier).
Another way you can think about it is that deep learning networks traditionally share many parameters - if you didn't scale your inputs in a way that resulted in similarly-ranged feature values (i.e. over the whole dataset, by subtracting the mean), sharing wouldn't happen very easily, because to one part of the image the weight *w* would be too large and to another it would be too small.
You will see in some CNN models that per-image whitening is used, which is more along the lines of your thinking.
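To illustrate the difference between the two, here is a hedged numpy sketch: dataset-level z-scoring shares one set of statistics across all images, while per-image whitening (per-image standardization, similar in spirit to TensorFlow's `per_image_standardization`) computes statistics from each image alone. The data here is made up for demonstration:

```python
import numpy as np

# Hypothetical data: N flattened images of 32*32*3 = 3072 pixel features
images = np.random.rand(100, 3072) * 255

# Dataset-level z-scoring: one mean/stddev per feature, shared by every image
ds_mean = images.mean(axis=0)
ds_std = images.std(axis=0)
dataset_normalized = (images - ds_mean) / ds_std

# Per-image whitening: each image is standardized with its OWN statistics,
# so a bright daylight image and a dark night image both end up zero-mean
per_image = (images - images.mean(axis=1, keepdims=True)) \
            / images.std(axis=1, keepdims=True)
```

With the dataset-level version, a uniformly bright image stays brighter than average after normalization (which may itself be informative); with per-image whitening, that global brightness difference is removed.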
Thank you for the answer. I'm familiar with the concept of centering the data and making sure the range is similar in order to get stable gradients. The question is more of why we need to do this over the entire dataset and why this would help in contrast to per-image whitening? I would like a simple reference that shows in some way that this improves learning before I accept the answer. I know that batch normalization is an incredibly powerful technique but I don't see the connection to entire dataset normalization.
If you accept batch normalization is good, then you're already there. The only reason you batch normalize is when you can't fit the full dataset in memory or you're distributing the training (often the same issue). That's why we have batches.
I thought that batches are also the foundation for stochastic gradient descent. Even if I could fit everything into memory I want to update the parameters more frequently than after each epoch.
They are. And you can update however frequently you want - the analytical implications are identical which is what's so nice and scalable about gradient descent. The reason that we use *stochastic* gradient descent (shuffling input order + batching) is to smooth out our hill climbing through gradient space. Given a single point we can't really be sure our update will push us in the direction of local maxima, however if you select enough points, this likelihood becomes higher (in expectation).
I've noticed that some networks only subtract the mean, but don't divide by stddev. I've wondered why they only subtract the mean. But at the same time, the pixel features of a standard RGB image are only going to have an 8-bit depth from 0 to 255, so there's unlikely to be too much variation in RGB images.
How does this help get features into a similar range? If I have two images, one ranging from 0 to 255 and one ranging from 0 to 50 in pixel values, say with a mean of 50 and a stddev of 15, normalizing gives me image 1 ranging from -3.3 to 13.6 and image 2 ranging from -3.3 to 0. They still aren't on the same scale.
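The arithmetic in this comment checks out, which can be verified in a couple of lines (the two pixel ranges and the shared mean/stddev are the hypothetical values from the comment):

```python
import numpy as np

# Shared dataset statistics from the comment's example
mean, std = 50.0, 15.0

img1 = np.array([0.0, 255.0])  # pixel range of image 1
img2 = np.array([0.0, 50.0])   # pixel range of image 2

z1 = (img1 - mean) / std  # -> approx [-3.33, 13.67]
z2 = (img2 - mean) / std  # -> approx [-3.33, 0.0]
```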
@MaxGordon, just seeing this post after a few years, I realize you meant batchnorm, not stochastic gradient descent while batching (which I was referring to). Very different concepts. For future readers, batchnorm is for reducing covariate shift (difference between train & test distributions), preventing vanishing gradients, and as a side effect, acts as a regularizer.
@Daniel, not every image after normalizing will be "in the same range", but as a group, a normalized feature's values will. See: https://en.wikipedia.org/wiki/Standard_score