How to choose the number of hidden layers and nodes in a feedforward neural network?
Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a feed-forward neural network? I'm interested in automated ways of building neural networks.
Among all the great answers, I found this paper helpful: http://dstath.users.uth.gr/papers/IJRS2009_Stathakis.pdf
I realize this question has been answered, but I don't think the extant answer really engages the question beyond pointing to a link generally related to the question's subject matter. In particular, the link describes one technique for programmatic network configuration, but that is not a "[a] standard and accepted method" for network configuration.
By following a small set of clear rules, one can programmatically set a competent network architecture (i.e., the number and type of neuronal layers and the number of neurons comprising each layer). Following this schema will give you a competent architecture, but probably not an optimal one.
But once this network is initialized, you can iteratively tune the configuration during training using a number of ancillary algorithms; one family of these works by pruning nodes based on (small) values of the weight vector after a certain number of training epochs--in other words, eliminating unnecessary/redundant nodes (more on this below).
So every NN has three types of layers: input, hidden, and output.
Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.
The Input Layer
Simple--every NN has exactly one of them--no exceptions that I'm aware of.
With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.
The Output Layer
Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.
Is your NN going to run in Machine Mode or Regression Mode? (The ML convention of using a term that is also used in statistics but assigning a different meaning to it is very confusing.) Machine Mode returns a class label (e.g., "Premium Account"/"Basic Account"); Regression Mode returns a value (e.g., price).
If the NN is a regressor, then the output layer has a single node.
If the NN is a classifier, then it also has a single node unless softmax is used in which case the output layer has one node per class label in your model.
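Those two rules are simple enough to put in code. The following helper is hypothetical (not part of any library), just a literal transcription of the rules above:

```python
def output_layer_size(task, n_classes=None, use_softmax=True):
    """Number of output nodes, per the rules above.

    task: 'regression' or 'classification' (hypothetical helper,
    written only to illustrate the sizing rules in this answer).
    """
    if task == "regression":
        return 1              # a regressor has a single output node
    if use_softmax:
        return n_classes      # softmax: one node per class label
    return 1                  # otherwise a single node (e.g., one sigmoid)

print(output_layer_size("regression"))                   # 1
print(output_layer_size("classification", n_classes=3))  # 3
```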
The Hidden Layers
So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.
How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.
Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems.
So what about the size of the hidden layer(s)--how many neurons? There are some empirically derived rules of thumb; of these, the most commonly relied upon is 'the optimal size of the hidden layer is usually between the size of the input and size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more.
In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
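A sketch of those two rules in code, assuming a softmax classifier so the output size follows from the class labels (the toy data here is made up for illustration):

```python
import numpy as np

# Toy training data: 150 samples, 4 features, 3 class labels.
X = np.zeros((150, 4))
y = np.arange(150) % 3

n_input = X.shape[1]                        # one neuron per feature (column)
n_output = len(np.unique(y))                # softmax: one node per class label
n_hidden = round((n_input + n_output) / 2)  # rule (ii): mean of in/out sizes

print(n_input, n_hidden, n_output)          # 4 4 3 -> a 4-4-3 architecture
```

From here, the two-rule configuration is just a starting point; the pruning step described below can then trim any excess from that hidden layer.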
Optimization of the Network Configuration
Pruning describes a set of techniques to trim network size (by nodes, not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training, then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons if you add a pruning step.
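To make that informal weight-matrix inspection concrete, here is one way to flag candidate nodes; the matrix and the threshold are invented for illustration, not taken from any trained network:

```python
import numpy as np

# Hypothetical trained weights: rows = input nodes, cols = hidden nodes.
W = np.array([[ 0.80, -0.01,  0.55],
              [-0.62,  0.02,  0.71],
              [ 0.45, -0.03, -0.50]])

threshold = 0.05  # arbitrary cutoff for "very close to zero"

# A hidden node whose entire incoming weight column is near zero
# contributes almost nothing, so it is a pruning candidate.
near_zero = np.all(np.abs(W) < threshold, axis=0)
candidates = np.where(near_zero)[0]
print(candidates)   # [1] -> hidden node 1 looks prunable
```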
Put another way, by applying a pruning algorithm to your network during training, you can approach optimal network configuration; whether you can do that in a single "up-front" step (e.g., with a genetic-algorithm-based approach) I don't know, though I do know that for now, this two-step optimization is more common.
You state that the majority of problems need only one hidden layer. Perhaps it is better to say that NNs with more hidden layers are extremely hard to train (if you want to know how, check the publications of Hinton's group at U of Toronto, "deep learning") and thus those problems that require more than one hidden layer are considered "non-solvable" by neural networks.
The same functions can be represented with exponentially less parameters, leading to better generalization.
You write *If the NN is a regressor, then the output layer has a single node.* Why only a single node? Why can't I have multiple continuous outputs?
@gerrit You can definitely have multiple continuous outputs if your target output is vector-valued. Defining an appropriate loss function for vector-valued outputs can be a bit trickier than with one output though.
Where trickier means: you basically sum the individual errors. Yet, they might need to be weighted, otherwise one error dominates everything else (e.g. because it has a much larger range). ZScores on the training outputs is one obvious way to do this, but might not be what you want.
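A sketch of the z-score idea from the comment above: standardize each target dimension so no single output's error scale dominates the summed loss. The numbers are toy values invented for illustration:

```python
import numpy as np

# Toy vector-valued targets: column 0 ~ prices (large range),
# column 1 ~ rates (small range).
Y = np.array([[100.0, 0.10],
              [200.0, 0.20],
              [300.0, 0.30]])
pred = np.array([[110.0, 0.15],
                 [190.0, 0.25],
                 [310.0, 0.35]])

mu, sigma = Y.mean(axis=0), Y.std(axis=0)

# Raw summed squared error: the price column dominates completely.
raw = ((pred - Y) ** 2).sum()

# Standardize targets and predictions before summing, so each output
# contributes on a comparable scale.
scaled = ((((pred - mu) / sigma) - ((Y - mu) / sigma)) ** 2).sum()

print(raw > scaled)   # the raw loss is dominated by the large-range output
```

As the comment notes, z-scoring the training outputs is one obvious weighting, but it may not be the weighting you actually want for your problem.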
I thought it was the opposite than this: _If the NN is a classifier, then it also has a single node unless softmax is used in which case the output layer has one node per class label in your model._
You could construct a neural network with multiple input layers if you wanted to. The first layer would have to be an input layer, but successive layers could be comprised of both hidden neurons and input neurons, or, if you will, you can have an additional input layer on the same level as a hidden layer.
@doug Thank you for this wonderful answer. This allowed me to reduce my ANN from 3 hidden layers down to 1 and achieve the same classification accuracy by setting the right number of hidden neurons... I just used the average of the input and output layer sizes. Thanks!
@davips No, it is written correctly here. Softmax works similarly to a sigmoid or another map to [-1,1] space, except that it maps to multiple dimensions/outputs, just as Doug mentions.
@MikeWilliamson I suspect it's a terminology/use "issue" with a lot of the common ML frameworks. It seems common to have a bunch of "output" neurons, then you softmax them, and then you pick the highest one as the single output; and people incorrectly lump the "pick highest" into softmax.
"Simple--every NN has exactly one of them--no exceptions that I'm aware of." Not true, here's an example in the Keras documentation of two input (and two output) layers: https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models
Does the answer change in any way if the MLP is for approximating the Q function in a RL task?
Does anyone have a paper or a book where the fact that "One hidden layer is sufficient for the large majority of problems" is proved or cited? Thank you!
This paper argues that if you don't have any hidden layer wider than the input layer, you won't be able to form disconnected decision regions in the input space. So, I think it should be beneficial to experiment with that instead of just the average between the input and output layer. https://arxiv.org/abs/1803.00094
Thanks for this nice answer!! So let us consider the following example: Model 1) 500 input units, 100 hidden units, 200 output units; Model 2) 500 input units, 100 hidden units, 5 hidden units (second hidden layer), 200 output units. The benefit of Model 1 compared to Model 2 is: faster training, lower generalization error (?); and for the benefit of Model 2 compared to Model 1: due to the high variance in the number of units, I'm a bit unsure how to put the effect into words. Can you think of an advantage of this model (compared to Model 1)?