What does the hidden layer in a neural network compute?
I'm sure many people will respond with 'let me google that for you' links, so I want to say up front that I have tried to figure this out; please forgive my lack of understanding here. I cannot figure out how the practical implementation of a neural network actually works.
I understand the input layer and how to normalize the data, and I also understand the bias unit. But when it comes to the hidden layer, what the actual computation is in that layer and how it maps to the output is a little foggy. I've seen diagrams with question marks in the hidden layer, boolean functions like AND/OR/XOR, activation functions, input nodes that map to all of the hidden units, and input nodes that map to only a few hidden units each, so I have a few questions on the practical aspect. Of course, a simple explanation of the entire neural network process, like you would explain to a child, would be awesome.
What computations are done in the hidden layer?
How are those computations mapped to the output layer?
How does the output layer work? De-normalizing the data from the hidden layer?
Why are some layers in the input layer connected to the hidden layer and some are not?
People around here are nice, I have never seen a “let me google that for you” answer but many surprisingly thorough and insightful answers to what seemed at first to be basic questions. Unfortunately, I can't help you with yours but it seems quite relevant so I am happily voting it up.
Thanks for the comment and the vote Gael, I'm probably a bit jaded by the SO community as we all know how those folks can get :) Glad to see more of a spirit of collaboration over here as opposed to trying to earn badges and points by editing/closing questions.
I am not an expert in neural networks specifically, although I do get involved in their applications and methods. My maybe-not-so-helpful answer would be that the specific computations in the hidden layer depend on the 'cost function' that you are imposing on your output, i.e., what you are trying to achieve. For example, if you want to group the input elements into clustered sets, you will compute distances between elements in the hidden layer. This may go through various iterations and optimization cycles within this layer, until you meet an error criterion that allows the process to 'leave' this layer.
Three sentence version:
Each layer can apply any function you want to the previous layer (usually a linear transformation followed by a squashing nonlinearity).
The hidden layers' job is to transform the inputs into something that the output layer can use.
The output layer transforms the hidden layer activations into whatever scale you wanted your output to be on.
Like you're 5:
If you want a computer to tell you if there's a bus in a picture, the computer might have an easier time if it had the right tools.
So your bus detector might be made of a wheel detector (to help tell you it's a vehicle) and a box detector (since the bus is shaped like a big box) and a size detector (to tell you it's too big to be a car). These are the three elements of your hidden layer: they're not part of the raw image, they're tools you designed to help you identify busses.
If all three of those detectors turn on (or perhaps if they're especially active), then there's a good chance you have a bus in front of you.
Neural nets are useful because there are good tools (like backpropagation) for building lots of detectors and putting them together.
Like you're an adult
A feed-forward neural network applies a series of functions to the data. The exact functions will depend on the neural network you're using: most frequently, these functions each compute a linear transformation of the previous layer, followed by a squashing nonlinearity. Sometimes the functions will do something else (like computing logical functions in your examples, or averaging over adjacent pixels in an image). So the roles of the different layers could depend on what functions are being computed, but I'll try to be very general.
Let's call the input vector $x$, the hidden layer activations $h$, and the output activation $y$. You have some function $f$ that maps from $x$ to $h$ and another function $g$ that maps from $h$ to $y$.
So the hidden layer's activation is $f(x)$ and the output of the network is $g(f(x))$.
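As a minimal sketch of $g(f(x))$, assuming the common case where each layer is a linear transformation followed by a logistic sigmoid: the weights below are arbitrary made-up values (not learned ones), and the names `layer`, `W1`, `b1`, `W2`, and `b2` are only illustrative.

```python
import math

def sigmoid(z):
    """Squashing nonlinearity: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: a linear transformation followed by a sigmoid.

    weights[j] holds the incoming weights of unit j, biases[j] its bias.
    Every unit runs the same computation; only its weights differ.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.1]

def f(x):
    """Input vector x -> hidden activations h."""
    return layer(x, W1, b1)

def g(h):
    """Hidden activations h -> output activation y."""
    return layer(h, W2, b2)

x = [1.0, 2.0]
h = f(x)
y = g(h)
print("hidden:", h)
print("output:", y)
```

In a real network the values in `W1`, `b1`, `W2`, and `b2` would be found by training (for example with backpropagation) rather than typed in by hand.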
Why have two functions ($f$ and $g$) instead of just one?
If the level of complexity per function is limited, then $g(f(x))$ can compute things that $f$ and $g$ can't do individually.
An example with logical functions:
For example, if we only allow $f$ and $g$ to be simple logical operators like "AND", "OR", and "NAND", then you can't compute other functions like "XOR" with just one of them. On the other hand, we could compute "XOR" if we were willing to layer these functions on top of each other:
First layer functions:
- Make sure that at least one element is "TRUE" (using OR)
- Make sure that they're not all "TRUE" (using NAND)
Second layer function:
- Make sure that both of the first-layer criteria are satisfied (using AND)
The network's output is just the result of this second function. The first layer transforms the inputs into something that the second layer can use so that the whole network can perform XOR.
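Here is a small sketch of that two-layer XOR, written with threshold units so the link to weights and biases is visible. The particular weights and biases are one hand-picked choice that happens to work (an assumption for illustration, not the only option), and `step` and `unit` are hypothetical helper names.

```python
def step(z):
    """Threshold activation: fire (1) if the weighted sum is positive."""
    return 1 if z > 0 else 0

def unit(inputs, weights, bias):
    """A single threshold unit: weighted sum plus bias, then a step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_network(a, b):
    # First layer: two hidden units computed from the raw inputs.
    at_least_one = unit([a, b], [1.0, 1.0], -0.5)    # OR
    not_both = unit([a, b], [-1.0, -1.0], 1.5)       # NAND
    # Second layer: combine the two hidden units.
    return unit([at_least_one, not_both], [1.0, 1.0], -1.5)  # AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

Neither hidden unit computes XOR on its own; it only appears once the second layer combines them.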
An example with images:
The first layer looks for short pieces of edges in the image: these are very easy to find from raw pixel data, but they're not very useful by themselves for telling you if you're looking at a face or a bus or an elephant.
The next layer composes the edges: if the edges from the bottom hidden layer fit together in a certain way, then one of the eye-detectors in the middle of the left-most column might turn on. It would be hard to make a single layer that was so good at finding something so specific from the raw pixels: eye detectors are much easier to build out of edge detectors than out of raw pixels.
The next layer up composes the eye detectors and the nose detectors into faces. In other words, these will light up when the eye detectors and nose detectors from the previous layer turn on with the right patterns. These are very good at looking for particular kinds of faces: if one or more of them lights up, then your output layer should report that a face is present.
This is useful because face detectors are easy to build out of eye detectors and nose detectors, but really hard to build out of pixel intensities.
So each layer gets you farther and farther from the raw pixels and closer to your ultimate goal (e.g. face detection or bus detection).
Answers to assorted other questions
"Why are some layers in the input layer connected to the hidden layer and some are not?"
The disconnected nodes in the network are called "bias" nodes. There's a really nice explanation here. The short answer is that they're like intercept terms in regression.
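To make the intercept analogy concrete, here is a rough sketch (function names and numbers are made up for illustration) showing that adding a bias term is the same as adding one extra input that is always 1, with the bias as its weight. That constant node has no incoming connections, which is why it looks "disconnected" in diagrams.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Linear part of a unit with an explicit bias (intercept) term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def weighted_sum_with_bias_node(inputs, weights_and_bias):
    """Same computation, with the bias drawn as an extra node fixed at 1."""
    extended = list(inputs) + [1.0]   # the constant "bias node"
    return sum(w * x for w, x in zip(weights_and_bias, extended))

weights = [2.0, -1.0]
bias = 0.5
x = [3.0, 4.0]

explicit = weighted_sum(x, weights, bias)
as_node = weighted_sum_with_bias_node(x, weights + [bias])
print(explicit, as_node)

# With all inputs at zero, the bias is all that is left, just like the
# intercept of a regression line.
at_origin = weighted_sum([0.0, 0.0], weights, bias)
print(at_origin)
```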
"Where do the "eye detector" pictures in the image example come from?"
I haven't double-checked the specific images I linked to, but in general, these visualizations show the set of pixels in the input layer that maximize the activity of the corresponding neuron. So if we think of the neuron as an eye detector, this is the image that the neuron considers to be most eye-like. Folks usually find these pixel sets with an optimization (hill-climbing) procedure.
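A toy sketch of that hill-climbing idea, under heavy assumptions: the "neuron" here is a single sigmoid unit looking directly at a 3x3 image, its weights are invented for the example, and the search is plain random hill climbing. Real visualizations deal with deep networks and typically use gradient-based optimization, so treat this only as an illustration of the principle.

```python
import math
import random

def neuron(image, weights, bias):
    """Activation of one toy neuron: sigmoid of a weighted sum of pixels."""
    z = sum(w * p for w, p in zip(weights, image)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def hill_climb(weights, bias, steps=2000, seed=0):
    """Random hill climbing over pixel intensities in [0, 1]."""
    rng = random.Random(seed)
    image = [0.5] * len(weights)          # start from a uniform grey image
    best = neuron(image, weights, bias)
    for _ in range(steps):
        i = rng.randrange(len(image))     # pick one pixel at random
        old = image[i]
        image[i] = rng.random()           # propose a new intensity
        score = neuron(image, weights, bias)
        if score > best:
            best = score                  # keep the improvement
        else:
            image[i] = old                # otherwise undo the change
    return image, best

# Hypothetical 3x3 "vertical bar" detector, flattened row by row.
weights = [-0.5, 1.0, -0.5] * 3
bias = -1.0

start_activation = neuron([0.5] * 9, weights, bias)
image, activation = hill_climb(weights, bias)

print("activation before:", round(start_activation, 3))
print("activation after: ", round(activation, 3))
for row in range(3):
    print([round(p) for p in image[3 * row: 3 * row + 3]])
```

For this toy detector the search should drift toward an image that is bright in the middle column and dark elsewhere, which is the picture the neuron "considers most bar-like".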
In this paper by some Google folks with one of the world's largest neural nets, they show a "face detector" neuron and a "cat detector" neuron this way, as well as a second way: They also show the actual images that activate the neuron most strongly (figure 3, figure 16). The second approach is nice because it shows how flexible and nonlinear the network is--these high-level "detectors" are sensitive to all these images, even though they don't particularly look similar at the pixel level.
Let me know if anything here is unclear or if you have any more questions.
So is there just one defined algorithm for every single node on a given layer and the weights are what make the outputs different? Or can you program every node on the layer to be different?
@GeorgeMcDowd Yes, usually the weights are the only differences between nodes within a layer. There's no particular reason (apart from computational efficiency) that it has to be this way, though.
Thanks for the answers David, I really appreciate your post. So I get the fact that each neuron uses the weights that eventually go into an activation function that allows it to fire a binary value, and that these binary values computed together can produce more complex output than a single node could compute. I'm still slightly confused about how a program looking at Pixel #1 can say "Oh yes, this is a bus", fire a 1, and then adjust the weight for that output.
@GeorgeMcDowd this gets at the key issue: looking at pixels and identifying busses is hard, as you suggested. Fortunately, looking at pixels and finding edges is easy--that's all the first hidden layer tries to do. The next layer tries to make inferences based on edges, which is much easier than trying to do so based on pixels.
SO should give you some other reward (than just points) for the time and effort you put into this answer!
Thank you for such a great answer. You mentioned `usually a linear transformation followed by a squashing nonlinearity`. I was wondering what are the merits of this approach? Is it used primarily to bound the output values so the outputs are not all over the place? Would an output "filter" of y= 1 for y > 1; y = -1 for y < -1 be comparable?
@JoshuaEnfield I think the logic in the 1980s was a combination of it being similar to how people thought the brain worked, being differentiable everywhere, and having bounded values like you mentioned. Since then, people have found that `f(x) = max(x, 0)` (the "rectified linear unit") often works better, even though it doesn't have many of those properties.
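For a side-by-side feel, here is a small sketch comparing the nonlinearities mentioned in this exchange: the logistic sigmoid and tanh (smooth and bounded), the hard clip proposed in the question (bounded, but completely flat outside [-1, 1]), and the rectified linear unit (unbounded above, flat for negative inputs). The function names are illustrative only.

```python
import math

def sigmoid(z):
    """Smooth squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hard_clip(z):
    """The 'filter' from the question: clamp values to [-1, 1]."""
    return max(-1.0, min(1.0, z))

def relu(z):
    """Rectified linear unit: f(x) = max(x, 0)."""
    return max(z, 0.0)

for z in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(
        f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):+.3f}  "
        f"clip={hard_clip(z):+.1f}  relu={relu(z):.1f}"
    )
```

One practical difference, as far as gradient-based training goes: wherever a function is flat, its slope is zero, so units sitting in that region pass back no learning signal.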
So, would it be correct to say that each node in the hidden layer corresponds to a small hypothesis, and the overall goal of the network is to weigh the correctness of the hypotheses? For example, in your ELI5 example, you chose 3 hypotheses: a bus has wheels, a bus is a square shape, and a bus is larger than a car. These are assumptions that you made and not something the network would/could/should figure out for you? And if you wanted to make your bus detector better you'd come up with other hypotheses to feed into the network, such as "a bus is red", "a bus has a sign on the front" etc?
Wow, this was such an incredible answer. Your description of using hidden layers to build the bus was just what I needed to get over my hurdle of understanding the purpose and power of multi-layer networks. Thank you!!