### What are the advantages of ReLU over sigmoid function in deep neural networks?

The state of the art for non-linearities in deep neural networks is to use rectified linear units (ReLU) instead of the sigmoid function. What are the advantages?

I know that training a network with ReLUs tends to be faster, and that ReLU is more biologically inspired. What are the other advantages? (That is, are there any disadvantages to using sigmoid?)

DaemonMaker Correct answer

6 years ago

Two additional major benefits of ReLUs are sparsity and a reduced likelihood of vanishing gradient. But first, recall that a ReLU is defined as $h = \max(0, a)$ where $a = Wx + b$.

One major benefit is the reduced likelihood of the gradient vanishing. This arises when $a > 0$. In this regime the gradient of $h$ with respect to $a$ has a constant value of 1. In contrast, the gradient of a sigmoid becomes increasingly small as the absolute value of $a$ increases. The constant gradient of ReLUs results in faster learning.
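To make the contrast concrete, here is a minimal sketch in plain Python. The function names `sigmoid_grad` and `relu_grad` are just illustrative helpers, and both gradients are taken with respect to the pre-activation $a$, not the weights.

```python
import math

def sigmoid(a):
    """Logistic sigmoid of the pre-activation a."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    """d/da sigmoid(a) = sigmoid(a) * (1 - sigmoid(a)); peaks at 0.25 when a = 0."""
    s = sigmoid(a)
    return s * (1.0 - s)

def relu_grad(a):
    """d/da max(0, a): 1 for a > 0, and 0 otherwise (taking 0 at a = 0 by convention)."""
    return 1.0 if a > 0 else 0.0

# The ReLU gradient stays at 1 for any positive a; the sigmoid gradient shrinks quickly.
for a in (0.5, 2.0, 5.0, 10.0):
    print(f"a={a:5.1f}  relu'={relu_grad(a):.1f}  sigmoid'={sigmoid_grad(a):.6f}")
```

Note that the sigmoid gradient never exceeds 0.25, so a product of such factors across many layers shrinks quickly, whereas active ReLUs contribute a factor of 1.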

The other benefit of ReLUs is sparsity. Sparsity arises when $a \le 0$, since the unit then outputs exactly zero. The more such units there are in a layer, the sparser the resulting representation. Sigmoids, on the other hand, are always likely to generate some non-zero value, resulting in dense representations. Sparse representations seem to be more beneficial than dense representations.
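As a rough illustration of what "sparse" means here, the sketch below feeds the same pre-activations through both non-linearities and counts exact zeros. The standard-normal pre-activations are an assumption made purely for the demo; real networks will have a different distribution and a different fraction of inactive units.

```python
import math
import random

random.seed(0)  # fixed seed so the demo is repeatable

# Hypothetical pre-activations a = Wx + b for one layer, drawn from N(0, 1).
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_out = [max(0.0, a) for a in pre_activations]
sigmoid_out = [1.0 / (1.0 + math.exp(-a)) for a in pre_activations]

# Fraction of units whose output is exactly zero.
relu_zero_frac = sum(1 for h in relu_out if h == 0.0) / len(relu_out)
sigmoid_zero_frac = sum(1 for h in sigmoid_out if h == 0.0) / len(sigmoid_out)

print(f"ReLU outputs exactly zero:    {relu_zero_frac:.1%}")
print(f"Sigmoid outputs exactly zero: {sigmoid_zero_frac:.1%}")
```

With symmetric pre-activations roughly half of the ReLU outputs are zero, while the sigmoid outputs are small but non-zero.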

When you say the gradient, you mean with respect to weights or the input x? @DaemonMaker

With respect to the weights. Gradient-based learning algorithms always take the gradient with respect to the parameters of the learner, i.e. the weights and biases in a NN.

What do you mean by "dense" and "sparse" "representations"? Searching Google for "sparse representation neural networks" doesn't seem to come up with anything relevant.

Sparsity refers to several weights, or coefficients in regression, being equal to zero.

"Sparse representations seem to be more beneficial than dense representations." Could you provide a source or explanation?

The only paper that I can think of off the top of my head that discusses this would be Learning Deep Architectures for AI by Yoshua Bengio. See section 13.2. (https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf) However, I'm confident you'll find more if you look up the effects of L1-norms on the activations of a layer of an NN.

I don't understand how this answer is at all correct. The "reduced likelihood of the gradient to vanish" leaves something to be desired. The ReLU is ZERO for sufficiently small $x$. During learning, your gradients WILL vanish for certain neurons when you're in this regime. In fact, it's clearly unavoidable, because otherwise your network would be linear. Batch normalization mostly solves this. The answer doesn't even mention the most important reason: ReLUs and their gradients are extremely fast to compute, compared to a sigmoid.

Although perhaps I'm being too harsh, given that this answer was posted in 2014, and batch normalization came out the next year...

I'm new to the subject, but it's hard for me to see how ReLU would be appropriate if the goal is for the model to learn a continuous functional mapping, as opposed to, say, discrete classification. Wouldn't the result be that you're approximating a continuous function with a series of discrete line segments? This might be okay in some contexts, but in others, continuous differentials might be important.

Yes, that's right @GrantPetty. As long as the function of interest can be reasonably approximated by a piecewise linear function then ReLU will suffice. Where the function cannot be, or say the number of pieces is too great to achieve a reasonable approximation, then ReLU is not likely to be one's best choice.
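To illustrate the piecewise linear point, here is a small sketch that builds a sum of ReLUs interpolating a smooth function at evenly spaced knots. The helper names (`relu_interpolant`, `max_error`) and the target $f(x) = x^2$ on $[0, 1]$ are assumptions chosen for the example; the coefficients are set by hand here, not learned by training.

```python
def relu(z):
    return max(0.0, z)

def relu_interpolant(f, n_pieces, lo=0.0, hi=1.0):
    """Return g(x) = f(lo) + sum_i c_i * relu(x - k_i), which is piecewise
    linear and matches f at n_pieces + 1 evenly spaced knots on [lo, hi]."""
    knots = [lo + (hi - lo) * i / n_pieces for i in range(n_pieces + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_pieces)]
    # Each ReLU that switches on at a knot adds the change in slope at that knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_pieces)]
    bias = f(lo)

    def g(x):
        return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots))
    return g

def max_error(f, g, lo=0.0, hi=1.0, n_grid=1000):
    """Largest absolute gap between f and g on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / n_grid for i in range(n_grid + 1)]
    return max(abs(f(x) - g(x)) for x in xs)

def square(x):
    return x * x

for n in (4, 16, 64):
    err = max_error(square, relu_interpolant(square, n))
    print(f"{n:3d} ReLU pieces -> max error {err:.6f}")
```

The error shrinks as the number of pieces grows, which matches the intuition above: ReLU works well as long as a manageable number of linear pieces gives a good enough fit.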

To further add to what @DaemonMaker said: if you have lots of data, as in deep learning, then ReLU is likely to give you better results (as ReLU is a coarse-grained approximation), but if your input is just a small number, then the fine-grained approximation via sigmoid may be better.

Licensed under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

Michael B 3 years ago

An extra piece of the answer, to add to the **Sparse vs Dense performance debate**. Don't think about NNs for a moment; just think about linear algebra and matrix operations, because forward and backward propagation are a series of matrix operations. Now remember that there exist a lot of optimized operators for sparse matrices, so optimizing those operations in our network could dramatically improve the performance of the algorithm. I hope that helps some of you.
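Here is a toy sketch of that idea in plain Python: when the input to a layer is a ReLU output with many zeros, a matrix-vector product can skip the zero entries entirely. The function names and the small matrix are made up for illustration, and whether a real library or GPU kernel actually gains from activation sparsity depends on the implementation and on how sparse the activations are.

```python
def matvec_dense(W, h):
    """Ordinary matrix-vector product; does rows * cols multiplications."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def matvec_skip_zeros(W, h):
    """Same product, but only touches the non-zero entries of h.
    Returns the result and the number of multiplications performed."""
    nonzero = [(j, x) for j, x in enumerate(h) if x != 0.0]
    out = [sum(row[j] * x for j, x in nonzero) for row in W]
    return out, len(nonzero) * len(W)

# Hypothetical layer: h is the ReLU output of the previous layer.
pre = [-1.0, 2.0, -0.5, 4.0]
h = [max(0.0, a) for a in pre]          # [0.0, 2.0, 0.0, 4.0]
W = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.0, -1.0, 2.0]]

dense_result = matvec_dense(W, h)
sparse_result, n_mults = matvec_skip_zeros(W, h)

print(dense_result, sparse_result)
print(f"multiplications: dense={len(W) * len(h)}, skipping zeros={n_mults}")
```

Both versions give the same result, but the second does half the multiplications here because half of the ReLU outputs are zero.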