### What is the difference between linear regression and logistic regression?

When would you use each?

In the linear regression model the dependent variable $y$ is considered continuous, whereas in logistic regression it is categorical, i.e., discrete. In application, the former is used in regression settings while the latter is used for binary classification or multi-class classification (where it is called multinomial logistic regression).

Although written in a different context, it may help you to read my answer here: Difference between logit and probit models, which contains a lot of information about what's happening in logistic regression that may help you understand these better.

All the previous answers are right, but there are reasons you might favor a linear regression model even when your outcome is a dichotomy. I've written about these reasons here: http://statisticalhorizons.com/linear-vs-logistic

• DocBuckets Correct answer

8 years ago

Linear regression uses the general linear equation $Y=b_0+\sum(b_i X_i)+\epsilon$, where $Y$ is a continuous dependent variable and the independent variables $X_i$ are usually continuous (but can also be binary, e.g. when the linear model is used in a t-test) or take values in other discrete domains. $\epsilon$ is a term for the variance that is not explained by the model and is usually just called "error". Individual dependent values, denoted by $Y_j$, can be written by modifying the equation a little: $Y_j=b_0 + \sum{(b_i X_{ij})}+\epsilon_j$
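As a minimal sketch of fitting this equation with a single predictor, the closed-form least-squares estimates can be computed directly. The function name `fit_simple_ols` and the five data points are made up for illustration; they are not part of the original answer.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data (illustrative only): a continuous outcome y at five x values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_ols(xs, ys)
# The residuals play the role of the error term epsilon_j for each observation.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(f"b0={b0:.2f}, b1={b1:.2f}")
```

With an intercept in the model, the residuals sum to (numerically) zero, which is a quick sanity check on the fit.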

Logistic regression is another generalized linear model (GLM) procedure using the same basic formula, but instead of the continuous $Y$, it models the probability of a categorical outcome. In its simplest form, this means that we're considering just one outcome variable and two states of that variable: either 0 or 1.

The equation for the probability of $Y=1$ looks like this: $$P(Y=1) = {1 \over 1+e^{-(b_0+\sum{(b_iX_i)})}}$$
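The formula can be evaluated directly. The sketch below is illustrative only: the function name `probability_of_one` and the coefficient values are assumptions, not something from the answer.

```python
import math


def probability_of_one(b0, coefs, xs):
    """P(Y=1) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = b0 + sum(b * x for b, x in zip(coefs, xs))
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, for illustration only.
b0, coefs = -1.0, [0.5, 2.0]
print(probability_of_one(b0, coefs, [2.0, 0.0]))  # linear predictor is 0
print(probability_of_one(b0, coefs, [2.0, 1.0]))  # linear predictor is 2
```

Whatever the value of the linear predictor, the result stays between 0 and 1, which is what lets it be read as a probability.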

Your independent variables $X_i$ can be continuous or binary. The regression coefficients $b_i$ can be exponentiated to give you the change in odds of $Y$ per change in $X_i$, i.e., $Odds={P(Y=1) \over P(Y=0)}={P(Y=1) \over 1-P(Y=1)}$ and ${\Delta Odds}= e^{b_i}$. $\Delta Odds$ is called the odds ratio, $Odds(X_i+1)\over Odds(X_i)$. In English, you can say that the odds of $Y=1$ increase by a factor of $e^{b_i}$ per unit change in $X_i$.
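Because the odds equal $e^{b_0+\sum(b_i X_i)}$, raising one $X_i$ by a unit multiplies the odds by $e^{b_i}$ regardless of the starting value. A quick numerical check for a one-predictor model, with hypothetical coefficients and a helper name (`odds_of_one`) chosen only for this sketch:

```python
import math


def odds_of_one(b0, b1, x):
    """Odds of Y=1 under a one-predictor logistic model: P / (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.7

for x in (0.0, 2.0, 5.0):
    ratio = odds_of_one(b0, b1, x + 1.0) / odds_of_one(b0, b1, x)
    # The ratio does not depend on x; it matches exp(b1) up to rounding.
    print(x, round(ratio, 6), round(math.exp(b1), 6))
```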

Example: If you wanted to see how body mass index predicts blood cholesterol (a continuous measure), you'd use linear regression as described at the top of my answer. If you wanted to see how BMI predicts the odds of being a diabetic (a binary diagnosis), you'd use logistic regression.

This looks like a fine answer, but could you explain what the $\epsilon_i$ stand for and--in particular--why you include them *within* the summations? (What is being summed over, anyway?)

It looks to me, Bill, that he meant to write i.e. (the Latin abbreviation for "that is") rather than e.i.

But the $\epsilon_i$ in the summation of the exponent shouldn't be there. It looks like the noise term in the model was accidentally carried there. The only summing should be over the $b_i$ that represent the $p$ coefficients for the $p$ covariates.

Oops, you're right. Fixing it now.

There's an error in your expression for $P(Y=1)$. You should have $$P(Y=1) = \frac{1}{1 + \exp \{-X \boldsymbol{\beta} \} },$$ not $$P(Y=1) = \frac{1}{1 + \exp \{ -(X \boldsymbol{\beta}+\varepsilon) \} }$$ The randomness in a logistic regression model comes from the fact that these are Bernoulli trials, not from there being errors in the success probabilities (which is how you've written it).

For logistic regression, shouldn't the GLM stand for generalized linear model, not general?

It's unclear to me why logistic regression is necessarily binary while linear is continuous. Any explanation?

@samthebrand logistic regression is not binary per se. It can be used to model data with a binary response via probabilities that range between 0 and 1. Going to shamelessly plug my blog post on this which should clear your confusion.

• Linear Regression is used to establish a relationship between Dependent and Independent variables, which is useful for estimating the resulting value of the dependent variable when the independent variables change. For example:

Using a Linear Regression, the relationship between Rain (R) and Umbrella Sales (U) is found to be - U = 2R + 5000

This equation says that each additional 1mm of Rain raises demand by 2 umbrellas on top of a baseline of 5000, so 1mm of Rain corresponds to a demand for 5002 umbrellas. So, using Simple Regression, you can estimate the value of your dependent variable.
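Read as a function, the fitted line U = 2R + 5000 turns a rainfall figure into a predicted sales figure. A tiny sketch (the function name is hypothetical):

```python
def predicted_umbrella_sales(rain_mm):
    """Evaluate the fitted line U = 2R + 5000 from the example."""
    return 2 * rain_mm + 5000


for rain_mm in (0, 1, 10):
    print(rain_mm, predicted_umbrella_sales(rain_mm))
```

The intercept (5000) is the predicted demand with no rain, and the slope (2) is the change in demand per extra millimetre.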

Logistic Regression, on the other hand, is used to ascertain the probability of an event, and this event is captured in binary format, i.e. 0 or 1.

Example - I want to ascertain if a customer will buy my product or not. For this, I would run a Logistic Regression on the (relevant) data and my dependent variable would be a binary variable (1=Yes; 0=No).

In terms of graphical representation, Linear Regression gives a straight line as output once the values are plotted on a graph, whereas Logistic Regression gives an S-shaped curve.

Reference from Mohit Khurana.

Re: "Linear Regression is used to establish a relationship between Dependent and Independent variables" - this is also true about logistic regression - it's just that the dependent variable is binary.

Logistic Regression isn't only for predicting a binary event ($2$ classes). It can be generalized to $k$ classes (multinomial logistic regression)

• The differences have been settled by DocBuckets and Pardis, but I want to add one way to compare their performance that has not been mentioned.

Linear regression is usually fit by minimizing the squared error between the model and the data, so large errors are penalized quadratically. Logistic regression is just the opposite: using the logistic loss function causes large errors to be penalized by an amount that approaches a constant asymptotically.

Consider linear regression on categorical {0,1} outcomes to see why this is a problem. If your model predicts the outcome is 38 when the truth is 1, you've lost nothing. Linear regression would try to reduce that 38; logistic wouldn't (as much).
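To put numbers on the 38-versus-1 case, here is a rough sketch comparing the two penalties. It assumes the value 38 is the raw prediction for squared error and the linear predictor (log-odds) for the logistic loss; the function names are made up for this illustration.

```python
import math


def squared_error(y, prediction):
    """Least-squares penalty used when fitting linear regression."""
    return (y - prediction) ** 2


def logistic_loss(y, score):
    """Logistic (log) loss for y in {0, 1}; score is the linear predictor (log-odds)."""
    margin = score if y == 1 else -score
    # log(1 + exp(-margin)), written so exp() cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


y_true = 1
for value in (1, 5, 38):
    print(value, squared_error(y_true, value), logistic_loss(y_true, value))
```

At 38 the squared error is 1369, while the logistic loss is essentially zero, so only the least-squares fit has any incentive to pull that value back toward 1.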

What are, then, the situations/cases that _are_ penalized in a logistic regression, i.e., in what cases would we have a poor fit?

Just the opposite: whenever larger deviations from the fit actually do incur worse results. For instance, logistic regression is good at keeping you hitting the dart board, but can't make a bullseye look nice. Or, similarly, it thinks that a near miss of the board is the same as sticking your neighbor.

Great answer. Has any research been done on how much it hurts the model's performance? I mean, if linear regression were used to predict a response in {0,1} instead of logistic regression.

Licensed under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM