What happens if the explanatory and response variables are sorted independently before regression?

  • Suppose we have a data set $(X_i,Y_i)$ with $n$ points. We want to perform a linear regression, but first we sort the $X_i$ values and the $Y_i$ values independently of each other, forming a new data set that pairs the $i$-th smallest $X$ with the $i$-th smallest $Y$, i.e. $(X_{(i)},Y_{(i)})$. Is there any meaningful interpretation of the regression on the new data set? Does this have a name?

    I imagine this is a silly question, so I apologize; I'm not formally trained in statistics. In my mind this completely destroys our data and the regression is meaningless. But my manager says he gets "better regressions most of the time" when he does this (here "better" means more predictive). I have a feeling he is deceiving himself.

    EDIT: Thank you for all of your nice and patient examples. I showed him the examples by @RUser4512 and @gung and he remains staunch. He's becoming irritated and I'm becoming exhausted. I feel crestfallen. I will probably begin looking for other jobs soon.

    *But my manager says he gets "better regressions most of the time" when he does this.* Oh god...

    I'm having a hard time convincing him. I drew a picture showing how the regression line is completely different. But he seems to like the results he sees. I'm trying to tell him it is a coincidence. FML

    I'm really so embarrassed that I even asked this but I can't seem to convince him with counterexamples or math of any kind. He "has an intuition" that he can do this with his particular data set.

    There's certainly no reason for *you* to feel embarrassed!

    "Science is whatever we want it to be." - Dr. Leo Spaceman.

    If the regression is being used to predict on new data, it's easy to see by holding out a test set that this will make the regression much *less* predictive – I don't have time to construct an example right now, but that may be more convincing.

    @Dougal, I essentially do that below. (Nb, I wasn't sure exactly what would be the "correct" k-fold CV for this case, so I just used completely new data from the same DGP.)

    In addition to excellent points already made: If this is such a good idea, why isn't it in courses and texts?

    @NickCox because nobody dared to point out this brilliant idea :)

    @Tim I am being partly frivolous and I imagine you are too. But results from this method wouldn't be replicable unless it was explained. People would assume that the advocate was incompetent or a cheat. Actually, that's not ruled out here either.

    Who in the world do you work for?

    This idea has to compete with another I have encountered: If your sample is small, just bulk it up with several copies of the same data.

    @dsaxton We're all curious, but this is one case where the anonymity of the OP is likely to be crucial.

    You should tell your boss you have a better idea. Instead of using the actual data, just generate your own, because it'll be easier to model.

    @gung Oops, I skimmed your answer and didn't notice the predictive error histograms. :)

    The manager should try nonparametric stats with the approach and see if the results "improve" even more (edit: intense sarcasm implied).

    A very simple counterexample (beyond the randomized set) would be a data set where X_i = -k Y_i. Sorting the values independently would pair them as X_i = k Y_i, which is completely incorrect.
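
    A minimal R sketch of that counterexample (the slope k = 2, the sample size, and the seed are arbitrary choices for illustration):

    set.seed(1)
    k = 2
    x = rnorm(20)
    y = -k*x                         # exact negative relationship: the true slope is -2
    cor(x, y)                        # [1] -1
    cor(sort(x), sort(y))            # close to +1: the direction of the relationship flips
    coef(lm(sort(y) ~ sort(x)))      # fitted slope is near +2 instead of -2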

    Actually I can conceive of some situations where this *might* do reasonably well -- e.g. when there are unmodelled predictor variables of just the right sort (however, I seriously doubt this will be the case). There may be some traction with your boss in investigating the out-of-sample properties of this approach. For example, how does it perform (compared to ordinary regression) when you do cross-validation?

    "I will probably begin looking for other jobs soon." you should look for other jobs now!

    That's not a regression, it's a Q-Q plot :P

    How is it that people who are clearly incompetent end up being employed and in charge? What is their secret?

    This is a great question because it really gets to how to convince somebody of something when they don't fully understand what is going on. I am not convinced that the manager will be convinced by pictures or notation (I figure his counterarguments would always be "but why _can't_ you make X and Y independent?"). I'd almost go so far as to think an appeal to (technical) authority would be appropriate here (the expert, the one doing the work, has more experience with these numbers than the manager).

    Hi @arbitraryuser. Great question with many good answers. Your edit is telling re: your manager becoming frustrated. You might want to see our sister site Workplace.SE for approaches to convincing your boss of your point.

    People like to deceive themselves, and often become irritable when that deception is noted. A really important skill to learn in your career is how to gently counter that deception (try channeling the best elementary school teacher you ever knew). Another important skill is identifying when that deception is unshakable and avoiding those situations...

    I am too lazy myself to do it, but `R` has a repository of data sets that I think could make this point much stronger: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
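
    Along those lines, here is a rough sketch with the built-in `cars` data set (the choice of data set, the 25/25 split, the seed, and all variable names are arbitrary; the sorted fit should usually predict worse on the held-out rows, though with only 25 test points the gap will vary with the split):

    set.seed(42)
    idx    = sample(nrow(cars), 25)                  # random train/test split of the 50 rows
    train  = cars[idx, ];  test = cars[-idx, ]
    m.u    = lm(dist ~ speed, data=train)            # ordinary regression on the training half
    sx     = sort(train$speed);  sy = sort(train$dist)
    m.s    = lm(sy ~ sx)                             # regression on independently sorted values
    pred.u = predict(m.u, newdata=test)
    pred.s = predict(m.s, newdata=data.frame(sx=test$speed))
    mean(abs(pred.u - test$dist))                    # held-out absolute error, ordinary fit
    mean(abs(pred.s - test$dist))                    # held-out absolute error, sorted fit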

    Instead of sorting the X values, why not use two copies of the Y values? That is, instead of using (x_i, y_i), use (y_i, y_i)! Guaranteed to get a high R^2 value or your money back!

    You should also sort the bits in each Xi and Yi value for even "better regressions".

    You should quit your job immediately. Your company is probably doomed.

    The problem is that it finds correlation in the artificially sorted data, not the actual data. It can't predict the next value y_i given x_i; all it predicts is the y_i corresponding to x_i _after reordering_. This is totally pointless. The points in your boss's graph don't correspond to actual data points.
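
    To make that point concrete, here is a tiny illustration (the numbers are made up purely for the example):

    x = c(1, 2, 3);  y = c(5, 9, 2)         # the observed pairs are (1,5), (2,9), (3,2)
    data.frame(x=sort(x), y=sort(y))        # sorted pairing: (1,2), (2,5), (3,9)
    # none of those sorted "pairs" is an observation that actually occurred,
    # so a line fit to them says nothing about how y responds to x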

    There must be a Dilbert strip about your manager. Find it, print it, and leave it on your desk the day you leave. P.S. http://www.de.ufpe.br/~cribari/dilbert_2.gif

    @NickCox I agree with the sentiment that there's sense in protecting the OP's anonymity, but I think it would be doing humanity an enormous service for this manager to be unmasked and publicly shamed. Public shaming will be much more likely to convince him he's wrong than simulations in R. I think it's important that the world knows never to put this manager in charge of anything involving numbers ever again.

    @DavidM.Perlman In turn I agree with the sentiment. But pick your favourite case where you think the people you disagree with are just wrong, period, no discussion necessary, e.g. the opposite side from you on global warming, immigration, whatever. Public criticism just entrenches attitudes. This person has already demonstrated immunity to statistical reasoning.

    "_sigh_ ... because that's not the data."

    It's surprising that this question has been asked. It's so obvious that the resultant data would be meaningless!

  • I'm not sure what your boss thinks "more predictive" means. Many people incorrectly believe that lower $p$-values mean a better / more predictive model. That is not necessarily true (this being a case in point). However, independently sorting both variables beforehand will virtually guarantee a lower $p$-value. On the other hand, we can assess the predictive accuracy of a model by comparing its predictions to new data that were generated by the same process. I do that below in a simple example (coded with R).

    options(digits=3)                       # for cleaner output
    set.seed(9149)                          # this makes the example exactly reproducible
    
    B1 = .3
    N  = 50                                 # 50 data
    x  = rnorm(N, mean=0, sd=1)             # standard normal X
    y  = 0 + B1*x + rnorm(N, mean=0, sd=1)  # cor(x, y) = .31
    sx = sort(x)                            # sorted independently
    sy = sort(y)
    cor(x,y)    # [1] 0.309
    cor(sx,sy)  # [1] 0.993
    
    model.u = lm(y~x)
    model.s = lm(sy~sx)
    summary(model.u)$coefficients
    #             Estimate Std. Error t value Pr(>|t|)
    # (Intercept)    0.021      0.139   0.151    0.881
    # x              0.340      0.151   2.251    0.029  # significant
    summary(model.s)$coefficients
    #             Estimate Std. Error t value Pr(>|t|)
    # (Intercept)    0.162     0.0168    9.68 7.37e-13
    # sx             1.094     0.0183   59.86 9.31e-47  # wildly significant
    
    u.error = vector(length=N)              # these will hold the output
    s.error = vector(length=N)
    for(i in 1:N){
      new.x      = rnorm(1, mean=0, sd=1)   # data generated in exactly the same way
      new.y      = 0 + B1*new.x + rnorm(1, mean=0, sd=1)  # a single new y from the new x
      pred.u     = predict(model.u, newdata=data.frame(x=new.x))
      pred.s     = predict(model.s, newdata=data.frame(x=new.x))
      u.error[i] = abs(pred.u-new.y)        # these are the absolute values of
      s.error[i] = abs(pred.s-new.y)        #  the predictive errors
    };  rm(i, new.x, new.y, pred.u, pred.s)
    u.s = u.error-s.error                   # negative values mean the original
                                            # yielded more accurate predictions
    mean(u.error)  # [1] 1.1
    mean(s.error)  # [1] 1.98
    mean(u.s<0)    # [1] 0.68
    
    
    windows()                               # open a new plot window (Windows-only; use dev.new() on other platforms)
      layout(matrix(1:4, nrow=2, byrow=TRUE))
      plot(x, y,   main="Original data")
      abline(model.u, col="blue")
      plot(sx, sy, main="Sorted data")
      abline(model.s, col="red")
      h.u = hist(u.error, breaks=10, plot=FALSE)
      h.s = hist(s.error, breaks=9,  plot=FALSE)
      plot(h.u, xlim=c(0,5), ylim=c(0,11), main="Histogram of prediction errors",
           xlab="Magnitude of prediction error", col=rgb(0,0,1,1/2))
      plot(h.s, col=rgb(1,0,0,1/4), add=TRUE)
      legend("topright", legend=c("original","sorted"), pch=15, 
             col=c(rgb(0,0,1,1/2),rgb(1,0,0,1/4)))
      dotchart(u.s, color=ifelse(u.s<0, "blue", "red"), lcolor="white",
               main="Difference between predictive errors")
      abline(v=0, col="gray")
      legend("topright", legend=c("u better", "s better"), pch=1, col=c("blue","red"))
    

    [Figure: four panels produced by the code above, titled "Original data", "Sorted data", "Histogram of prediction errors", and "Difference between predictive errors"]

    The upper left plot shows the original data. There is some relationship between $x$ and $y$ (viz., the correlation is about $.31$). The upper right plot shows what the data look like after independently sorting both variables: the strength of the correlation has increased substantially (it is now about $.99$). However, in the lower plots we see that the distribution of predictive errors is much closer to $0$ for the model trained on the original (unsorted) data. The mean absolute predictive error for the model that used the original data is $1.1$, whereas the mean absolute predictive error for the model trained on the sorted data is $1.98$, nearly twice as large. That means the sorted-data model's predictions are much further from the correct values.

    The plot in the lower right quadrant is a dot plot. It displays the difference between the predictive error with the original data and the predictive error with the sorted data, for each simulated new observation, so you can compare the two corresponding predictions directly. Blue dots to the left are cases where the original data yielded a prediction closer to the new $y$-value, and red dots to the right are cases where the sorted data yielded the better prediction. The model trained on the original data gave the more accurate prediction $68\%$ of the time.


    The degree to which sorting will cause these problems is a function of the linear relationship that exists in your data. If the correlation between $x$ and $y$ were $1.0$ already, sorting would have no effect and thus not be detrimental. On the other hand, if the correlation were $-1.0$, the sorting would completely reverse the relationship, making the model as inaccurate as possible. If the data were completely uncorrelated originally, the sorting would have an intermediate, but still quite large, deleterious effect on the resulting model's predictive accuracy. Since you mention that your data are typically correlated, I suspect that has provided some protection against the harms intrinsic to this procedure. Nonetheless, sorting first is definitely harmful. To explore these possibilities, we can simply re-run the above code with different values for B1 (using the same seed for reproducibility) and examine the output (a small helper function that automates these re-runs is sketched after the outputs below):

    1. B1 = -5:

      cor(x,y)                            # [1] -0.978
      summary(model.u)$coefficients[2,4]  # [1]  1.6e-34  # (i.e., the p-value)
      summary(model.s)$coefficients[2,4]  # [1]  1.82e-42
      mean(u.error)                       # [1]  7.27
      mean(s.error)                       # [1] 15.4
      mean(u.s<0)                         # [1]  0.98
      
    2. B1 = 0:

      cor(x,y)                            # [1] 0.0385
      summary(model.u)$coefficients[2,4]  # [1] 0.791
      summary(model.s)$coefficients[2,4]  # [1] 4.42e-36
      mean(u.error)                       # [1] 0.908
      mean(s.error)                       # [1] 2.12
      mean(u.s<0)                         # [1] 0.82
      
    3. B1 = 5:

      cor(x,y)                            # [1] 0.979
      summary(model.u)$coefficients[2,4]  # [1] 7.62e-35
      summary(model.s)$coefficients[2,4]  # [1] 3e-49
      mean(u.error)                       # [1] 7.55
      mean(s.error)                       # [1] 6.33
      mean(u.s<0)                         # [1] 0.44
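
    If it helps to explore other slopes quickly, the simulation above can be wrapped in a small helper function. This is a sketch of mine rather than part of the original answer; the function name and defaults are arbitrary, and because the random draws happen in a different order the numbers will differ slightly from the outputs listed above.

    compare.sorted = function(B1, N=50, seed=9149){
      set.seed(seed)
      x  = rnorm(N, mean=0, sd=1)
      y  = 0 + B1*x + rnorm(N, mean=0, sd=1)
      sx = sort(x);  sy = sort(y)                   # sorted independently
      model.u = lm(y~x);  model.s = lm(sy~sx)
      u.error = vector(length=N);  s.error = vector(length=N)
      for(i in 1:N){
        new.x      = rnorm(1, mean=0, sd=1)         # new data from the same DGP
        new.y      = 0 + B1*new.x + rnorm(1, mean=0, sd=1)
        u.error[i] = abs(predict(model.u, newdata=data.frame(x=new.x))  - new.y)
        s.error[i] = abs(predict(model.s, newdata=data.frame(sx=new.x)) - new.y)
      }
      c(cor.xy=cor(x,y), u.err=mean(u.error), s.err=mean(s.error), u.better=mean(u.error<s.error))
    }
    round(sapply(c(-5, 0, .3, 5), compare.sorted), 3)   # columns correspond to B1 = -5, 0, .3, 5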
      

    Your answer makes a very good point, but perhaps not as clearly as it could and should. It's not necessarily obvious to a layperson (like, say, the OP's manager) what all those plots at the end (never mind the R code) actually show and imply. IMO, your answer could really use an explanatory paragraph or two.

    Thanks for your comment, @IlmariKaronen. Can you suggest things to add? I tried to make the code as self-explanatory as possible, & commented it extensively. But I may no longer be able to see these things with the eyes of someone who isn't familiar w/ these topics. I will add some text to describe the plots at the bottom. If you can think of anything else, please let me know.

    +1 This still is the sole answer that addresses the situation proposed: when two variables *already exhibit some positive association,* it nevertheless is an error to regress the independently sorted values. All the other answers assume there is no association or that it is actually negative. Although they are good examples, since they don't apply they won't be convincing. What we still lack, though, is a *gut-level intuitive real-world example* of data like those simulated here where the nature of the mistake is embarrassingly obvious.

    +1 for not being swayed by orthodoxy and using "=" for assignment in R.

    @dsaxton, I use `<-` sometimes, but my goal on CV is to write R code as close to pseudocode as possible so that it is more readable for people who aren't familiar w/ R. `=` is pretty universal among programming languages as an assignment operator.
