What is the difference between "sort -u" and "sort | uniq"?

  • Everywhere I see someone needing to get a sorted, unique list, they always pipe to sort | uniq. I've never seen any examples where someone uses sort -u instead. Why not? What's the difference, and why is it better to use uniq than the unique flag to sort?

  • Chris Down

    Correct answer

    7 years ago

    sort | uniq existed before sort -u, and is compatible with a wider range of systems, although almost all modern systems do support -u (it's in POSIX). It's mostly a throwback to the days when sort -u didn't exist; people tend not to change their methods if the way they know continues to work (just look at ifconfig vs. ip adoption).

    The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster internally, because it can do both operations at the same time and doesn't require IPC between uniq and sort. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.
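
    As a minimal sketch of the equivalence for plain, whole-line comparison, the snippet below runs both forms on the same input. The sample words and the variable names are made up for illustration.

    ```shell
    # Hypothetical sample input: five lines, three distinct values.
    input=$(printf '%s\n' pear apple pear banana apple)

    # One process: sort and drop duplicate lines in the same command.
    one_step=$(printf '%s\n' "$input" | sort -u)

    # Two processes: sort first, then let uniq collapse adjacent duplicates.
    two_step=$(printf '%s\n' "$input" | sort | uniq)

    # Both variables should now hold the same three lines.
    printf '%s\n' "$one_step"
    ```

    With no key options in play, both forms should give the same list; the difference is the extra process and the pipe between them.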

    On my system I consistently get results like this:

    $ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
    $ time sort -u /dev/shm/file >/dev/null
    
    real        0m0.500s
    user        0m0.767s
    sys         0m0.167s
    $ time sort /dev/shm/file | uniq >/dev/null
    
    real        0m0.772s
    user        0m1.137s
    sys         0m0.273s
    

    sort -u also doesn't mask the return code of sort, which may be important (modern shells offer ways to get at it, for example bash's $PIPESTATUS array, but this wasn't always true).
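
    A rough illustration of the masking, assuming the default shell behaviour where a pipeline reports the status of its last command (that is, pipefail is not enabled). The path is a placeholder that is assumed not to exist, and the variable names are illustrative.

    ```shell
    # Placeholder path, assumed not to exist, used only to make sort fail.
    missing=/nonexistent/input.txt

    # Single command: the exit status is sort's own, so the failure is visible.
    single_status=0
    sort -u "$missing" >/dev/null 2>&1 || single_status=$?

    # Pipeline: the reported status is that of uniq, which succeeds on
    # empty input, so sort's failure is hidden from the caller.
    pipeline_status=0
    sort "$missing" 2>/dev/null | uniq >/dev/null || pipeline_status=$?

    echo "sort -u: $single_status, sort | uniq: $pipeline_status"
    ```

    In bash, sort's status in the second form would still be available as ${PIPESTATUS[0]} immediately after the pipeline runs.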

    I tend to use `sort | uniq` because 9 times out of 10, I'm actually piping to `uniq -c`.

    Note that `sort -u` was part of 7th Edition UNIX, circa 1979. Versions of `sort` without support for `-u` are truly archaic, or were written without attention to the de facto standard before POSIX's de jure standard. See also the 2010 Stack Overflow question "Sort & uniq in Linux shell".

    +1 because of `ip`. It's 2016 and this post is from 2013, but I only learned about the `ip` command now.

    +1 for "9 times out of 10, I'm actually piping to `uniq -c`" (and maybe piping once more to `sort -nr | head`). I was wondering what the equivalent of `sort | uniq` is in Vim when I found out that Vim has a `:sort u` command. And TIL `sort -u` exists as well.
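
    The counting pipeline mentioned in these comments could look roughly like the sketch below. The input letters and the variable names are invented for illustration, and `head -n 2` is an arbitrary cut-off.

    ```shell
    # Hypothetical input: "a" appears three times, "b" twice, "c" once.
    words=$(printf '%s\n' b a c a b a)

    # sort groups equal lines, uniq -c prefixes each group with its count,
    # sort -nr orders by that count (largest first), head keeps the top two.
    top=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr | head -n 2)

    printf '%s\n' "$top"
    ```

    This is the case where `sort -u` cannot stand in for the pipeline, since it discards the duplicates that `uniq -c` needs to count.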

    Note that there is a difference between `sort -n | uniq` and `sort -n -u`. For example, lines that differ only in leading or trailing whitespace are treated as duplicates by `sort -n -u`, but not by `sort -n | uniq`! `echo -e 'test \n test' | sort -n -u` returns `test`, but `echo -e 'test \n test' | sort -n | uniq` returns both lines.

    Another problem with `sort -n -u` becomes apparent with `echo -e '14a-foo\n14b-bar\n15' | sort -n -u`: the `14b-bar` line is deleted! Not sure if this is a bug or not, though. This does not happen with `sort -n | uniq`. IMO you should never use `sort -n -u`; it only leads to trouble.
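
    The behaviour described in the last two comments follows from how `-u` decides equality: it keeps one line per set of lines whose sort keys compare equal, and with `-n` the key is only the leading number (lines with no leading number compare as 0). A small sketch of the difference, reusing the input from the comment above; the variable names are illustrative.

    ```shell
    input=$(printf '%s\n' 14a-foo 14b-bar 15)

    # -n compares only the leading number, and -u keeps one line per equal
    # key, so the two "14..." lines count as duplicates of each other.
    numeric_unique=$(printf '%s\n' "$input" | sort -n -u)

    # uniq compares whole lines, so nothing is dropped here.
    piped=$(printf '%s\n' "$input" | sort -n | uniq)

    # Count the lines that survive in each case.
    unique_count=$(printf '%s\n' "$numeric_unique" | grep -c '')
    piped_count=$(printf '%s\n' "$piped" | grep -c '')

    echo "sort -n -u: $unique_count lines, sort -n | uniq: $piped_count lines"
    ```

    So the two forms are only interchangeable when no key options such as `-n` change what counts as equal.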

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM