Looping through files with spaces in the names?

  • I wrote the following script to diff the files in two directories that contain the same set of files:

    #!/bin/bash
    
    for file in `find . -name "*.csv"`  
    do
         echo "file = $file";
         diff $file /some/other/path/$file;
         read char;
    done
    

    I know there are other ways to achieve this. Curiously though, this script fails when the file names contain spaces. How can I deal with this?

    Example output of find:

    ./zQuery - abc - Do Not Prompt for Date.csv
    

    I disagree that this is a duplicate. The accepted answer explains how to loop over filenames with spaces; that has nothing to do with "why is looping over find's output bad practice". I found this question (not the other) because I need to loop over filenames with spaces, as in: for file in $LIST_OF_FILES; do ... where $LIST_OF_FILES is not the output of find; it's just a list of filenames (separated by newlines).

    @CarloWood - file names can include newlines, so your question is rather unique: looping over a list of filenames that can contain spaces but not newlines. I think you're going to have to use the IFS technique, to indicate that the break occurs at '\n'

    @Diagon- woah, I never realized that file names are allowed to contain newlines. I use mostly (only) linux/UNIX and there even spaces are rare; I certainly never in my entire life saw newlines being used :p. They might as well forbid that imho.

    @CarloWood - filenames are terminated by a null byte ('\0', which bash treats the same as ''). Any other character is acceptable, except '/', which separates path components.

  • Mikel

    Correct answer, 10 years ago

    Short answer (closest to your script, but handles spaces)

    OIFS="$IFS"
    IFS=$'\n'
    for file in `find . -type f -name "*.csv"`  
    do
         echo "file = $file"
         diff "$file" "/some/other/path/$file"
         read line
    done
    IFS="$OIFS"
    

    Better answer (also handles wildcards and newlines in file names)

    find . -type f -name "*.csv" -print0 | while IFS= read -r -d '' file; do
        echo "file = $file"
        diff "$file" "/some/other/path/$file"
        read line </dev/tty
    done
    

    Best answer (based on Gilles' answer)

    find . -type f -name '*.csv' -exec sh -c '
      file="$0"
      echo "$file"
      diff "$file" "/some/other/path/$file"
      read line </dev/tty
    ' {} ';'
    

    Or even better, to avoid running one sh per file:

    find . -type f -name '*.csv' -exec sh -c '
      for file do
        echo "$file"
        diff "$file" "/some/other/path/$file"
        read line </dev/tty
      done
    ' sh {} +
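
    To see what the + form buys you, here is a small sketch that counts how often sh is started with each terminator. The temporary directory and the three file names are made up for the demonstration; the trailing sh is only the placeholder that becomes $0 inside the inline script.

    ```shell
    # Hypothetical sandbox: three made-up CSV files in a temporary directory.
    tmpdir=$(mktemp -d)
    touch "$tmpdir/a.csv" "$tmpdir/b c.csv" "$tmpdir/d.csv"

    # ';' starts one sh per file, so "run" is printed once for every file.
    per_file=$(find "$tmpdir" -type f -name '*.csv' -exec sh -c 'echo run' sh {} ';' | grep -c run)

    # '+' hands as many names as fit to a single sh, so "run" is printed once here.
    batched=$(find "$tmpdir" -type f -name '*.csv' -exec sh -c 'echo run' sh {} + | grep -c run)

    echo "sh started per file: $per_file, batched: $batched"
    rm -rf "$tmpdir"
    ```

    With only three files everything fits into one batch; with a very large number of files, find may still start sh more than once.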
    

    Long answer

    You have three problems:

    1. By default, the shell splits the output of a command on spaces, tabs, and newlines.
    2. Filenames could contain wildcard characters, which would get expanded.
    3. What if there is a directory whose name ends in .csv?

    1. Splitting only on newlines

    To figure out what to set file to, the shell has to take the output of find and interpret it somehow, otherwise file would just be the entire output of find.

    The shell reads the IFS variable, which is set to <space><tab><newline> by default.

    Then it looks at each character in the output of find. As soon as it sees any character that's in IFS, it thinks that marks the end of the file name, so it sets file to whatever characters it saw until now and runs the loop. Then it starts where it left off to get the next file name, and runs the next loop, etc., until it reaches the end of output.

    So it's effectively doing this:

    for file in "./zQuery" "-" "abc" ...
    

    To tell it to only split the input on newlines, you need to do

    IFS=$'\n'
    

    before your for ... find command.

    That sets IFS to a single newline, so it only splits on newlines, and not spaces and tabs as well.

    If you are using sh or dash instead of ksh93, bash or zsh, you need to write IFS=$'\n' like this instead:

    IFS='
    '
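
    If you want to convince yourself of the difference, the following sketch counts how many words the shell produces from the example file name under each IFS setting. It assumes bash (for $'\n'); count_words is a helper invented for this illustration, not a standard command.

    ```shell
    # Assumes bash. The sample name is the one from the question.
    output='./zQuery - abc - Do Not Prompt for Date.csv'

    # Hypothetical helper: print how many words the unquoted expansion
    # of $1 produces under whatever IFS is currently in effect.
    count_words() {
        set -f          # disable globbing so only word splitting is measured
        set -- $1       # deliberately unquoted
        set +f
        echo "$#"
    }

    default_count=$(count_words "$output")               # splits on space, tab, newline
    newline_count=$(IFS=$'\n'; count_words "$output")    # splits on newline only

    echo "default IFS: $default_count words, newline-only IFS: $newline_count word"
    ```

    The IFS change in the second call happens inside the command substitution, so it does not leak into the rest of the script.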
    

    That is probably enough to get your script working, but if you want to handle some other corner cases properly, read on...

    2. Expanding $file without wildcards

    Inside the loop where you do

    diff $file /some/other/path/$file
    

    the shell tries to expand $file (again!).

    It could contain spaces, but since we already set IFS above, that won't be a problem here.

    But it could also contain wildcard characters such as * or ?, which would lead to unpredictable behavior. (Thanks to Gilles for pointing this out.)

    To tell the shell not to expand wildcard characters, put the variable inside double quotes, e.g.

    diff "$file" "/some/other/path/$file"
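
    A minimal sketch of the effect, using a made-up sandbox with one file that is literally named *.csv. count_args is a throwaway helper for this demonstration.

    ```shell
    # Hypothetical sandbox: two ordinary files plus one literally named "*.csv".
    tmpdir=$(mktemp -d)
    touch "$tmpdir/file1.csv" "$tmpdir/file2.csv" "$tmpdir/*.csv"

    # Hypothetical helper: print the number of arguments it received.
    count_args() { echo "$#"; }

    file='*.csv'
    unquoted=$(cd "$tmpdir" && count_args $file)      # the shell expands the pattern
    quoted=$(cd "$tmpdir" && count_args "$file")      # passed through as one literal name

    echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
    rm -rf "$tmpdir"
    ```

    Unquoted, the single name turns into every matching file in the directory; quoted, it stays exactly one argument.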
    

    The same problem could also bite us in

    for file in `find . -name "*.csv"`
    

    For example, if you had these three files

    file1.csv
    file2.csv
    *.csv
    

    (very unlikely, but still possible)

    It would be as if you had run

    for file in file1.csv file2.csv *.csv
    

    which will get expanded to

    for file in file1.csv file2.csv *.csv file1.csv file2.csv
    

    causing file1.csv and file2.csv to be processed twice.

    Instead, we have to do

    find . -name "*.csv" -print | while IFS= read -r file; do
        echo "file = $file"
        diff "$file" "/some/other/path/$file"
        read line </dev/tty
    done
    

    read reads lines from standard input, splits the line into words according to IFS and stores them in the variable names that you specify.

    Here, we're telling it not to split the line into words, and to store the line in $file.
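
    The effect of the empty IFS is easiest to see with a name that has leading and trailing spaces. This is only an illustration with a made-up name; the brackets are there to make the whitespace visible.

    ```shell
    # Made-up name with two leading and two trailing spaces.
    name_in='  padded name.csv  '

    # Default IFS: read strips leading and trailing whitespace from the line.
    trimmed=$(printf '%s\n' "$name_in" | { read -r name; printf '[%s]' "$name"; })

    # Empty IFS: no splitting and no trimming, the line is stored as-is.
    kept=$(printf '%s\n' "$name_in" | { IFS= read -r name; printf '[%s]' "$name"; })

    echo "default IFS: $trimmed"
    echo "empty IFS:   $kept"
    ```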

    Also note that read line has changed to read line </dev/tty.

    This is because inside the loop, standard input is coming from find via the pipeline.

    If we just did read line, it would consume the next file name from the pipe instead of waiting for the user, and some files would be skipped.

    /dev/tty is the terminal where the user is running the script from. Note that this will cause an error if the script is run via cron, but I assume this is not important in this case.
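
    Here is a rough sketch of the skipping problem, with printf standing in for find and four made-up names. Because the demonstration has to run without a terminal, the second loop redirects from /dev/null as a stand-in for /dev/tty; that substitution is an assumption of this example, not part of the script above.

    ```shell
    # Assumes bash. printf stands in for find and emits four made-up names.
    list_names() { printf '%s\n' a.csv b.csv c.csv d.csv; }

    # Bare read inside the loop: it takes every second name off the pipe.
    skipping=$(list_names | while IFS= read -r file; do
        echo "$file"
        read -r line
    done)

    # Redirected read: the pipe is left alone, so every name reaches the loop.
    # /dev/null is used here only so the sketch runs non-interactively.
    complete=$(list_names | while IFS= read -r file; do
        echo "$file"
        read -r line </dev/null || true
    done)

    echo "bare read saw:       $(echo $skipping)"
    echo "redirected read saw: $(echo $complete)"
    ```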

    Then, what if a file name contains newlines?

    We can handle that by changing -print to -print0 and using read -d '' on the end of a pipeline:

    find . -name "*.csv" -print0 | while IFS= read -r -d '' file; do
        echo "file = $file"
        diff "$file" "/some/other/path/$file"
        read line </dev/tty
    done
    

    This makes find put a null byte at the end of each file name. A null byte is the only character that can never appear in a file path, so this should handle all possible file names, no matter how weird.

    To get the file name on the other side, we use IFS= read -r -d ''.

    Where we used read above, it relied on the default line delimiter, a newline, but now find is using a null byte as the delimiter. In bash, you can't pass a NUL character in an argument to a command (not even a builtin), but bash understands -d '' as meaning NUL-delimited. So we use -d '' to make read use the same delimiter as find. Incidentally, -d $'\0' works as well: because bash does not support NUL bytes in strings, it treats $'\0' as the empty string.
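
    To check that the delimiter really matters, this sketch creates a file whose name contains a newline and counts how many "names" each loop sees. It assumes bash; the sandbox directory and file names are invented for the test.

    ```shell
    # Assumes bash. Hypothetical sandbox with three files, one of which
    # has a newline in the middle of its name.
    tmpdir=$(mktemp -d)
    touch "$tmpdir/plain.csv" "$tmpdir/with space.csv" "$tmpdir/with"$'\n'"newline.csv"

    # Newline-delimited: the name containing a newline is read as two entries.
    line_count=$(find "$tmpdir" -type f -name '*.csv' -print |
        while IFS= read -r file; do echo x; done | grep -c x)

    # NUL-delimited: each file is read exactly once.
    nul_count=$(find "$tmpdir" -type f -name '*.csv' -print0 |
        while IFS= read -r -d '' file; do echo x; done | grep -c x)

    echo "entries with -print: $line_count, entries with -print0: $nul_count"
    rm -rf "$tmpdir"
    ```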

    To be correct, we also add -r, which tells read not to treat backslashes in file names specially. For example, without -r, a backslash followed by a newline is removed, and \n is converted into n.
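
    A small illustration of what -r protects against, using a made-up name that contains a literal backslash followed by the letter n:

    ```shell
    # Made-up name: the characters a, backslash, n, b, then .csv
    name_in='a\nb.csv'

    # With -r the backslash is kept exactly as it appears in the name.
    with_r=$(printf '%s\n' "$name_in" | { IFS= read -r name; printf '%s' "$name"; })

    # Without -r the backslash is treated as an escape character and dropped.
    without_r=$(printf '%s\n' "$name_in" | { IFS= read name; printf '%s' "$name"; })

    echo "with -r:    $with_r"
    echo "without -r: $without_r"
    ```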

    A more portable way of writing this that doesn't require bash or zsh or remembering all the above rules about null bytes (again, thanks to Gilles):

    find . -name '*.csv' -exec sh -c '
      file="$0"
      echo "$file"
      diff "$file" "/some/other/path/$file"
      read line </dev/tty
    ' {} ';'
    

    3. Skipping directories whose names end in .csv

    find . -name "*.csv"
    

    will also match directories that are called something.csv.

    To avoid this, add -type f to the find command.

    find . -type f -name '*.csv' -exec sh -c '
      file="$0"
      echo "$file"
      diff "$file" "/some/other/path/$file"
      read line </dev/tty
    ' {} ';'
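
    The difference is easy to reproduce in a scratch directory. The directory name archive.csv and the file name data.csv below are invented for this sketch.

    ```shell
    # Hypothetical sandbox: one directory and one regular file, both ending in .csv.
    tmpdir=$(mktemp -d)
    mkdir "$tmpdir/archive.csv"
    touch "$tmpdir/data.csv"

    # Without -type f the directory is matched as well.
    all_matches=$(find "$tmpdir" -name '*.csv' | grep -c .)

    # With -type f only the regular file is left.
    files_only=$(find "$tmpdir" -type f -name '*.csv' | grep -c .)

    echo "without -type f: $all_matches, with -type f: $files_only"
    rm -rf "$tmpdir"
    ```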
    

    As glenn jackman points out, in the examples above, the commands to execute for each file are run in a subshell (for the pipelines) or in a separate sh process (for find -exec), so if you change any variables inside the loop, the changes are lost when the loop ends.

    If you need to set variables and have them still set at the end of the loop, you can rewrite it to use process substitution like this:

    i=0
    while IFS= read -r -d '' file; do
        echo "file = $file"
        diff "$file" "/some/other/path/$file"
        read line </dev/tty
        i=$((i+1))
    done < <(find . -type f -name '*.csv' -print0)
    echo "$i files processed"
    

    Note that if you try copying and pasting this at the command line, read line will consume the echo "$i files processed", so that command won't get run.

    To avoid this, you could remove read line </dev/tty and send the result to a pager like less.
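
    To compare the two forms side by side without any prompts, here is a stripped-down sketch. It assumes bash with its default settings (in particular, the lastpipe option is off), and uses printf with three made-up names in place of find.

    ```shell
    # Assumes bash with default options. printf emits three NUL-terminated names.
    emit_names() { printf '%s\0' a.csv 'b c.csv' d.csv; }

    # Pipeline: the loop runs in a subshell, so the increment is lost.
    piped=0
    emit_names | while IFS= read -r -d '' file; do
        piped=$((piped+1))
    done

    # Process substitution: the loop runs in the current shell, so the count survives.
    subst=0
    while IFS= read -r -d '' file; do
        subst=$((subst+1))
    done < <(emit_names)

    echo "after pipeline: $piped, after process substitution: $subst"
    ```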


    NOTES

    I removed the semi-colons (;) inside the loop. You can put them back if you want, but they are not needed.

    These days, $(command) is more common than `command`. This is mainly because it's easier to write $(command1 $(command2)) than `command1 \`command2\``.
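
    For comparison, here are both spellings of the same nested substitution. The path is made up; both lines take the name of the directory that contains it.

    ```shell
    # New style: nests without any escaping, and the inner result can be quoted.
    new_style=$(basename "$(dirname /some/other/path/data.csv)")

    # Old style: the inner backticks have to be escaped with backslashes.
    old_style=`basename \`dirname /some/other/path/data.csv\``

    echo "new style: $new_style, old style: $old_style"
    ```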

    read char doesn't really read a character. It reads a whole line so I changed it to read line.

    Putting `while` in a pipeline can cause problems because of the subshell it creates (for example, variables set in the loop body are not visible after the command completes). With bash, I would use input redirection and process substitution: `while read -r -d $'\0' file; do ...; done < <(find ... -print0)`

    Sure, or using a heredoc: `while read; do; done <

    @glenn jackman: I tried to add more explanation just now. Did I just make it better or worse?

    You don't need `IFS`, `-print0`, `while` and `read` if you use `find` to its full potential, as shown below in my solution.

    Your first solution will cope with any character except newline if you also turn off globbing with `set -f`.

    Yes, but then we'd have to restore it at the end of the loop. The first solution was intended to be simple, so I'm reluctant to change it. Now you made this comment, at least it's on record. Thanks. :-)

    tldr; `IFS=$'\n'`

    Thank you very much for `IFS=$'\n'` - this was crazy, handling a single file list (from file) with spaces in filenames in for/while was nearly impossible without it...

    The "best" answer is relative, and I would say it is whatever is most understandable/maintainable by the scripter. For me, that is a slight modification to the first one. Rather than saving/restoring IFS, you can use a subshell: `(IFS=$'\n'; for file in ... )`
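
    Putting the last few comments together, a possible sketch of the subshell variant with globbing switched off looks like this. It assumes bash and still cannot cope with newlines in file names; the sandbox and its three files are made up.

    ```shell
    # Assumes bash. Hypothetical sandbox, including one file literally named "*.csv".
    tmpdir=$(mktemp -d)
    touch "$tmpdir/a b.csv" "$tmpdir/*.csv" "$tmpdir/c.csv"

    # Newline-only IFS but globbing still on: "./*.csv" is expanded again.
    naive=$(
        cd "$tmpdir" || exit 1
        IFS=$'\n'
        for file in $(find . -type f -name '*.csv'); do echo x; done | grep -c x
    )

    # Newline-only IFS plus set -f: every name is visited exactly once.
    # Both settings live only inside the command substitution's subshell.
    seen=$(
        cd "$tmpdir" || exit 1
        IFS=$'\n'; set -f
        for file in $(find . -type f -name '*.csv'); do echo x; done | grep -c x
    )

    echo "globbing on: $naive iterations, globbing off: $seen iterations"
    rm -rf "$tmpdir"
    ```

    Because IFS and set -f are changed inside a subshell, there is nothing to restore afterwards.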

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM