What makes grep consider a file to be binary?

  • I have some database dumps from a Windows system on my box. They are text files. I'm using Cygwin to grep through them. These appear to be plain text files; I open them with text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it reports `Binary file foo.txt matches`.

    I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts from the database dump.

    So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?

    `--null-data` may be useful if `NUL` is the delimiter.
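    To illustrate the `--null-data` suggestion, here is a minimal sketch assuming GNU grep, where `-z` is the short form of `--null-data`. The record contents are made up for the example. With `-z`, grep splits input on NUL instead of newline, so each NUL-terminated record is matched on its own:

```shell
# Three NUL-terminated records (hypothetical contents).
# -z makes grep treat NUL, not newline, as the record separator,
# and matching records are written back out NUL-terminated.
# The final tr turns those NULs into newlines for display.
nul_record=$(printf 'alpha\000beta needle\000gamma\000' | grep -z 'needle' | tr '\000' '\n')
printf '%s\n' "$nul_record"
```

    This only helps when NUL really is the record delimiter; if the NULs are stray bytes inside ordinary lines, the other approaches in this thread fit better.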

  • bbaja42 Correct answer

    9 years ago

    If there is a NUL character anywhere in the file, grep treats it as a binary file.

    There might be a workaround like this: `cat file | tr -d '\000' | yourgrep`, which strips all NUL characters first and then searches the result.
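    The workaround can be tried on a throwaway file. This is a minimal sketch; the file name `dump.txt` and its contents are assumptions for the demo, and input redirection stands in for `cat`:

```shell
# Build a small sample file with a NUL byte in the middle of one line.
tmpdir=$(mktemp -d)
printf 'first line\nfoo\000bar needle\nlast line\n' > "$tmpdir/dump.txt"

# Delete every NUL byte, then search the cleaned stream.
stripped_match=$(tr -d '\000' < "$tmpdir/dump.txt" | grep 'needle')
printf '%s\n' "$stripped_match"

rm -rf "$tmpdir"
```

    Note that the NUL is removed from the matched line as well, so the output is not byte-for-byte what the file contains.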

    ... or use `-a`/`--text`, at least with GNU grep.

    @derobert: actually, on some (older) systems, grep sees the lines, but its output truncates each matching line at the first `NUL` (probably because it calls C's printf and gives it the matched line?). On such a system, `grep cmd .sh_history` returns as many empty lines as there are lines matching 'cmd', because each line of sh_history has a specific format with a `NUL` at the beginning. (But your comment "at least with GNU grep" probably holds true. I don't have one at hand right now to test, but I expect it handles this nicely.)

    +1 for using -a / --text with GNU grep, because you can mix this easily with recursive search, e.g. `egrep -r -a mystring .` Thanks @derobert
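    A small sketch of the `-a` approach combined with a recursive search. The directory layout and file name are invented for the demo, and the trailing `tr` is only there so the NUL byte does not end up in the shell variable:

```shell
# Sample tree with one file that contains a NUL byte.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf 'foo\000bar needle\n' > "$tmpdir/sub/dump.txt"

# -a: treat every file as text; -r: recurse; -h: omit the file name prefix.
text_match=$(grep -r -a -h 'needle' "$tmpdir" | tr -d '\000')
printf '%s\n' "$text_match"

rm -rf "$tmpdir"
```

    Without `-a`, GNU grep would only report that the binary file matches instead of printing the line.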

    Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure.

    My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.

    I had a file which `grep` on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.

    Thanks for the comment. I hadn't realized that different platforms would handle the issue differently.

    Not true: grep processes each line of the file as a string, but it stops after the first NUL it encounters. Take a look at https://stackoverflow.com/questions/50992292/grep-not-parsing-the-whole-file

    MiguelOrtiz and cp.engr could be right. In Windows Subsystem for Linux, grep treats any file that contains Chinese characters as a binary file, while grep in MobaXterm treats Chinese characters as plain text. Both treat a file containing `NUL` (`\0`) as binary.

    I have used grep with UTF-8, so it can handle a long dash encoded in UTF-8. It may depend on the locale. It is definitely not a file system flag.
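    The locale dependence can be checked with a file containing the 0x96 byte mentioned above. This is a sketch under one assumption: a reasonably recent GNU grep, which in the C locale accepts every byte as a valid character. The file name is made up, and other grep implementations or older versions may behave differently:

```shell
tmpdir=$(mktemp -d)
# Octal \226 is byte 0x96: a dash in Windows-1252, but not a valid UTF-8 sequence on its own.
printf 'pages 1 \226 9 needle\n' > "$tmpdir/cp1252.txt"

# Override the locale for this one command so no byte counts as an encoding error.
c_locale_match=$(LC_ALL=C grep 'needle' "$tmpdir/cp1252.txt")
printf '%s\n' "$c_locale_match"

rm -rf "$tmpdir"
```

    Running the same grep under a UTF-8 locale is where the "binary file matches" message would be expected, since the lone 0x96 byte is not valid UTF-8.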

    BSD grep (which is available on MacOS) also supports `-a` / `--text`

    Doing cat, pipe, tr, pipe AGAIN seems like a whole lot of wasted resources when you can just use the `grep --text` option and avoid the extra CPU and memory (two more processes, two pipes).

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM