What makes grep consider a file to be binary?
I have some database dumps from a Windows system on my box. They are text files. I'm using Cygwin to grep through them. These appear to be plain text files; I open them with text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it says
binary file foo.txt matches.
I have noticed that the files contain some ASCII `NUL` characters, which I believe are artifacts from the database dump.
So what makes grep consider these files to be binary? The `NUL` character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?
If there is a `NUL` character anywhere in the file, grep will consider it a binary file.
There might be a workaround like this: `cat file | tr -d '\000' | yourgrep`, which eliminates all NUL characters first and then searches through the file.
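Here is a minimal sketch of that workaround, assuming a POSIX shell with `tr`, `grep`, and `mktemp` available. The sample file contents and the search pattern `matches` are made up for illustration. The second command uses the `-a` (`--text`) option, which should work with GNU grep but may not exist in every grep implementation.

```shell
# Create a throwaway sample file with a NUL byte on the line we want to find.
tmp=$(mktemp)
printf 'first line\nfoo\000bar matches here\nlast line\n' > "$tmp"

# Workaround from the answer: strip NUL bytes, then grep the cleaned stream.
# Reading the file with '<' avoids the extra 'cat' process.
stripped=$(tr -d '\000' < "$tmp" | grep 'matches')

# Alternative (assumes GNU grep): -a treats the input as text.
# The matched line still contains the NUL, so strip it before storing it.
as_text=$(grep -a 'matches' "$tmp" | tr -d '\000')

printf '%s\n' "$stripped"
printf '%s\n' "$as_text"
rm -f "$tmp"
```

Note that deleting the NUL bytes changes the data being searched, so a pattern that spans the position of a removed NUL can match where it would not have matched in the original file.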
@derobert: actually, on some (older) systems, grep sees the lines, but its output truncates each matching line at the first `NUL` (probably because it calls C's printf and passes it the matched line?). On such a system, `grep cmd .sh_history` will return as many empty lines as there are lines matching 'cmd', since each line of sh_history has a specific format with a `NUL` at the beginning. (But your comment "at least on GNU grep" probably holds true. I don't have one at hand right now to test, but I expect it handles this nicely.)
+1 for using -a / --text with GNU grep, because you can mix this easily with recursive search, e.g. `egrep -r -a mystring .` Thanks @derobert
Is the presence of a NUL character the only criterion? I doubt it. It's probably smarter than that. Anything falling outside the ASCII 32-126 range would be my guess, but we'd have to look at the source code to be sure.
My info was from the man page of the specific grep instance. Your comment about implementation is valid, source trumps docs.
I had a file which `grep` on cygwin considered binary because it had a long dash (0x96) instead of a regular ASCII hyphen/minus (0x2d). I guess this answer resolved the OP's issue, but it appears it is incomplete.
Thanks for the comment. I hadn't realized that different platforms would handle the issue differently.
Not true: grep will process the file line by line as strings, but will stop after the first NUL occurrence. Take a look at https://stackoverflow.com/questions/50992292/grep-not-parsing-the-whole-file
MiguelOrtiz and cp.engr could be right. In Windows Subsystem for Linux, grep treats any file that contains Chinese characters as a binary file, while grep in MobaXterm treats Chinese characters as plain text. Both of them treat a file containing `NUL` or `\0` as binary.
I have used grep with UTF-8, so it can handle a long dash in UTF-8. It may depend on the locale. It is definitely not a file system flag.
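Since the comments disagree about which bytes trigger the "binary" verdict, one way to investigate is to list the bytes in a file that are not plain printable ASCII. This is only a diagnostic sketch, assuming a POSIX shell with `tr` and `od`. The sample line is made up, and it uses byte 0x96 (octal 226), the long dash mentioned in an earlier comment. It does not reproduce grep's own detection logic, which may differ between versions and locales.

```shell
# Sample file: one line containing the single byte 0x96 between the digits.
tmp=$(mktemp)
printf 'pages 10\22620\n' > "$tmp"

# Delete tab, LF, CR and printable ASCII (octal 40-176), byte by byte.
# Whatever is left over is a candidate for what grep dislikes.
# od prints the leftovers as hex; the last tr removes od's spacing.
suspect=$(LC_ALL=C tr -d '\11\12\15\40-\176' < "$tmp" | od -An -tx1 | tr -d ' \n')

printf 'suspect bytes (hex): %s\n' "$suspect"
rm -f "$tmp"
```

If the list contains `00`, the NUL explanation applies. If it only contains bytes of 0x80 and above, the file's encoding probably does not match the current locale, which would be consistent with the locale-dependent behaviour described above.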