Is space not allowed in a filename?

  • It is said that on Unix and Linux in general, you should avoid having spaces in a filename of a file (ordinary file, dir, link, device file, ...).

    But I do that all the time. For a filename with a space inside,

    • In Nautilus, the space character is shown as a space.
    • In Bash terminal, I either use \ to represent a space, or enclose the filename within a pair of double quotes.
    • In some applications' files (Nautilus; I'm not sure whether the OS does so too), the filename is written with the space replaced by %20.

    Is a space really not allowed in a filename?

    How do you use or deal with a space in a filename correctly?

    It's allowed but it's really, really annoying. There is no reason for it. Don't do it.

    You can also create a file named `-rf ~` (use `touch -- "-rf ~"`), but I wouldn't recommend it.
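    A minimal sketch of why the `--` matters there, run in a throwaway directory. The scratch directory from `mktemp -d` and the variable names are assumptions for illustration; without `--` or a `./` prefix, a name that starts with `-` would be parsed as options.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# "--" ends option parsing, so the next word is taken as a file name.
touch -- "-rf ~"

# Prefixing with ./ is the other common way to keep a leading "-" from
# being read as an option.
touch ./"-n"

# Count directory entries, one per line (neither name contains a newline).
created=$(ls -1 | wc -l | tr -d ' ')

# Remove them again, using both techniques.
rm -- "-rf ~"
rm ./"-n"
remaining=$(ls -1 | wc -l | tr -d ' ')

cd / && rmdir "$dir"
echo "created=$created remaining=$remaining"
```

    The `./` form has the advantage of also working with commands that do not understand `--`.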

    You can do it, it's allowed, like creating a self-destruct script called "cd" but you shouldn't do it. Your file already looks different in 3 different tools, isn't that bad enough?

    A really dangerous file name would be `; rm -rf * .*` (yes, that's an allowed filename, too). Now imagine having that file in your directory, and then entering a seemingly harmless `echo *` ... actually, this also shows that wildcards should be used with extreme care when accessing directories where others can create files (e.g. in `/tmp`).

    OK, I just noticed that glob expansion appears to do implicit quoting (well, at least in bash). However `eval echo *` will trigger the malicious code (while nobody would type that directly, the equivalent might happen indirectly through a badly written script).
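    The difference can be tried safely with a harmless stand-in for the malicious name. This is only a sketch: the scratch directory and the file name `; echo INJECTED` are made up for the demonstration.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# A harmless stand-in for the malicious name: it only echoes a word.
touch -- '; echo INJECTED'

# Plain glob expansion: the name is handed to echo as one argument and
# is never parsed as shell code.
plain=$(echo *)

# eval re-parses the expanded text, so the ";" now ends the first
# command and "echo INJECTED" runs as a second one.
evaled=$(eval echo *)

cd / && rm -rf -- "$dir"
printf 'plain:  %s\n' "$plain"
printf 'evaled: %s\n' "$evaled"
```

    In the first case the semicolon is just a character in the output; in the second, the word INJECTED is produced by a command that was hidden in the file name.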

    Not everyone shares the opinion that it's really, really annoying. And "There is no reason for it" is so obviously false that it doesn't need refuting. I gave in and learned how to handle spaces properly years ago, and for the most part it's really not a big deal.

    @snailboat Spaces are a symptom of the real problem, which is a lack of standardization. Unix filesystems allow file "names" to be nearly unrestricted binary blobs. The only illegal bytes are 0 and 47 (the `/` separator). Using all 254 remaining bytes opens the door to all manner of unspeakable eldritch "names". Obviously this is insane, but not everyone agrees on what "sane" is, and different characters will break different tools. The intersection of everyone's sanity is quite small.

    As already mentioned, there are other inconvenient characters besides space. Tab is equivalent to space in many places. Newline has the added "benefit" of breaking line-oriented output. Shell scripting with such filenames becomes a nightmare, e.g. because `xargs` is really hard to use unless you rely on non-standard features like the NUL delimiter, which in turn harms portability.
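    A small sketch of the `xargs` problem and the NUL-delimiter workaround. It assumes a `find` with `-print0` and an `xargs` with `-0` (both are extensions found in the GNU and BSD tools), and a scratch directory whose own path contains no whitespace.

```shell
dir=$(mktemp -d)
touch "$dir/plain" "$dir/with space" "$dir/with
newline"

# Newline-delimited: xargs splits on blanks and newlines, so the three
# files turn into more than three arguments.
naive=$(find "$dir" -type f | xargs sh -c 'echo $#' sh)

# NUL-delimited: each name arrives intact as exactly one argument.
safe=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo $#' sh)

rm -rf -- "$dir"
echo "naive=$naive safe=$safe"
```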

  • Spaces, and indeed every character except / and NUL, are allowed in filenames. The recommendation to not use spaces in filenames comes from the danger that they might be misinterpreted by software that poorly supports them. Arguably, such software is buggy. But also arguably, programming languages like shell scripting make it all too easy to write software that breaks when presented with filenames with spaces in them, and these bugs tend to slip through because shell scripts are not often tested by their developers using filenames with spaces in them.
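    The classic way such a bug slips in is an unquoted variable. A minimal sketch, assuming a POSIX-style shell with the default `IFS`; the helper `count_args` is a hypothetical name used only here.

```shell
# Helper that reports how many arguments it received (hypothetical name).
count_args() { echo "$#"; }

f="my file with spaces"

# Unquoted: the shell splits the value on whitespace into 4 words.
unquoted=$(count_args $f)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$f")

echo "unquoted=$unquoted quoted=$quoted"
```

    A script that only ever sees names like `report.txt` behaves identically in both forms, which is why the missing quotes tend to go unnoticed.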

    Spaces replaced with %20 is not often seen in filenames. That's mostly used for (web) URLs. Though it's true that %-encoding from URLs sometimes makes its way into filenames, often by accident.

    what is the name of %-encoding? Unicode, utf-8, ...?

    It's "URL encoding" or "percent encoding" (see http://en.wikipedia.org/wiki/URL_encoding). As per that, the most appropriate name is probably "URI encoding", but people find *url* easier to say than *U.R.I.*, so this is a common form of misnomer. Notice that the set of reserved characters in URIs is larger than it is for *nix filenames.

    @Tim I don't know that you *can* specify a NUL character in any command line argument in `bash`. I tried a few things such as quoting it with Ctrl-V and something like `$(echo -e \\0)` but it didn't work. The thing is, the reason NUL can't be used in filenames is that it can't be used in C strings (because it's the string terminator) and all the underlying APIs as well as virtually all strings handled by C programs use that format. Since `bash` is written in C, it might simply have no support at all for any strings with NUL in them. I could be wrong, there might be some obscure way...

    is the NULL character the same as an empty string?

    @Tim How could a character be the same as a string?

    Sort of depends on the context. String functions generally don't count the final null (or rather, the first null is the end of the string, even if there's stuff after it), so in that sense it has zero length and therefore would be considered empty.

    The NULL character is `byte(0)`, whereas a space is `byte(32)`; an empty string is, well, empty: there are no bytes in it.

    @Tim @goldilocks @OneOfOne There is only one `L` in `NUL`. `NUL` is the termination character for C-strings. `NULL` is the value of a pointer, which doesn't point to anything.
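    The byte counts behind this exchange can be checked from the shell. A quick sketch using only `printf`, `wc`, `od` and `tr`; the variable names are illustrative.

```shell
# One NUL byte is still one byte of data...
nul_len=$(printf '\0' | wc -c | tr -d ' ')

# ...whereas an empty string contains no bytes at all.
empty_len=$(printf '' | wc -c | tr -d ' ')

# A space is an ordinary byte with value 32 (octal 040).
space_octal=$(printf ' ' | od -An -t o1 | tr -d ' \n')

echo "NUL: $nul_len byte, empty: $empty_len bytes, space: octal $space_octal"
```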

    Generally, languages do poorly if they are based on text manipulations, e.g. bash and make, without higher-level constructs. IMO, spaces should have been disallowed, just as `/` was.

    @PaulDraper languages like Python handle NULs in strings just fine. I don't think there is a good reason to disallow spaces in filenames. After all, control characters, even "dangerous" things like VT100 escape sequences, are allowed in filenames. There is, however, a good argument to be made for forcing filenames to be well-formed UTF-8. But even that argument won't be accepted by purists of POSIX, in which filenames can be arbitrary sequences of any byte with only 0x00 (disallowed, end of string) and 0x2F (special meaning, pathname separator) as exceptions.

    @Celada, (1) Python is much more powerful than make or bash. E.g. it has lists. Make and bash use spaces for lists, which of course often breaks if there are spaces within filenames. (2) Clearly some characters are disallowed, as you mentioned 0x0 and 0x2F. 0x20 would have good precedent. But that's really neither here nor there as it won't change.

    @Celada Enforcing valid UTF-8 at the kernel level would be a problem, because the kernel doesn't always know which encoding is used. The APIs are older than UTF-8. There are still file systems around on which the names are using ISO-8859-1 encoding. At some point it just became common to replace the entire user level running on top of the kernel with one using UTF-8 encoding.

    @PaulDraper Disallowing space isn't going to do much good as long as other white space characters are still allowed. There are constructions which handle space just fine, but break on newline. For example pipe the default output format of `find` into a `while read FILENAME` loop.
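    A sketch of that failure, together with one alternative that needs no NUL delimiter: letting `find` pass the names directly as arguments via `-exec ... {} +`. The scratch directory and the variable names are assumptions for the demonstration.

```shell
dir=$(mktemp -d)
touch "$dir/one
two"

# Line-oriented reading: the single file shows up as two "names".
lines=$(find "$dir" -type f |
    while read -r name; do echo x; done | wc -l | tr -d ' ')

# Passing names as arguments: the file arrives as exactly one argument.
args=$(find "$dir" -type f -exec sh -c 'echo "$#"' sh {} +)

rm -rf -- "$dir"
echo "lines=$lines args=$args"
```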

    @Celada of course you can use `NUL` and bash, you need `$'\0'`. For example: `find . -print0 | while read -d $'\0' f; do echo "$f"; done`

    Good catch @terdon I didn't think of `$'\0'`. Still, as @goldilocks said in a comment on the other answer, it would only ever be able to use it internally (like in your example, `read` being a builtin) because `execve()` & co. don't support passing it as a command line argument to any external command.

    @terdon: That's not really true. `$'\0'` is equivalent to `''` -- and (for example) `$'foo\0bar'` is equivalent to `foo`. You may prefer to write `$'\0'` instead of `''` when the notation better matches the semantics you have in mind, but don't let it deceive you.

    @ruakh I'm deleting our comments here since they're not really that relevant. I might post a question about this and will let you know if so. Thanks for the info anyway.

    @goldilocks Do people actually pronounce URL as 'url', roughly rhyming with 'earl'?

    @MilesRout I don't think it's too unusual. The bit about that's why it's called URL encoding is speculation on my part though, lol. It could also be because they're the most common form of URI ;) Actually the two Berners-Lee RFCs that wikipedia references are from 1994 and 1998; it's explained the same way in both, but the topic of the first one is URLs whereas the second is URIs. But they just call it encoding or "escaped encoding" (the `%` I guess being analogous to an escape character).

    @terdon: Since you never posted the question, I've taken the liberty of posting a self-answered one: http://unix.stackexchange.com/q/174016/12378. Please let me know if you have any feedback on it.

  • Spaces are allowed in filenames, as you have observed.

    If you look at the "most UNIX filesystems" entry in this chart in wikipedia, you'll notice:

    • Any 8-bit character set is allowed. We can subsume 7-bit ASCII under this umbrella too, since it is a subset of various 8-bit sets and is always implemented using 8 bit bytes.

    • The only forbidden characters are / and "null". "Null" refers to a zero byte, but these are not allowed in text data anyway.

    However, if you make any use of the shell, you may realize that there are some characters that will create a hassle, most significantly *, which is a POSIX globbing operator.

    Depending on how you want to define "hassle", you could include whitespace (spaces, tabs, newlines, etc.) in there, as this creates the need for quoting with "". But this is inevitable, since spaces are allowed, so...

    How do you use or deal with a space in a filename correctly?

    In a shell/command line context, wrap the filename in single or double quotes (but note they are not the same WRT other issues), or escape the spaces with \, e.g.:

    > foo my\ file\ with\ spaces\ in\ the\ name
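    To see that the three forms hand the same single argument to a command, and where single and double quotes differ, here is a small sketch. The function `show` stands in for `foo` and is a hypothetical helper.

```shell
# Print each argument on its own line, wrapped in brackets.
show() { printf '[%s]\n' "$@"; }

escaped=$(show my\ file\ with\ spaces)
double=$(show "my file with spaces")
single=$(show 'my file with spaces')

# Single quotes keep "$" literal; double quotes expand variables.
name=spaces
expanded=$(show "my file with $name")
literal=$(show 'my file with $name')

printf '%s\n' "$escaped" "$double" "$single" "$expanded" "$literal"
```

    The first four lines of output are identical; only the single-quoted form with `$name` keeps the dollar sign as written.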
    

    How do you specify NUL character in bash? I want to test it in a filename.

    You can't. The "execve semantics" refers to the fact that in C (and every other language I'm aware of), text strings are null terminated. The shell is implemented in C. The sneakiest thing I could think of is `touch $(echo -e "foo\00bar")` -- `-e` processes `\0N` as an octal value, but it still gets lost somewhere, as that just creates a file named `foobar`. Of course NULL isn't printable, but I guarantee it's gone from there because of the C string restriction.

    *"text strings are null terminated"* -> To explain further: strings are always stored with a zero byte at the end, which is why it "isn't allowed" in text: If you insert one, you've effectively terminated the string at that point. Eg., `foo[NULL]bar` would end up as `foo` for most intents and purposes. The fact that doesn't happen with that `echo -e` shows the NULL has been pruned out somewhere.

    A vast majority of programming languages do allow null characters in strings. It just happens that the main language that doesn't is C, which Unix is built on — and most Unix shells don't allow null characters in strings either. In any case, @Tim, all Unix interfaces use null-terminated strings, so a null byte is the one thing you cannot ever have in a file name (plus `/` which is the directory separator and cannot be quoted, so can be in a pathname but not in a filename).

    @Gilles Point taken regarding the vast majority of languages. I was thinking [never mind] wrongly. Point here remains the same otherwise.

    @goldilocks, you apparently are *only* aware of C. You don't have to go very far to find strings that aren't null-terminated, e.g. C++.

    @PaulDraper No, actually I was thinking of C++ in relation to Tim's question "is the NULL character the same as an empty string?" (from a comment on Celada's that happened about the same time). A C++ string `string str("\0")` has a `.length()` of 0 and is `.empty()`. A `string s("hello\0world")` has a length of 5 -- but of course that's because they're being initialized from `const char*`...most of my coding experience is in C, C++, java, and perl. You *can* have a zero in a string in the last three...

    ...but [never mind again]. Not something I would do too often, anyway. To my mind there's no reason for them to be in textual data. I would have corrected that, but it's a comment.

    And then you get a file which also includes double quotes in the filename...

    @Falco So `"file name with \" in it"`.

  • The reason is largely historical - WAY back in the mists of time spaces were not allowed in filenames, so spaces were used as keyword / filename separators. Later shell interpreters had to be backward-compatible with old scripts, and thus we are stuck with the headache we have today.

    Developers of processes that do not need to deal with humans very much can make things much, much easier by dropping spaces altogether. Apple does this: the contents of /System/Library/CoreServices/ contain very few spaces, and the programs that do have spaces are opened on behalf of the user andWouldLookStrangeIfCamelCased. Similar Unix-only paths also avoid spaces.

    ( somewhat related anecdote: in the mid-90's a Windows drone said "Name one thing you can do on a Mac that I can't do on Windows" -> "Use 12 characters in a filename." -> Silence. Spaces were also possible in those 12 characters)

    I used to use V6 Unix (c. 1978). Spaces *were* allowed then. One task I had was to write a program to parse the file system (using direct disk i/o) and look for a file which had spaces and backspaces in its name.

    do they drop spaces altogether - or do the filenames contain a very few spaces?

  • So yes, as is stated many times elsewhere, a filename can contain nearly any character. But it needs to be said that a filename is not a file. It does carry some weight as a file attribute in that you typically need a filename to open a file, but a file's name only points to the actual file. It is a link, stored in the directory that has recorded it, alongside the inode number - which is a much closer approximation to an actual file.

    So, you know, call it whatever you want. The kernel doesn't care - all file references it will handle will deal with real inode numbers anyway. The filename is a thing for human consumption - if you wanna make it a crazy thing, well, it's your filesystem. Here, I'll do some crazy stuff:

    First I'll create 20 files, and name them with nothing but spaces, each filename containing one more space than the last:

    until [ $((i=$i+1)) -gt 20 ]
    do  v=$v' ' && touch ./"$v"
    done
    

    This is kinda funny. Look at my ls:

    ls -d ./*
    ./      ./          ./              ./                  ./                 
    ./      ./          ./              ./                  ./                  
    ./      ./          ./              ./                  ./                   
    ./      ./          ./              ./                  ./     
    

    Now I'm going to mirror this directory:

    set -- * ; mkdir ../mirror
    ls -i1qdU -- "$@" |
    sh -c 'while read inum na
        do  ln -T "$1" ../mirror/$inum
        shift ; done' -- "$@"
    ls -d ../mirror/*
    

    Here are ../mirror/'s contents:

    ../mirror/423759  ../mirror/423764  ../mirror/423769  ../mirror/423774
    ../mirror/423760  ../mirror/423765  ../mirror/423770  ../mirror/423775
    ../mirror/423761  ../mirror/423766  ../mirror/423771  ../mirror/423776
    ../mirror/423762  ../mirror/423767  ../mirror/423772  ../mirror/423777
    ../mirror/423763  ../mirror/423768  ../mirror/423773  ../mirror/423778
    

    Ok, but maybe you're asking - but what good is that? How can you tell which is which? How can you even be sure you linked the right inode number to the right filename?

    Well...

    echo "heyhey" >>./'    ' 
    tgt=$(ls -id ./'    ')
    cat ../mirror/${tgt%% .*} \
        $(ls -1td ../mirror/* | head -n1) 
    

    OUTPUT

    heyhey
    heyhey
    

    See, both the inode number contained in ../mirror/"${tgt%% .*}" and that referenced by ./' ' refer to the same file. They describe the same file. They name it, but nothing more. There is no mystery, really, just some inconvenience you might make for yourself, but which will ultimately have little to no effect on the operation of your unix filesystem in the end.
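    The same point in a more compact form: a sketch that gives one file two names with `ln` and compares the inode numbers reported by `ls -i`. The scratch directory and the name `second-name` are assumptions made up for this example.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Create a file whose name is three spaces, then give it a second name.
echo "heyhey" > ./'   '
ln ./'   ' second-name

# ls -i prints the inode number first; keep only that field.
inode_a=$(ls -di ./'   ' | awk '{print $1}')
inode_b=$(ls -di second-name | awk '{print $1}')

# Reading through either name returns the same data.
content=$(cat second-name)

cd / && rm -rf -- "$dir"
echo "inode_a=$inode_a inode_b=$inode_b content=$content"
```

    Both names resolve to one inode, so the awkward all-spaces name is never needed to reach the data once a friendlier link exists.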

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM