What's the best way to join files again after splitting them?

  • If I have a large file and need to split it into 100 megabyte chunks I will do

    split -b 100m myImage.iso
    

    That usually gives me something like

    xaa
    xab
    xac
    xad
    

    And to get them back together I have been using

    cat x* > myImage.iso
    

    It seems like there should be a more efficient way than having cat read through the full contents of every file in the group and redirect the output to a new file. Something like opening two files, removing the EOF marker from the first one, and connecting them, without having to go through all the contents.

    Windows/DOS has a copy command for binary files. The help mentions that this command was designed to be able to combine multiple files. It works with this syntax (/b is for binary mode):

    copy /b file1 + file2 + file3 outputfile
    

    Is there something similar or a better way to join large files on Linux than cat?

    Update

    It seems that cat is in fact the right way and the best way to join files. Glad to know I was using the right command all along :) Thanks everyone for your feedback.

    Why do you think 'cat x* > myImage.iso' is 'more efficient' than 'copy /b file1 + file2 + file3 outputfile'?

    Side note: Better not use `cat x*`, because the order of files depends on your locale settings. Better start typing `cat x`, then press *Esc* and then `*` - you'll see the expanded order of files and can rearrange.

    Instead of `cat x*` you could consider shell brace expansion, `cat xa{a..g}` which expands the specified sequence to `cat` *xaa xab xac xad xae xaf xag*

    @symcbean - I actually was thinking that a command like `copy` (on Windows) seemed like a more efficient method than `cat`, partly because the help for `copy` mentions that it can be used this way. I knew that `cat` would work to join files, and it works quickly with small files, but I was trying to ask if there was a better way to join files - especially very large files.

    @rozcietrzewiacz - can you give an example of how I would adjust my locale setting that would break `cat x*` ? Would the new locale setting not also affect `split` so that if `split` and `cat x*` were used on the same system they would always work?

    "opening two files, removing the EOF marker from the first one, and connecting them - without having to go through all the contents."... sounds like you need to invent a new filesystem in order to do what you want

    @JoelFan - or just acquire a deeper understanding of the capabilities of the existing file system.

    `copy /b … outputfile` does exactly what `cat … >outputfile` does. The `/b` flag tells `copy` not to mess up the data, and the syntax of `copy` is weird, but under the hood they do the same job.

    @Giles - thanks, that makes me feel better. The whole point of the question was just to make sure I'm doing this the 'right' way - and from the responses it seems very apparent that `cat` is in fact the best way.

    @rozcietrzewiacz: I think the `split` command constructs its output file names in a manner that isn't susceptible to locale-specific reordering. (Though I suppose you could create a customized locale in which the 26 lowercase Latin letters aren't in their usual order.)

    @cwd: Looking at `split.c` in GNU Coreutils, the suffixes are constructed from a fixed array of characters: `static char const *suffix_alphabet = "abcdefghijklmnopqrstuvwxyz";`. The suffix wouldn't be affected by the locale. (But I don't think any sane locale would reorder the lowercase letters; even EBCDIC maintains their standard order.)

    @Keith & cwd: Sorry, I overlooked the first prompt. In case of files produced with `split`, I agree with Keith. I was referring to a general habit of concatenating files. And, more broadly, feeding a list of files to a command.
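
    To illustrate the general point about locale-dependent glob order, here is a minimal sketch. The file names `xaa`, `xab` and `xAc` are made up for the demonstration (`split` itself would not produce a mixed-case suffix), and it assumes a shell such as bash that re-reads `LC_ALL` when the variable is assigned. Forcing the C locale inside a subshell gives plain byte-order sorting, so the result no longer depends on the user's settings:

    ```shell
    # Scratch directory so the demo does not touch real files.
    workdir=$(mktemp -d)
    cd "$workdir" || exit 1

    # Hypothetical file names, chosen only to make the ordering visible.
    touch xaa xab xAc

    # Set the locale in a subshell so the change does not leak out.
    # The C locale sorts by byte value, so uppercase 'A' (0x41) comes
    # before lowercase 'a' (0x61): prints "xAc xaa xab".
    ( LC_ALL=C; export LC_ALL; echo x* )
    ```

    In other locales the same glob may come back in a different order, which is the kind of surprise the side note warns about when feeding a list of files to a command.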

    @Davide notes: "Tip: To be sure that no errors occurred when splitting and joining, calculate a hash of the source (before splitting) and compare it with the hash of the file resulting from the merge. If the two hashes match, I can be sure the procedure produced no errors. So when giving out split files, always give the hash too."
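
    The hash check from that tip could look roughly like the sketch below. It assumes GNU coreutils (`sha256sum` is not available under that name everywhere), uses 1 MiB of random data as a stand-in for `myImage.iso`, and picks the chunk prefix `myImage.iso.part.` purely for illustration:

    ```shell
    # Scratch directory so the demo does not touch real files.
    workdir=$(mktemp -d)
    cd "$workdir" || exit 1

    # Stand-in for the real image: 1 MiB of random bytes.
    head -c 1048576 /dev/urandom > myImage.iso

    # Record the checksum of the source before splitting.
    sha256sum myImage.iso > myImage.iso.sha256

    # Split into 100 KiB chunks with an explicit prefix, then drop the original.
    split -b 100k myImage.iso myImage.iso.part.
    rm myImage.iso

    # Rejoin the chunks and check the result against the recorded checksum.
    cat myImage.iso.part.* > myImage.iso
    sha256sum -c myImage.iso.sha256
    ```

    If the rejoined file differs from the source in any byte, `sha256sum -c` reports a failure and exits with a non-zero status, so the check is easy to script.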

    @Peter.O you can nest brace expansion `cat x{{a..j}{a..z},k{a..f}} > myImage.iso`. That will expand from `xaa` to `xkf`.
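
    A quick way to check a nested brace pattern before using it is to preview it with `echo`. The sketch below uses a shortened, made-up range rather than the full `xaa` to `xkf`, and calls `bash -c` explicitly because `{a..z}` ranges are a bash/zsh feature that a plain POSIX `sh` leaves unexpanded:

    ```shell
    # Preview the expansion instead of passing it straight to cat.
    # The ranges here are shortened for illustration only.
    expanded=$(bash -c 'echo x{{a..b}{a..c},c{a..b}}')

    # Prints: xaa xab xac xba xbb xbc xca xcb
    echo "$expanded"
    ```

    Unlike `x*`, brace expansion generates names whether or not the files exist, so `cat` will complain about any chunk that is missing rather than silently skipping it.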

  • That's just what cat was made for. Since it is one of the oldest Unix tools, I think it's very unlikely that any other tool does that faster/better. And it's not piping - it's only redirecting output.

    The `cat x, then press Esc` trick you mentioned is neat.. I've been looking for something like that, thanks... good comment and good answer

    You're welcome :) Also, when you have that list of files on the command line, you can use `Ctrl+W` to cut out a word and then `Ctrl+Y` to paste it.

    cat means "concatenate"

    .. and "catenate" derives from a Latin word "catena" which means "a chain".. *concatenating* is joining up the links of a chain. ... (and a bit further off-topic, a *catenary curve* also derives from "catena". It is the way a chain hangs)

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM