How to merge two files based on the matching of two columns?

  • I have file1 likes:

    0   AFFX-SNP-000541  NA
    0   AFFX-SNP-002255  NA
    1   rs12103          0.6401
    1   rs12103_1247494  0.696
    1   rs12142199       0.7672
    

    And a file2:

    0   AFFX-SNP-000541   1
    0   AFFX-SNP-002255   1
    1   rs12103           0.5596
    1   rs12103_1247494   0.5581
    1   rs12142199        0.4931
    

    And would like a file3 such that:

    0   AFFX-SNP-000541     NA       1
    0   AFFX-SNP-002255     NA       1
    1   rs12103             0.6401   0.5596
    1   rs12103_1247494     0.696    0.5581
    1   rs12142199          0.7672   0.4931
    

    Which means to put the 4th column of file2 to file1 by the name of the 2nd column.

    File2 only got three columns?

  • terdon

    terdon Correct answer

    7 years ago

    This should do it:

    join -j 2 -o 1.1,1.2,1.3,2.3 file1 file2
    

    Important: this assumes your files are sorted (as in your example) according to the SNP name. If they are not, sort them first:

    join -j 2 -o 1.1,1.2,1.3,2.3 <(sort -k2 file1) <(sort -k2 file2)
    

    Output:

    0 AFFX-SNP-000541 NA 1
    0 AFFX-SNP-002255 NA 1
    1 rs12103 0.6401 0.5596
    1 rs12103_1247494 0.696 0.5581
    1 rs12142199 0.7672 0.4931
    

    Explanation (from info join):

    `join' writes to standard output a line for each pair of input lines that have identical join fields.

    `-1 FIELD'
         Join on field FIELD (a positive integer) of file 1.
    
    `-2 FIELD'
         Join on field FIELD (a positive integer) of file 2.
    
    `-j FIELD'
         Equivalent to `-1 FIELD -2 FIELD'.
    
    `-o FIELD-LIST'
    
     Otherwise, construct each output line according to the format in
     FIELD-LIST.  Each element in FIELD-LIST is either the single
     character `0' or has the form M.N where the file number, M, is `1'
     or `2' and N is a positive field number.
    

    So, the command above joins the files on the second field and prints the 1st,2nd and 3rd field of file one, followed by the 3rd field of file2.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM