Filtering a dataframe

  • Still learning basic functions in R, The subset function seems to only filter based a condition based on single column with or without multiple conditions?

    How can I easily filter data from a dataframe?

    1. when you are provided with multiple conditions

    2. When the condition needs to be applied across the available columns.

    Example: Given a data frame containing

    name    D1      D2     D3      D4
    julius  "A"     "A"    "B"     "B"
    cate    "D"     "E"     "A"     "C"
    karo    "A"     "D"     "C"     "E"
    

    say I want to filter this dataframe so that only names where any of the D1 to D4 is an 'E' then I should have,

    name    D1      D2     D3      D4
    cate    "D"     "E"     "A"     "C"
    karo    "A"     "D"     "C"     "E"
    

    Say that the D1 can be a big list of columns, how or what is the recommended approach to perform this filter? Thank you

  • mpiktas

    mpiktas Correct answer

    10 years ago

    If you want to combine several filters in subset function use logical operators:

     subset(data, D1 == "E" | D2 == "E")
    

    will select those rows for which either column D1 or column D2 has value "E". Look at the help pages for available logical operators:

     > ?"|"
    

    For your second question what you need is to filter the rows. This can be achieved in the following way

     collist <- c("D1","D2","D3","D4")
     sel <- apply(data[,collist],1,function(row) "E" %in% row)
     data[sel,]
    

    The first argument to apply suplies the columns on which we need to filter. The second argument is 1, meaning that we are looping through rows of the data. The third argument is unnamed one-line function which returns TRUE if "E" is present in the row and FALSE if the "E" is not present.

    The result of the apply function will be logical vector sel, which has the length the same as number of rows in data. We then use this vector to select the necessary rows.

    Update

    The same can be achieved with grep:

    sel <- apply(data[,collist],1,function(row) length(grep("E",row))>0)
    

    in R grep with default arguments returns the numbers of elements in the supplied vector which have the matching pattern.

    another useful function is `any`. For instance if you want to check that at least one element of a vector is `= 10` you could write (`any(v==10)`).

    @nico yeah, but **10 %in% v** has 9 characters and variant with **any** 10 :). Although for numeric vectors it is better to use equality, since R is smart and if your data is integers mixed with real numbers it will correctly recognize that you have 10 in your data set.

    what about the case for regular expression? assuming that you do not want to work with an exact match? Do they have something close to ~= as in other languages? what is most close to that expression in R

    @Biorelated See `grep` and `agrep`, and related documentation on POSIX 1003.2 or Perl-compatible regex.

    @Biorelated, I've update the answer with the example of grep.

    Thanks a tonne on this!!!! the explanation and the references! awesome.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM