Matching single column against multiple values without self-joining table in MySQL

  • We have a table that we use to store answers to questions. We need to be able to find users that have certain answers to particular questions. So, if our table consists of the following data:

    user_id     question_id     answer_value  
    Sally        1               Pooch  
    Sally        2               Peach  
    John         1               Pooch  
    John         2               Duke
    

    and we want to find users who answer 'Pooch' for question 1 and 'Peach' for question 2, the following SQL will (obviously) not worK:

    select user_id 
    from answers 
    where question_id=1 
      and answer_value = 'Pooch'
      and question_id=2
      and answer_value='Peach'
    

    My first thought was to self-join the table for each answer we are looking for:

    select a.user_id 
    from answers a, answers b 
    where a.user_id = b.user_id
      and a.question_id=1
      and a.answer_value = 'Pooch'
      and b.question_id=2
      and b.answer_value='Peach'
    

    This works, but since we allow an arbitrary number of search filters, we need to find something much more efficient. My next solution was something like this:

    select user_id, count(question_id) 
    from answers 
    where (
           (question_id=2 and answer_value = 'Peach') 
        or (question_id=1 and answer_value = 'Pooch')
          )
    group by user_id 
    having count(question_id)>1
    

    However, we want users to be able to take the same questionnaire twice, so they could potentially have two answers to question 1 in the answers table.

    So, now I'm at a loss. What's the best way to approach this? Thanks!

  • We were joining the user_id from the answers table in a chain of joins to get data from other tables, but isolating the answer table SQL and writing it in such simple terms helped me spot the solution:

    SELECT user_id, COUNT(question_id) 
    FROM answers 
    WHERE
      (question_id = 2 AND answer_value = 'Peach') 
      OR (question_id = 1 AND answer_value = 'Pooch')
    GROUP by user_id 
    HAVING COUNT(question_id) > 1
    

    We were unnecessarily using a second sub-query.

    i like you answer

  • I have found a clever way to do this query without a self join.

    I ran these commands in MySQL 5.5.8 for Windows and got the following results:

    use test
    DROP TABLE IF EXISTS answers;
    CREATE TABLE answers (user_id VARCHAR(10),question_id INT,answer_value VARCHAR(20));
    INSERT INTO answers VALUES
    ('Sally',1,'Pouch'),
    ('Sally',2,'Peach'),
    ('John',1,'Pooch'),
    ('John',2,'Duke');
    INSERT INTO answers VALUES
    ('Sally',1,'Pooch'),
    ('Sally',2,'Peach'),
    ('John',1,'Pooch'),
    ('John',2,'Duck');
    
    SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
    FROM answers GROUP BY user_id,question_id;
    
    +---------+-------------+---------------+
    | user_id | question_id | given_answers |
    +---------+-------------+---------------+
    | John    |           1 | Pooch         |
    | John    |           2 | Duke,Duck     |
    | Sally   |           1 | Pouch,Pooch   |
    | Sally   |           2 | Peach         |
    +---------+-------------+---------------+
    

    This display reveals that John gave two different answers to question 2 and Sally gave two different answers to question 1.

    To catch which questions were answered differently by all users, just place the above query in a subquery and check for a comma in the list of given answers to get the count of distinct answers as follows:

    SELECT user_id,question_id,given_answers,
    (LENGTH(given_answers) - LENGTH(REPLACE(given_answers,',','')))+1 multianswer_count
    FROM (SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
    FROM answers GROUP BY user_id,question_id) A;
    

    I got this:

    +---------+-------------+---------------+-------------------+
    | user_id | question_id | given_answers | multianswer_count |
    +---------+-------------+---------------+-------------------+
    | John    |           1 | Pooch         |                 1 |
    | John    |           2 | Duke,Duck     |                 2 |
    | Sally   |           1 | Pouch,Pooch   |                 2 |
    | Sally   |           2 | Peach         |                 1 |
    +---------+-------------+---------------+-------------------+
    

    Now just filter out rows where multianswer_count = 1 using another subquery:

    SELECT * FROM (SELECT user_id,question_id,given_answers,
    (LENGTH(given_answers) - LENGTH(REPLACE(given_answers,',','')))+1 multianswer_count
    FROM (SELECT user_id,question_id,GROUP_CONCAT(DISTINCT answer_value) given_answers
    FROM answers GROUP BY user_id,question_id) A) AA WHERE multianswer_count > 1;
    

    This is what I got:

    +---------+-------------+---------------+-------------------+
    | user_id | question_id | given_answers | multianswer_count |
    +---------+-------------+---------------+-------------------+
    | John    |           2 | Duke,Duck     |                 2 |
    | Sally   |           1 | Pouch,Pooch   |                 2 |
    +---------+-------------+---------------+-------------------+
    

    Essentially, I performed three table scans: 1 on the main table, 2 on the small subqueries. NO JOINS !!!

    Give it a Try !!!

    I always appreciate the level of effort you put into your answers.

  • I like join method, myself:

    SELECT a.user_id FROM answers a
    INNER JOIN answers a1 ON a1.question_id=1 AND a1.answer_value='Pooch'
    INNER JOIN answers a2 ON a2.question_id=2 AND a2.answer_value='Peach'
    GROUP BY a.user_id
    

    Update After testing with a larger table (~1 million rows), this method took significantly longer than the simple OR method mentioned in the original question.

    Thanks for the reply. The issue is that this could potentially be a large table, and having to join it 5-6 times may mean taking a huge performance hit, correct?

    good quesiton. i'm writing a testcase to test it out, as i don't know...will post results when it's done

    so I inserted 1 million rows with random user, question/answer pairs. Join is still going at 557seconds and your OR query finished in 1.84 seconds...going to sit in a corner now.

    do you have indexes on the test table? If you're scanning million rows table few times it will be a bit slow, no doubt :-).

    @Marian yeah, i added an index on (question_id,answer_value) problem is the cardinality is extremely low, so it doesn't help much (each join was 100-200k rows scanned)

  • If you have a large set of data, I would do two indexes:

    • question_id, answer_value, user_id; and
    • user_id, question_id, answer_value.

    You will need to join multiple times because of the way the data is organized. If you know which value for which question is least common you may be able to speed the query a bit, but the optimizer should do it for you.

    Try the query as:

    SELECT a1.user_id FROM answers a1
    WHERE a1.question_id=1 AND a1.answer_value='Pooch'
    INNER JOIN answers a2 ON a2.question_id=2 
       AND a2.answer_value='Peach' AND a1.user_id = a2.user_id

    Table a1 should use the first index. Depending on data distribution the optimizer may use either index. The entire query should be satisfied from the indexes.

  • One way to approach it is to get a subset of user_id and test those for the second match:

    SELECT user_id 
    FROM answers 
    WHERE question_id = 1 
    AND answer_value = 'Pooch'
    AND user_id IN (SELECT user_id FROM answers WHERE question_id=2 AND answer_value = 'Peach');
    

    Using Rolando's structure:

    CREATE TABLE answers (user_id VARCHAR(10),question_id INT,answer_value VARCHAR(20));
    INSERT INTO answers VALUES
    ('Sally',1,'Pouch'),
    ('Sally',2,'Peach'),
    ('John',1,'Pooch'),
    ('John',2,'Duke');
    INSERT INTO answers VALUES
    ('Sally',1,'Pooch'),
    ('Sally',2,'Peach'),
    ('John',1,'Pooch'),
    ('John',2,'Duck');
    

    Yields:

    mysql> SELECT user_id FROM answers WHERE question_id = 1 AND answer_value = 'Pooch' AND user_id IN (SELECT user_id FROM answers WHERE question_id=2 AND answer_value = 'Peach');
    +---------+
    | user_id |
    +---------+
    | Sally   |
    +---------+
    1 row in set (0.00 sec)
    

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM