Best practice between using LEFT JOIN or NOT EXISTS

  • Is there a best practice between using a LEFT JOIN or a NOT EXISTS format?

    What is benefit to using one over the other?

    If none, which should be preferred?

    SELECT *
    FROM tableA A
    LEFT JOIN tableB B
         ON A.idx = B.idx
    WHERE B.idx IS NULL
    

    SELECT *
    FROM tableA A
    WHERE NOT EXISTS
    (SELECT idx FROM tableB B WHERE B.idx = A.idx)
    

    I am using queries within Access against a SQL Server database.

    As an aside, the seemingly-identical approach `WHERE A.idx NOT IN (...)` is **not** identical due to the trivalent behavior of `NULL` (i.e. `NULL` is not equal to `NULL` (nor unequal), therefore if you have *any* `NULL` in `tableB` you will get unexpected results!)

  • Max Vernon

    Max Vernon Correct answer

    5 years ago

    The biggest difference is not in the join vs not exists, it is (as written), the SELECT *.

    On the first example, you get all columns from both A and B, whereas in the second example, you get only columns from A.

    In SQL Server, the second variant is slightly faster in a very simple contrived example:

    Create two sample tables:

    CREATE TABLE dbo.A
    (
        A_ID INT NOT NULL
            PRIMARY KEY CLUSTERED
            IDENTITY(1,1)
    );
    
    CREATE TABLE dbo.B
    (
        B_ID INT NOT NULL
            PRIMARY KEY CLUSTERED
            IDENTITY(1,1)
    );
    GO
    

    Insert 10,000 rows into each table:

    INSERT INTO dbo.A DEFAULT VALUES;
    GO 10000
    
    INSERT INTO dbo.B DEFAULT VALUES;
    GO 10000
    

    Remove every 5th row from the second table:

    DELETE 
    FROM dbo.B 
    WHERE B_ID % 5 = 1;
    
    SELECT COUNT(*) -- shows 10,000
    FROM dbo.A;
    
    SELECT COUNT(*) -- shows  8,000
    FROM dbo.B;
    

    Perform the two test SELECT statement variants:

    SELECT *
    FROM dbo.A
        LEFT JOIN dbo.B ON A.A_ID = B.B_ID
    WHERE B.B_ID IS NULL;
    
    SELECT *
    FROM dbo.A
    WHERE NOT EXISTS (SELECT 1
        FROM dbo.B
        WHERE b.B_ID = a.A_ID);
    

    Execution plans:

    enter image description here

    The second variant does not need to perform the filter operation since it can use the left anti-semi join operator.

  • Logically they are identical, but NOT EXISTS is closer to the AntiSemiJoin that you're asking for, and is generally preferred. It also highlights better that you can't access the columns in B, because it's only used as a filter (as opposed to having them available with NULL values).

    Many years ago (SQL Server 6.0 ish), LEFT JOIN was quicker, but that hasn't been the case for a very long time. These days, NOT EXISTS is marginally faster.


    The biggest impact in Access is that the JOIN method has to complete the join before filtering it, constructing the joined set in memory. Using NOT EXISTS it checks for the row but doesn't allocate space for the columns. Plus, it stops looking once it finds a row. Performance varies a bit more in Access, but a general rule of thumb is that NOT EXISTS tends to be a little faster. I'd be less inclined to say it's "best practice", as there are more factors involved.

  • An exception I've noticed to the NOT EXISTS being superior (however marginally) to LEFT JOIN ... WHERE IS NULL is when using Linked Servers.

    From examining the execution plans, it appears that NOT EXISTS operator gets executed in a nested loop fashion. Whereby it is executed on a per row basis (which I suppose makes sense).

    Example execution plan demonstrating this behaviour: enter image description here

    Linked servers are brutal for that kind of thing. A possible approach to solving that problem is to copy the remote data over the linked server link using a simple `INSERT INTO #t (a,b,c) SELECT a,b,c FROM LinkedServer.database.dbo.table WHERE x=y` then running the `NOT EXISTS (...)` clause against that temporary copy of the database.

  • In general, the engine will create an execution plan based essentially on:

    1. The number of rows in A and B
    2. Whether there is an Index on A and/or B.
    3. The expected number of result rows (and intermediate rows)
    4. The form of the input query (i.e. your question)

    For (4):

    The "not exists" plan encourages a seek based plan on table B. This is a good choice when table A is small and table B is large (and an index exists on B).

    The "antijoin" plan is a good choice when table A is very large or table B is very small or no index on B and returning a large result set.

    However it is just an "encouragement", like a weighted input. A strong (1),(2),(3) often makes the choice for (4) moot.

    ( Ignoring the effect of your example returning different columns due to the *, addressed by @MaxVernon answer. ).

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM