Eliminate duplicates in ListAgg (Oracle)

  • Prior to Oracle 11.2 I was using a custom aggregate function to concatenate a column into a row. 11.2 Added the LISTAGG function, so I am trying to use that instead. My problem is that I need to eliminate duplicates in the results and don't seem to be able to do that.

    Here is an example.

    CREATE TABLE ListAggTest AS (
      SELECT rownum Num1, DECODE(rownum,1,'2',to_char(rownum)) Num2 FROM dual 
         CONNECT BY rownum<=6
      );
    SELECT * FROM ListAggTest;
    
          NUM1 NUM2
    ---------- ---------------------
             1 2
             2 2                    << Duplicate 2
             3 3
             4 4
             5 5
             6 6
    

    What I want to see is this:

          NUM1 NUM2S
    ---------- --------------------
             1 2-3-4-5-6
             2 2-3-4-5-6
             3 2-3-4-5-6
             4 2-3-4-5-6
             5 2-3-4-5-6
             6 2-3-4-5-6
    

    Here is a listagg version that is close, but doesn't eliminate duplicates.

    SELECT Num1, listagg(Num2,'-') WITHIN GROUP (ORDER BY NULL) OVER () Num2s 
    FROM ListAggTest;
    

    I have a solution, but it's worse than continuing to use the custom aggregate function.

    Should `order by null` be `order by Num2` or am I getting confused?

    @Jack - It makes no difference to the duplicate elimination. Depending on your use, it may be desirable.

    *sigh* `LISTAGG` continues to fall short of Tom Kyte's `STRAGG`, with which it is as easy as `STRAGG(DISTINCT ...)`

    Finally it is possible: LISTAGG DISTINCT

  • You can use regular expressions and regexp_replace to remove the duplicates after concatenation with listagg:

    SELECT Num1, 
           RTRIM(
             REGEXP_REPLACE(
               (listagg(Num2,'-') WITHIN GROUP (ORDER BY Num2) OVER ()), 
               '([^-]*)(-\1)+($|-)', 
               '\1\3'),
             '-') Num2s 
    FROM ListAggTest;
    

    This could be tidier if Oracle's regex flavour supported lookahead or non-capturing groups, but it doesn't.

    However this solution does avoid scanning the source more than once.

    DBFiddle here

    Note that for this REGEX_REPLACE technique to work for removing duplicates, the duplicate values must all be next to each other in the aggregated string.

    That's what `ORDER BY Num2` achieves isn't it (see here). Or are you just trying to point out that you need the ORDER BY for it to work?

  • As far as I can see, with the currently available language specification, this is the shortest to achieve what you want if it must be done with listagg.

    select distinct
           a.Num1, 
           b.num2s
      from listaggtest a cross join (
           select listagg(num2d, '-') within group (order by num2d) num2s 
           from (
             select distinct Num2 num2d from listaggtest
           )
          ) b;
    

    What was your solution that was worse than the custom aggregate solution?

    This works, but has to do two full table scans.

    When you have a small table that you need to aggregate (< 100000 rows) the performance is more than acceptable for a simple retrieve. This has been my solution of choice after almost an hour of testing each possible ways!

    This also works when the duplicates would put the intermediate value over 4000 characters. That makes it safer than the `regexp` solution.

  • Although this is an old post with an accepted answer, I think the LAG() analytic function works well in this case and is noteworthy:

    • LAG() removes duplicate values in column num2 with minimal expense
    • No need for non-trivial regular expression to filter results
    • Just one full table scan (cost=4 on simple example table)

    Here is the proposed code:

    with nums as (
    SELECT 
        num1, 
        num2, 
        decode( lag(num2) over (partition by null order by num2), --get last num2, if any
                --if last num2 is same as this num2, then make it null
                num2, null, 
                num2) newnum2
      FROM ListAggTest
    ) 
    select 
      num1, 
      --listagg ignores NULL values, so duplicates are ignored
      listagg( newnum2,'-') WITHIN GROUP (ORDER BY Num2) OVER () num2s
      from nums;
    

    The results below appear to be what the OP desires:

    NUM1  NUM2S       
    1   2-3-4-5-6
    2   2-3-4-5-6
    3   2-3-4-5-6
    4   2-3-4-5-6
    5   2-3-4-5-6
    6   2-3-4-5-6 
    
  • Create a custom aggregate function to do this.

    Oracle database provides a number of pre-defined aggregate functions such as MAX, MIN, SUM for performing operations on a set of records. These pre-defined aggregate functions can be used only with scalar data. However, you can create your own custom implementations of these functions, or define entirely new aggregate functions, to use with complex data—for example, with multimedia data stored using object types, opaque types, and LOBs.

    User-defined aggregate functions are used in SQL DML statements just like the Oracle database built-in aggregates. Once such functions are registered with the server, the database simply invokes the aggregation routines that you supplied instead of the native ones.

    User-defined aggregates can be used with scalar data as well. For example, it may be worthwhile to implement special aggregate functions for working with complex statistical data associated with financial or scientific applications.

    User-defined aggregates are a feature of the Extensibility Framework. You implement them using ODCIAggregate interface routines.

  • Here was my solution to the problem which in my opinion isn't as nice as using our custom aggregate function which already exists.

    SELECT Num1, listagg(Num2,'-') WITHIN GROUP (ORDER BY NULL) OVER () Num2s FROM (
      SELECT Num1, DECODE(ROW_NUMBER() OVER (PARTITION BY Num2 ORDER BY NULL),
         1,Num2,NULL) Num2 FROM ListAggTest
    );
    
  • You could also use a collect statement and then write a custom pl/sql function that converts the collection to a string.

    CREATE TYPE varchar2_ntt AS TABLE OF VARCHAR2(4000);
    CREATE TYPE varchar2_ntt AS TABLE OF VARCHAR2(4000);
    
    select cast(collect(distinct num2 order by num2) as varchar2_ntt) 
    from listaggtest
    

    You can use distinct and order by in a collect clause but if combined the distinct won't work as of 11.2.0.2 :(

    Workaround could be a subselect:

    select collect(num2 order by num2) 
    from 
    ( 
        select distinct num2 
        from listaggtest
    )
    

    I fail to see how a custom pl/sql function would be better than a custom aggregate function. The resulting SQL is certainly simpler for the latter. Since this problem was on 11.2.0.2 the subselect would add an additional scan which I was trying to avoid.

    I'd say a PL/SQL function called ONCE to convert the collection into a string could be better than the aggregate function called thousands of times. I think this would reduce context switches a lot.

    Your theory sounds good and was one reason I was trying to avoid the custom aggregate function and was preferring a built in aggregate function like LISTAGG. If you'd like to do some timing comparisons I'd be interested in the results.

  • Use WMSYS.WM_Concat instead.

    SELECT Num1, Replace(Wm_Concat(DISTINCT Num2) OVER (), ',', '-')
    FROM ListAggTest;
    

    Note: This function is undocumented and unsupported. See https://forums.oracle.com/forums/message.jspa?messageID=4372641#4372641.

    If you call Oracle support and you are using `wm_concat` (even if you argue the `wm_concat` is not itself causing the problem) they'd have grounds for refusing to help because it is undocumented and unsupported - not the case if you use a custom aggregate or any other supported feature.

  • I created this solution before I encountered ListAgg, but there are still occasions, such as this duplicate value issue, then this tool is useful. The version below has 4 arguments to give you control over the results.

    Explanation CLOBlist takes contructor CLOBlistParam as a parameter. CLOBlistParam has 4 arguments

    string VARCHAR2(4000) - The variable to be aggregated
    delimiter VARCHAR2(100) - The delimiting string
    initiator VARCHAR2(100) - An initial string added before the first value only.
    no_dup VARCHAR2(1) - A flag. Duplicates are suppressed if this is Y
    

    Example usage

    --vertical list of comma separated values, no duplicates.
    SELECT CLOBlist(CLOBlistParam(column_name,chr(10)||',','','Y')) FROM user_tab_columns
    --simple csv
    SELECT CLOBlist(CLOBlistParam(table_name,',','','N')) FROM user_tables
    

    Link to Gist is below.

    https://gist.github.com/peter-genesys/d203bfb3d88d5a5664a86ea6ee34eeca]1


    -- Program  : CLOBlist 
    -- Name     : CLOB list 
    -- Author   : Peter Burgess
    -- Purpose  : CLOB list aggregation function for SQL
    -- RETURNS CLOB - to allow for more than 4000 chars to be returned by SQL
    -- NEW type CLOBlistParam  - allows for definition of the delimiter, and initiator of sequence
    ------------------------------------------------------------------
    --This is an aggregating function for use in SQL.
    --It takes the argument and creates a comma delimited list of each instance.
    
    WHENEVER SQLERROR CONTINUE
    DROP TYPE CLOBlistImpl;
    WHENEVER SQLERROR EXIT FAILURE ROLLBACK
    
    create or replace type CLOBlistParam as object(
      string    VARCHAR2(4000)
     ,delimiter VARCHAR2(100)  
     ,initiator VARCHAR2(100)  
     ,no_dup    VARCHAR2(1)    )
    /
    show error
    
    --Creating CLOBlist()
    --Implement the type CLOBlistImpl to contain the ODCIAggregate routines.
    create or replace type CLOBlistImpl as object
    (
      g_list CLOB, -- progressive concatenation
      static function ODCIAggregateInitialize(sctx IN OUT CLOBlistImpl)
        return number,
      member function ODCIAggregateIterate(self  IN OUT CLOBlistImpl
                                         , value IN     CLOBlistParam) return number,
      member function ODCIAggregateTerminate(self        IN  CLOBlistImpl
                                           , returnValue OUT CLOB
                                           , flags       IN  number) return number,
      member function ODCIAggregateMerge(self IN OUT CLOBlistImpl
                                       , ctx2 IN     CLOBlistImpl) return number
    )
    /
    show error
    
    
    --Implement the type body for CLOBlistImpl.
    create or replace type body CLOBlistImpl is
    static function ODCIAggregateInitialize(sctx IN OUT CLOBlistImpl)
    return number is
    begin
    
      sctx := CLOBlistImpl(TO_CHAR(NULL));
      return ODCIConst.Success;
    end;
    
    member function ODCIAggregateIterate(self  IN OUT CLOBlistImpl
                                       , value IN     CLOBlistParam) return number is
    begin
    
       IF self.g_list IS NULL THEN
         self.g_list := value.initiator||value.string;
       ELSIF value.no_dup = 'Y' AND
             value.delimiter||self.g_list||value.delimiter LIKE '%'||value.delimiter||value.string||value.delimiter||'%' 
             THEN
         --Do not include duplicate value    
         NULL;
      ELSE
         self.g_list := self.g_list||value.delimiter||value.string;
       END IF;
    
      return ODCIConst.Success;
    end;
    
    member function ODCIAggregateTerminate(self        IN  CLOBlistImpl
                                         , returnValue OUT CLOB
                                         , flags       IN  number) return number is
    begin
      returnValue := self.g_list;
      return ODCIConst.Success;
    end;
    
    member function ODCIAggregateMerge(self IN OUT CLOBlistImpl
                                     , ctx2 IN     CLOBlistImpl) return number is
    begin
    
      self.g_list := LTRIM( self.g_list||','||ctx2.g_list,',');
    
      return ODCIConst.Success;
    end;
    end;
    /
    show error
    
    --Using CLOBlist() to create a vertical list of comma separated values
    
    --  SELECT CLOBlist(CLOBlistParam(product_code,chr(10)||',','','Y'))
    --  FROM   account
    
    
    --DROP FUNCTION CLOBlist
    --/
    
    PROMPT Create the user-defined aggregate.
    CREATE OR REPLACE FUNCTION CLOBlist (input CLOBlistParam) RETURN CLOB
    PARALLEL_ENABLE AGGREGATE USING CLOBlistImpl;
    /
    show error
    
  • I know it's sometime after the original posting, but this was the first spot I found after Googling for an answer to the same problem and thought someone else who landed here might be happy to find a succinct answer that doesn't rely on overly complicated queries or regexes.

    This will give you the desired result:

    with nums as (
      select distinct num2 distinct_nums
      from listaggtest
      order by num2
    ) select num1,
             (select listagg(distinct_nums, '-') within group (order by 1) from nums) nums2list 
             from listaggtest;
    
  • My idea is to implement a stored function like this:

    CREATE TYPE LISTAGG_DISTINCT_PARAMS AS OBJECT (ELEMENTO VARCHAR2(2000), SEPARATORE VARCHAR2(10));
    
    CREATE TYPE T_LISTA_ELEMENTI AS TABLE OF VARCHAR2(2000);
    
    CREATE TYPE T_LISTAGG_DISTINCT AS OBJECT (
    
        LISTA_ELEMENTI T_LISTA_ELEMENTI,
            SEPARATORE VARCHAR2(10),
    
        STATIC FUNCTION ODCIAGGREGATEINITIALIZE(SCTX  IN OUT            T_LISTAGG_DISTINCT) 
                        RETURN NUMBER,
    
        MEMBER FUNCTION ODCIAGGREGATEITERATE   (SELF  IN OUT            T_LISTAGG_DISTINCT, 
                                                VALUE IN                    LISTAGG_DISTINCT_PARAMS ) 
                        RETURN NUMBER,
    
        MEMBER FUNCTION ODCIAGGREGATETERMINATE (SELF         IN     T_LISTAGG_DISTINCT,
                                                RETURN_VALUE OUT    VARCHAR2, 
                                                FLAGS        IN     NUMBER      )
                        RETURN NUMBER,
    
        MEMBER FUNCTION ODCIAGGREGATEMERGE       (SELF               IN OUT T_LISTAGG_DISTINCT,
                                                                                            CTX2                 IN         T_LISTAGG_DISTINCT    )
                        RETURN NUMBER
    );
    
    CREATE OR REPLACE TYPE BODY T_LISTAGG_DISTINCT IS 
    
        STATIC FUNCTION ODCIAGGREGATEINITIALIZE(SCTX IN OUT T_LISTAGG_DISTINCT) RETURN NUMBER IS 
        BEGIN
                    SCTX := T_LISTAGG_DISTINCT(T_LISTA_ELEMENTI() , ',');
            RETURN ODCICONST.SUCCESS;
        END;
    
        MEMBER FUNCTION ODCIAGGREGATEITERATE(SELF IN OUT T_LISTAGG_DISTINCT, VALUE IN LISTAGG_DISTINCT_PARAMS) RETURN NUMBER IS
        BEGIN
    
                    IF VALUE.ELEMENTO IS NOT NULL THEN
                            SELF.LISTA_ELEMENTI.EXTEND;
                            SELF.LISTA_ELEMENTI(SELF.LISTA_ELEMENTI.LAST) := TO_CHAR(VALUE.ELEMENTO);
                            SELF.LISTA_ELEMENTI:= SELF.LISTA_ELEMENTI MULTISET UNION DISTINCT SELF.LISTA_ELEMENTI;
                            SELF.SEPARATORE := VALUE.SEPARATORE;
                    END IF;
            RETURN ODCICONST.SUCCESS;
        END;
    
        MEMBER FUNCTION ODCIAGGREGATETERMINATE(SELF IN T_LISTAGG_DISTINCT, RETURN_VALUE OUT VARCHAR2, FLAGS IN NUMBER) RETURN NUMBER IS
          STRINGA_OUTPUT            CLOB:='';
                LISTA_OUTPUT                T_LISTA_ELEMENTI;
                TERMINATORE                 VARCHAR2(3):='...';
                LUNGHEZZA_MAX           NUMBER:=4000;
        BEGIN
    
                    IF SELF.LISTA_ELEMENTI.EXISTS(1) THEN -- se esiste almeno un elemento nella lista
    
                            -- inizializza una nuova lista di appoggio
                            LISTA_OUTPUT := T_LISTA_ELEMENTI();
    
                            -- riversamento dei soli elementi in DISTINCT
                            LISTA_OUTPUT := SELF.LISTA_ELEMENTI MULTISET UNION DISTINCT SELF.LISTA_ELEMENTI;
    
                            -- ordinamento degli elementi
                            SELECT CAST(MULTISET(SELECT * FROM TABLE(LISTA_OUTPUT) ORDER BY 1 ) AS T_LISTA_ELEMENTI ) INTO LISTA_OUTPUT FROM DUAL;
    
                            -- concatenazione in una stringa                        
                            FOR I IN LISTA_OUTPUT.FIRST .. LISTA_OUTPUT.LAST - 1
                            LOOP
                                STRINGA_OUTPUT := STRINGA_OUTPUT || LISTA_OUTPUT(I) || SELF.SEPARATORE;
                            END LOOP;
                            STRINGA_OUTPUT := STRINGA_OUTPUT || LISTA_OUTPUT(LISTA_OUTPUT.LAST);
    
                            -- se la stringa supera la dimensione massima impostata, tronca e termina con un terminatore
                            IF LENGTH(STRINGA_OUTPUT) > LUNGHEZZA_MAX THEN
                                        RETURN_VALUE := SUBSTR(STRINGA_OUTPUT, 0, LUNGHEZZA_MAX - LENGTH(TERMINATORE)) || TERMINATORE;
                            ELSE
                                        RETURN_VALUE:=STRINGA_OUTPUT;
                            END IF;
    
                    ELSE -- se non esiste nessun elemento, restituisci NULL
    
                            RETURN_VALUE := NULL;
    
                    END IF;
    
            RETURN ODCICONST.SUCCESS;
        END;
    
        MEMBER FUNCTION ODCIAGGREGATEMERGE(SELF IN OUT T_LISTAGG_DISTINCT, CTX2 IN T_LISTAGG_DISTINCT) RETURN NUMBER IS
        BEGIN
            RETURN ODCICONST.SUCCESS;
        END;
    
    END; -- fine corpo
    
    CREATE
    FUNCTION LISTAGG_DISTINCT (INPUT LISTAGG_DISTINCT_PARAMS) RETURN VARCHAR2
        PARALLEL_ENABLE AGGREGATE USING T_LISTAGG_DISTINCT;
    
    // Example
    SELECT LISTAGG_DISTINCT(LISTAGG_DISTINCT_PARAMS(OWNER, ', ')) AS LISTA_OWNER
    FROM SYS.ALL_OBJECTS;
    

    I'm sorry, but in some case (for a very big set), Oracle could return this error:

    Object or Collection value was too large. The size of the value
    might have exceeded 30k in a SORT context, or the size might be
    too big for available memory.
    

    but I think this is a good point of start ;)

    Note that the OP had their own custom `LISTAGG` function already; they were explicitly trying to see if they could find an efficient way to do this using the built-in `LISTAGG` function available as of version 11.2.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM