Wednesday, August 8, 2012

Distinct vs group by in Teradata


Distinct vs group by in Teradata .

There is always a debate going on when finding out unique values in a given table.  The problem comes into picture when we use Group by or distinct to find it.
Both return same number of rows , but with some execute  time difference between them. Execution time is always a very important factor considering performance as one of the major factors is teradata warehouse.

So which is more efficient ?? DISTINCT or GROUP BY???

Since DISTINCT redistributes the rows immediately, more data may move between the AMPs, where as  GROUP BY that only sends unique values between the AMPs. 
So, we can say that  GROUP BY sounds more efficient. 
 
    But when you assume that data is nearly unique in a table, GROUP BY will  spend more time attempting to eliminate duplicates that do not exist at all.Therefore, it is wasting its  time to check for duplicates the first time. Then, it must redistribute the same amount of data .

Let us see how these steps are used in each case for elimination of Duplicates
(can be found out using explain plan)

DISTINCT

1. It reads each row on AMP
2. Hashes the column value identified in the distinct clause of select statement.
3. Then redistributes the rows according to row value into appropriate AMP
4. Once  redistribution is completed , it
    a. Sorts data to group duplicates on each AMP
    b. Will remove all the duplicates on each amp and sends the original/unique value
 

P.s: There are cases when "Error : 2646 No more Spool Space
 " . In such cases try using GROUP BY.

GROUP BY

1. It reads all the rows part of GROUP BY
2. It will remove all duplicates in each AMP for given set of values using "BUCKETS" concept
3. Hashes the unique values on each AMP 
4. Then it will re-distribute them to particular /appropriate AMP's
5. Once  redistribution is completed , it
    a. Sorts data to group duplicates on each AMP
    b. Will remove all the duplicates on each amp and sends the original/unique value
 

Hence it is better to  go for
·                     GROUP BY  -  when Many duplicates
·                     DISTINCT        -  when few or no duplicates
·                     GROUP BY -  SPOOL space is exceeded

No comments:

Post a Comment

Thank you :
- kareem