Tuesday, December 18, 2012

What is Slowly Changing Dimensions (SCD) ?


What is Slowly Changing Dimensions (SCD) ?
                                                                                                                                                                       
Slowly Changing Dimensions

             Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; People change their names for some reason; Country and State names may change over time. These are a few examples of Slowly Changing Dimensions since some changes are happening to them over a period of time.

           Slowly Changing Dimensions are often categorized into three types namely Type1, Type2 and Type3. The following section deals with how to capture and handling these changes over time.
The "Product" table mentioned below contains a product named, Product1 with Product ID being the primary key. In the year 2004, the price of Product1 was $150 and over the time, Product1's price changes from $150 to $350. With this information, let us explain the three types of Slowly Changing Dimensions.

Product Price in 2004:
Product ID(PK) Year Product Name Product Price
1 2004 Product1 $150


1.SCD TYPE1(Slowly Changing Dimension) : contains current data.

2.SCD TYPE2(Slowly Changing Dimension) : contains current data + complete historical data.

3.SCD TYPE3(Slowly Changing Dimension) : contains current data + one type historical data.

Type 1: Overwriting the old values.

           In the year 2005, if the price of the product changes to $250, then the old values of the columns "Year" and "Product Price" have to be updated and replaced with the new values. In this Type 1, there is no way to find out the old value of the product "Product1" in year 2004 since the table now contains only the new price and year information.

Product
Product ID(PK) Year Product Name Product Price
1 2005 Product1 $250

Type 2: Creating an additional record.

               In this Type 2, the old values will not be replaced but a new row containing the new values will be added to the product table. So at any point of time, the difference between the old values and new values can be retrieved and easily be compared. This would be very useful for reporting purposes.
Product

Product ID(PK) Year Product Name Product Price
1 2004 Product1 $150
1 2005 Product1 $250
             The problem with the above mentioned data structure is "Product ID" cannot store duplicate values of "Product1" since "Product ID" is the primary key. Also, the current data structure doesn't clearly specify the effective date and expiry date of Product1 like when the change to its price happened. So, it would be better to change the current data structure to overcome the above primary key violation.

Product
Product ID(PK) Effective
DateTime(PK) Year Product Name Product Price Expiry
DateTime
1 01-01-2004 12.00AM 2004 Product1 $150 12-31-2004 11.59PM
1 01-01-2005 12.00AM 2005 Product1 $250

                      In the changed Product table's Data structure, "Product ID" and "Effective DateTime" are composite primary keys. So there would be no violation of primary key constraint. Addition of new columns, "Effective DateTime" and "Expiry DateTime" provides the information about the product's effective date and expiry date which adds more clarity and enhances the scope of this table. Type2 approach may need additional space in the data base, since for every changed record, an additional row has to be stored. Since dimensions are not that big in the real world, additional space is negligible.
  
Type 3: Creating new fields. 

                 In this Type 3, the latest update to the changed values can be seen. Example mentioned below illustrates how to add new columns and keep track of the changes. From that, we are able to see the current price and the previous price of the product, Product1.

Product
Product ID(PK) Current
Year Product
Name Current
Product Price Old Product
Price Old Year
1 2005 Product1 $250 $150 2004

            The problem with the Type 3 approach, is over years, if the product price continuously changes, then the complete history may not be stored, only the latest change will be stored. For example, in year 2006, if the product1's price changes to $350, then we would not be able to see the complete history of 2004 prices, since the old values would have been updated with 2005 product information.
Product
Product ID(PK) Year Product
Name Product
Price Old Product
Price Old Year
1 2006 Product1 $350 $250 2005


Example: 
            In order to store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built. So source systems will be different in naming conventions, variable measurements, encoding structures, and physical attributes of data. Consider a bank that has got several branches in several countries, has millions of customers and the lines of business of the enterprise are savings, and loans. The following example explains how the data is integrated from source systems to target systems.

Example of Source Data

System Name Attribute Name Column Name Datatype Values
Source System 1 Customer Application Date CUSTOMER_APPLICATION_DATE NUMERIC(8,0) 11012005
Source System 2 Customer Application Date CUST_APPLICATION_DATE DATE 11012005
Source System 3 Application Date APPLICATION_DATE DATE 01NOV2005
In the aforementioned example, attribute name, column name, datatype and values are entirely different from one source system to another. This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.

Example of Target Data(Data Warehouse)  

Target System Attribute Name Column Name Datatype Values
Record #1 Customer Application Date CUSTOMER_APPLICATION_DATE DATE 01112005
Record #2 Customer Application Date CUSTOMER_APPLICATION_DATE DATE 01112005
Record #3 Customer Application Date CUSTOMER_APPLICATION_DATE DATE 01112005
In the above example of target data, attribute names, column names, and datatypes are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse.

2 comments:

Thank you :
- kareem