ETL Life Cycle
The typical real-life ETL cycle consists of the following execution steps (a minimal orchestration sketch follows the list):
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for
data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; in case of failure, these reports help to diagnose and repair)
8. Publish (to target tables)
9. Archive
10. Clean up
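As a rough illustration only, here is a minimal Python sketch of how these steps might be sequenced; the step names and the registry of callables are assumptions made for the example, not part of any particular tool.

```python
import logging
from datetime import datetime, timezone

# Hypothetical step functions registered by name; the names mirror the
# cycle steps above and are illustrative only.
STEP_ORDER = [
    "build_reference_data", "extract", "validate", "transform",
    "stage", "audit_reports", "publish", "archive", "clean_up",
]

def run_cycle(registry):
    """Run one ETL cycle, aborting at the first failing step."""
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    logging.info("cycle %s initiated", run_id)
    for name in STEP_ORDER:
        try:
            registry[name](run_id)  # each step receives the run identifier
            logging.info("step %s completed", name)
        except Exception:
            logging.exception("step %s failed; aborting cycle %s", name, run_id)
            raise
    logging.info("cycle %s finished", run_id)

# Example: run_cycle({name: (lambda run_id: None) for name in STEP_ORDER})
```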
Best practices
Four-layered approach for ETL architecture design
- Functional layer: Core functional ETL processing
(extract, transform, and load).
- Operational management layer: Job-stream definition and management,
parameters, scheduling, monitoring, communication and alerting.
- Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, reject and error handling, code management.
- Utility layer: Common components supporting all other layers (a brief code sketch of this layering follows).
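One way the four layers might be kept apart in code is sketched below; every function name here is an assumption chosen for illustration, and a real implementation would put each layer in its own module or service.

```python
from datetime import datetime, timezone

# Utility layer: common helpers shared by the other layers.
def utc_now():
    return datetime.now(timezone.utc).isoformat()

# Audit, balance and control (ABC) layer: job-execution statistics.
def record_stats(job_name, rows_in, rows_out):
    print(f"{utc_now()} {job_name}: in={rows_in} out={rows_out} rejects={rows_in - rows_out}")

# Functional layer: core transform logic (here, dropping incomplete rows).
def transform(rows):
    return [r for r in rows if r.get("amount") is not None]

# Operational management layer: job definition, sequencing, and monitoring.
def run_job(job_name, rows):
    out = transform(rows)
    record_stats(job_name, len(rows), len(out))
    return out
```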
Use file-based ETL processing where possible
- Storage costs relatively little
- Intermediate files serve multiple purposes (see the sketch after this list):
  - Used for testing and debugging
  - Used for restart and recovery processing
  - Used to calculate control statistics
- Helps to reduce dependencies - enables modular
programming.
- Allows flexibility for job execution and scheduling
- Better performance if coded properly, and can
take advantage of parallel processing capabilities when the need arises.
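A minimal sketch of file-based staging, assuming CSV intermediates in a local working directory (both assumptions for illustration): each stage's output can be re-read for restart, inspected for debugging, and its row count used as a simple control statistic.

```python
import csv
from pathlib import Path

STAGE_DIR = Path("work")  # assumed location for intermediate files

def write_stage(name, rows, fieldnames):
    """Persist one stage's output; the returned row count is a control statistic."""
    STAGE_DIR.mkdir(exist_ok=True)
    path = STAGE_DIR / f"{name}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

def read_stage(name):
    """Re-read a previous stage's output, e.g. to restart after a failure."""
    with (STAGE_DIR / f"{name}.csv").open(newline="") as f:
        return list(csv.DictReader(f))
```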
Use data-driven methods and minimize custom ETL coding
- Parameter-driven jobs, functions, and job-control
- Code definitions and mappings stored in the database
- Consider data-driven tables to support more complex code mappings and business-rule application (a mapping-table sketch follows this list).
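A hedged sketch of a data-driven code mapping: the code_map table, its columns, and the use of SQLite are assumptions made for illustration. The point is that the mapping lives in data, so new codes are handled by adding rows rather than editing ETL code.

```python
import sqlite3

def load_code_map(conn, mapping_name):
    """Read a named mapping from an assumed code_map(mapping_name, source_code, target_code) table."""
    rows = conn.execute(
        "SELECT source_code, target_code FROM code_map WHERE mapping_name = ?",
        (mapping_name,),
    )
    return dict(rows)

def apply_code_map(records, code_map, field):
    """Translate one field on each record; unknown codes pass through unchanged."""
    for rec in records:
        rec[field] = code_map.get(rec[field], rec[field])
    return records

# Example usage (assuming the table exists):
# conn = sqlite3.connect("etl_meta.db")
# country_map = load_code_map(conn, "country")
# apply_code_map(records, country_map, "country_code")
```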
Qualities of a good ETL architecture design:
- Performance
- Scalable
- Migratable
- Recoverable (run_id, ...)
- Operable (completion codes for phases, re-running from checkpoints, etc.; a checkpoint sketch follows this list)
- Auditable (in two dimensions: business
requirements and technical troubleshooting)
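To make the recoverability and operability points concrete, here is a sketch of run_id-based checkpointing, assuming a JSON file per run (the directory, file layout, and status codes are assumptions): each phase records a completion code, so a failed run can be resumed from the last successful phase.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # assumed location

def save_checkpoint(run_id, phase, status):
    """Record a completion code (e.g. "OK", "WARN", "FAIL") for one phase of a run."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{run_id}.json"
    state = json.loads(path.read_text()) if path.exists() else {}
    state[phase] = status
    path.write_text(json.dumps(state, indent=2))

def completed_phases(run_id):
    """Return the phases already finished, so a rerun can skip them."""
    path = CHECKPOINT_DIR / f"{run_id}.json"
    if not path.exists():
        return set()
    state = json.loads(path.read_text())
    return {phase for phase, status in state.items() if status == "OK"}
```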
Thank you.
- kareem