Perfect Data Warehouse

This post will outline the parameters and considerations needed to create a basic data warehouse.

With any data warehouse the aim is clear: to gather and process large amounts of data in the shortest space of time. If you can achieve this, the goal of a data warehouse has been reached.

Note that these parameters are not set in stone and some may need to be tweaked to fit your respective environment.

It is key to plan your data warehouse appropriately from the word go. During these initial steps, CPU and I/O are key; failing to gauge these correctly will result in poor performance later on when you load data or look to scale out your DW. Tuning and changing parameters can only do so much once the problem becomes underperforming hardware.

While no two environments are the same, a good starting point as per Oracle's recommendations is 200 MB/s of I/O throughput per CPU core. The link between the server and the storage should be Fibre Channel to accommodate this.
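For example, a server with 16 cores would need roughly 16 x 200 MB/s = 3.2 GB/s of aggregate throughput from the storage layer.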

db_block_checksum – TRUE
db_file_multiblock_read_count – 0
db_block_size – 32k
dml_locks – 1000
job_queue_processes – 2
log_buffer – 10485760 (10MB)
log_checkpoint_interval – 100000
log_checkpoint_timeout – 3600
open_cursors – 1000
processes – 500 (max number of concurrent users x2)
sessions – 1000 (number of processes x2)
session_cached_cursors – 500
shared_pool_reserved_size – 40MB
shared_pool_size – 400MB, or 0 for Oracle to manage
sga_target – 2G
sga_max_size – 2200MB
undo_retention – 90000
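As a rough sketch of how these could be applied on a running instance (assuming it uses an spfile; the values are the ones above and should be adjusted to your environment), the dynamic parameters take effect immediately, while static ones need SCOPE = SPFILE and a restart. Note that db_block_size can only be chosen when the database is created.

-- dynamic parameters: take effect immediately
ALTER SYSTEM SET open_cursors = 1000 SCOPE = BOTH;
ALTER SYSTEM SET undo_retention = 90000 SCOPE = BOTH;
-- static parameters: written to the spfile, applied at next restart
ALTER SYSTEM SET processes = 500 SCOPE = SPFILE;
ALTER SYSTEM SET sga_max_size = 2200M SCOPE = SPFILE;
ALTER SYSTEM SET sga_target = 2G SCOPE = SPFILE;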

Statistics gathering is a key part of a data warehouse. Gathering statistics allows the optimizer to make educated decisions on the best execution plan, thus reducing query run time. Tuning a data warehouse with a star schema enabled is relatively simple; without diving too much into every option, a simple configuration would be to create bitmap indexes on each foreign key column in the fact table or tables (see the sketch after the parameter list below).

dbms_stats.gather_schema_stats – GATHER AUTO
star_transformation_enabled – TRUE (if using a star schema)
optimizer_mode – ALL_ROWS
query_rewrite_enabled – TRUE
query_rewrite_integrity – TRUSTED
hash_join_enabled – TRUE
always_anti_join – HASH
always_semi_join – HASH
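To illustrate the two points above, here is a minimal sketch of gathering statistics for a schema and creating a bitmap index on a fact table foreign key column (the schema name DWH and the SALES/CUST_ID names are hypothetical):

-- gather statistics so the optimizer can choose good plans
BEGIN
  dbms_stats.gather_schema_stats(ownname => 'DWH', options => 'GATHER AUTO');
END;
/

-- bitmap index on a foreign key column of the fact table
CREATE BITMAP INDEX sales_cust_bix ON sales (cust_id);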

In order to take full advantage of the hardware provided, I would recommend using parallel execution for queries.

With regard to parallel execution, a good starting point is below:

Objects smaller than 200 MB should not use any parallelism

Objects between 200 MB and 5 GB should use a DOP of 4

Objects beyond 5 GB should use a DOP of 32

These sizings should be changed to fit your hardware restrictions.
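As a sketch of how those tiers might be applied (the table names are made up for the example):

ALTER TABLE customers_dim NOPARALLEL;   -- under 200 MB: no parallelism
ALTER TABLE sales PARALLEL 4;           -- 200 MB to 5 GB: DOP of 4
ALTER TABLE web_clicks PARALLEL 32;     -- over 5 GB: DOP of 32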

parallel_automatic_tuning – TRUE
parallel_threads_per_cpu – number of cores per CPU x2
cpu_count – xx (make sure this matches your actual core count)
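A quick way to confirm what the instance is actually using is to query v$parameter:

SELECT name, value
FROM v$parameter
WHERE name IN ('cpu_count', 'parallel_threads_per_cpu', 'parallel_automatic_tuning');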

Until next time.

 
