I had the opportunity to attend the IBM PoT event for DB2 10.5 BLU. One of the big features for the DB2 10.5 release is the BLU Acceleration and how it can improve on the analytical environment .The columnar store data structure is not the new concept among database vendor, however it is interesting to see how IBM implement and position itself among the competitors.
What is BLU stands for ?
The internal Code name for BLU was Blink Ultra. Now it is called BLU.
Currently, it is stands for Big Data,Lightning fast and Ultra-easy.
But you know it can changed any time .
What edition of DB2 10.5 support DB2 BLU Acceleration ?
BLU Acceleration is a feature of DB2 and just like any database features it is only supported on the certain edition. Currently, it is only supported in :
- Advanced workgroup server edition
- Advanced Enterprise Server edition
BLU does not support in the DPF, HADR and purescale at this point. ( 2013 August) , it is in the high priority list within IBM.
What type work load is suitable for BLU
Analytic workload is what BLU like to address. NOT for the OLTP environment .
BLU Design Principles – 7 Big ideas
- Simplicity and Ease of Use: DB2 with Standard SQL support. No need for change SQL code or application.
- Extreme Compression: DB2 use Huffman encoding to compress the data.
- Column Organized Tables: Data storage and retrieval is optimal for queries that touch fewer columns
- Data Skipping: Efficiently ignores data in a column that is not required
- Core-friendly Parallelism
- Parallel Vector Processing: Multi-core and SIMD (Single Instruction Multiple Data) parallelism
- Scan-friendly Memory Caching: New algorism to keep the HOT data in the memory.
Big Idea1: Simplicity and Ease of Use
- Require automatic storage
- SET DB2_WORKLOAD=ANALYTICS –> This will make all the table created by default as columnar store and set the below register key
- Instance Parameter INTRA_PARALLEL=YES Enables intra-partition parallelism
- Database Parameter DFT_TABLE_ORG=COLUMN
- DFT_DEGREE=ANY : Enables intra-partition parallelism to use all detected cores
- PAGESIZE=32768 : Default page size for table space or buffer pool if not specified
- DFT_EXTENT_SZ=4 : The default extent size for a table space
- SORTHEAP=[default+n] : Private sort heap [set higher than the default]
- SHEAPTHRES_SHR=[default+n] : Shared sort heap [set higher than the default]
- UTIL_HEAP_SZ=[default+n] : Utility heap [set higher than the default]
- CATALOGCACHE_SZ=[default+n] : System catalog cache usage of the dbheap [set higher than the default]
- AUTO_REORG=ON : Enables automatic REORGs for space reclamation
- WLM objects : are created and set to maximize throughput
- Still support organize by row table in the columnar store database
- db2convert utility to convert table from row base to column base.
Big Idea2: BLU Compression -use Huffman encoding
- Variable number of bits of each char depends upon frequency verse non compression always use the 8 bits for each char. This dramatically reduce the size of the data.
- Deco the data with B-tree
- Late Materialization – Ability to operate on the data while it is still compressed .
Encoded values do not need to be decompressed during evaluation
(Predicate examples: =, <>,<, >, >=, <=, between, etc.)
Column Organized Tables
Create the columnar table –> Load data .
During the load, analyze phase would put the data into the columnar table and create the synopsis table.
In the traditional Non BLU env, we would like to batch the load process, however, in the BLU, we should try to fetch as much as data in the first load so the DB2 engine can create the most tightly columnar table and synopsis table for look up.
- NO index, MQT table.MDC are needed. DB2 would create the internal index but no need to reorg.
- If the DB2_WORKLOAD=ANALYTICS , DB2 would automatically reclaim the space upon the delete.
- NO RUNSTATS
- REORG command can be used to reclaim the space in the columnar table manually.
Each (logical ) core can process individual column in the single query until CTQ operator ( in the query plan).
Parallel Vector Processing –> SIMD
Modern CPU feature SIMD ( Single Instruction Multiple Data)
DB2 BLU has new algorism to maintain HOT data in the memory. Each columnar table has correspond synopsis table. With absence of the index, db2 use synopsis table to access data to achieve the data skipping.
Data studio enhancement
Data studio 4.1 can be used to analysis whether the row table is suitable for columnar table and how much space it will be saved.