IBM Netezza: The Importance of Data Distribution for Optimal Performance
What is Netezza?
Netezza is a dedicated data warehouse appliance built on a proprietary architecture called Asymmetric Massively Parallel Processing (AMPP), which combines open, blade-based servers and disk storage with a proprietary data filtering process that uses field-programmable gate arrays (FPGAs). Netezza integrates database, server, and storage, interconnected by a powerful network fabric, into a single, easy-to-manage system. Because it requires minimal setup and ongoing administration, it leads to shorter deployment cycles and faster time to value for business analytics.
This article explains how Netezza stores data and why data distribution is so important for achieving optimal performance in a Netezza system.
Let’s first look at several of the key benefits associated with Netezza systems, which include:
- Speed
10-100x faster query performance than traditional systems.
- Simple
Delivers high performance out of the box, with no indexing or tuning required.
- Scalable
Provides predictable and linear scalability based on user data capacity needs.
- Smart
The embedded in-database analytics platform integrates a robust set of built-in analytics with leading analytic tools from vendors such as Revolution Analytics, SAS, IBM SPSS®, Fuzzy Logix, and Zementis, all on IBM Netezza’s core data warehouse appliances.
Before we discuss the data distribution mechanism, let us first understand how Netezza stores data on disk. Each Snippet Processor in a Snippet Processing Unit (SPU) has a dedicated hard drive, and the data on this drive is called a data slice. Each disk is divided into three partitions: Primary (user data), Mirror, and Temp (intermediate processing data). The user data and temp space from each primary partition are copied to the mirror partition on another disk; this copying is called replication. Tables are split across SPUs and data slices. Within a data slice, data is stored in groups of rows and compressed based on identical column values (columnar compression).
The actual distribution of data across disks is determined by the distribution key specified in the table definition. There are two distribution methods: hash and random. If the DISTRIBUTE ON clause is not specified, the system defaults to hash distribution on the first column of the table.
A distribution key can include at most four columns. When the system creates records, it assigns each one to a logical data slice based on its distribution key value.
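The assignment of rows to data slices described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Netezza's actual hash function is internal to the appliance, so MD5 stands in for it here, and the slice count of 8, the `hash_slice` and `random_slice` helpers, and the `customer_id` column are all hypothetical.

```python
import hashlib
import random

NUM_SLICES = 8  # hypothetical number of data slices


def hash_slice(key_values, num_slices=NUM_SLICES):
    """Map a distribution key (a tuple of 1 to 4 column values) to a data slice.

    MD5 is a stand-in; it is NOT the hash function Netezza uses internally.
    """
    if not 1 <= len(key_values) <= 4:
        raise ValueError("a distribution key must have between 1 and 4 columns")
    digest = hashlib.md5(repr(tuple(key_values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def random_slice(num_slices=NUM_SLICES, rng=random):
    """Random distribution: the slice does not depend on the row's values."""
    return rng.randrange(num_slices)


# Distribute 10,000 rows on a high-cardinality key (a hypothetical customer_id).
counts = [0] * NUM_SLICES
for customer_id in range(10_000):
    counts[hash_slice((customer_id,))] += 1

print(counts)  # row count per slice; roughly equal for a high-cardinality key
```

The property that matters with hash distribution is that the same key value always maps to the same slice, so rows sharing a key end up stored together. Random distribution gives up that property in exchange for an even spread regardless of the data values.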
The performance of the system depends directly on how uniformly user data is distributed across all of the data slices in the system. As depicted in the graphic below, the overall response time is determined by the slowest-performing data slice (S-Blade), because a query finishes only when every slice has finished its share of the work. In the diagram this would be slice 2.
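A small back-of-the-envelope model makes this concrete. The sketch below assumes, purely for illustration, that every slice scans its rows in parallel at the same fixed rate; the `query_time` function, the scan rate, and the row counts are hypothetical and are not measurements from a real system.

```python
def query_time(rows_per_slice, seconds_per_million_rows=1.0):
    """Estimate elapsed time for a parallel scan.

    All slices work at the same time, so the elapsed time is set by the
    slice holding the most rows, not by the average.
    """
    return max(rows_per_slice) / 1_000_000 * seconds_per_million_rows


# Two hypothetical layouts of the same 4 million rows across 4 slices.
even = [1_000_000, 1_000_000, 1_000_000, 1_000_000]
skewed = [500_000, 2_500_000, 500_000, 500_000]

print(query_time(even))    # 1.0
print(query_time(skewed))  # 2.5
```

Both layouts hold the same total number of rows, yet in this model the skewed layout takes 2.5 times as long because three slices sit idle while the overloaded one finishes.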
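Processing skew occurs, for example, when the selected distribution key is a Boolean-like column with values such as True/False or Y/N. Because such a column has only two distinct values, it produces only two hash values, so all rows are sent to at most two data slices while the remaining slices hold none.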
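The effect of a two-valued key can be reproduced with a short simulation. As before, this is a sketch under assumptions: MD5 stands in for Netezza's internal hash function, and the 8-slice system, the helper functions, and the sample data are all hypothetical.

```python
import hashlib
from collections import Counter


def hash_slice(value, num_slices):
    """Map one key value to a slice (MD5 is a stand-in, not Netezza's hash)."""
    digest = hashlib.md5(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def rows_per_slice(values, num_slices):
    """Count how many rows land on each slice for a list of key values."""
    counts = Counter(hash_slice(v, num_slices) for v in values)
    return [counts.get(s, 0) for s in range(num_slices)]


def skew_ratio(counts):
    """Largest slice divided by the mean; 1.0 means a perfectly even spread."""
    return max(counts) / (sum(counts) / len(counts))


NUM_SLICES = 8  # hypothetical number of data slices

flags = ["Y" if i % 3 else "N" for i in range(9_000)]  # two distinct values
ids = list(range(9_000))                               # 9,000 distinct values

flag_counts = rows_per_slice(flags, NUM_SLICES)
id_counts = rows_per_slice(ids, NUM_SLICES)

print(flag_counts, skew_ratio(flag_counts))  # at most two non-empty slices
print(id_counts, skew_ratio(id_counts))      # close to an even spread
```

With the Y/N flag as the key, no more than two slices can ever receive rows, however many slices the system has. Distributing the same rows on a high-cardinality column keeps the ratio close to 1.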
A distribution method that spreads data evenly across all data slices is the single most important factor influencing overall performance. A poor choice of distribution method or key can leave a table unevenly distributed across data slices and SPUs (skew) and can force data to be redistributed or broadcast, both of which create bottlenecks and degrade performance. It is therefore extremely important to identify the right distribution key and method to ensure effective data distribution.
IBM Cognos and IBM SPSS with Netezza
The combination of IBM Cognos Business Intelligence and IBM SPSS Modeler with a Netezza enterprise data warehousing appliance ensures the fastest distribution of the best information to your entire business, accelerating decision-making for better business results.
While there are many other important topics that could be discussed in terms of an efficient Netezza implementation, the one illustrated in this article should provide a good foundation for some of your future development work. For more details on important considerations when selecting a distribution key and method, or if you would like to see a particular topic discussed or have specific questions, feel free to contact us.