Netezza – Design Best Practices & Guidelines
Today’s big data challenges for both transactions and analytics are increasing demands on data systems. Traditional data warehouses sometimes struggle as they are often NOT designed to meet the demands of advanced analytics on big data. That’s where solutions like Netezza come in.
IBM PureData for Analytics (formerly Netezza) is a data warehouse appliance that has a purpose-built analytics engine and an integrated database, server, and storage. With simple deployment, out-of-the-box optimization, no tuning, and minimal ongoing maintenance, the IBM Netezza data warehouse appliance has the industry’s fastest time-to-value and lowest total cost of-ownership.
In order to leverage the Netezza box to its fullest potential and achieve optimum performance, we recommend the best practices and guidelines listed below:
- The distribution of the data across the various disks is the single most important factor that can impact performance. Consider the below points when choosing the distribution key:
- Column(s) with high cardinality.
- Column(s) that will frequently be used in join conditions.
- Avoid using a boolean column which causes data skew.
- Avoid distributing the tables on columns that are often used in the where clause as it will cause processing skew. Date columns would be an example of where not to use this, especially in DW environments.
- Select the common key between the Dim and Fact tables if possible; if not select the key to ensure that the larger table (Fact) is not redistributed.
- Choose Random Distribution only as the last resort as it will more often lead to a table being redistributed or broadcasted. This is okay for a small table but will impact performance if done to large tables.
- Use Char(x) instead of Varchar(x) when you expect the data to be a fixed length as this not only helps to save disk space but also helps performance due to reduced I/O.
- Where possible, use the NOT NULL constraint for columns in the tables, especially for columns that are used in where clauses (Restriction) or join conditions.
- Use Same Data Type and Length for columns that are often used for joining tables.
- The largest table should be joined last in the query.
- Use joins over sub queries.
- Create materialized views for vertically partitioning small sets of columns that are often used in queries. The Optimizer automatically decides when to use the MV or the underlying table.
- Sorting the MV using the Order By clause on the most restrictive column will have the effect of adding an index.
- It creates additional overhead during Inserts/Updates/Deletes.
- It uses additional disk space.
- It requires regular maintenance.
- Create thin and as few MV’s as possible since:
- Prior to a full backup it is Important to run GROOM to ensure all unwanted tables and records, like deletes and updates, are cleared prior to the backup.
- After running GROOM, run Generate Statistics to ensure the Optimizer has the most updated statistics.
- When you use Drop External Tables it only clears the relevant entries in the System Catalog tables and does not delete the physical files on the disk (generally not on the host). These will have to be manually deleted.
- If the data does not change much on a daily basis, it is recommended to perform a daily incremental backup and a weekly full backup.
- A full restore cannot be done against an existing database. You need to be drop it or create a new database for the full restore to work.
- Use Security – Control at 2 levels. Create relevant groups and associate users to them accordingly:
- Primary Level – Linux user level
- Secondary Level – Database level
- Use Explain Verbose to check the Detailed Query Plan. Look for Broadcast/Redistribute on the bigger tables. If this is occurring then you should either rewrite the query or reconsider revisiting the Distribution Key to optimize performance.
IBM Netezza Analytics is designed to quickly and effectively provide better and faster answers to the most sophisticated business questions. It is IBM Netezza’s most powerful advanced analytics platform that provides the technology infrastructure to support enterprise deployment of in-database analytics. The analytics platform allows integration of its robust set of built-in analytics with leading analytic tools from such vendors as Revolution Analytics, SAS, IBM® SPSS®, Fuzzy Logix, and Zementis on IBM Netezza’s core data warehouse appliances. It enables analytic enterprises to realize significant business value from new business models and helps companies realize both top-line revenue growth and bottom-line cost savings.
The Ironside Information Management team can help you get Netezza set up in your analytics environment and can advise you on the best deployment strategies for this powerful platform. Let us know how we can help you improve your data handling infrastructure.