Understanding Star Schema
If your house is anything like mine, it could be called Google: you have to search for anything you want. Even if your house is well organized, finding a book by title or author means sifting through the bookcase. Counting all of your books, however, regardless of where they sit in the house, is a far more daunting effort. To complicate things further, what if you also wanted to categorize them as fiction, non-fiction, and so on? Imagine the chaos if several people decided to count all of your books at the same time. This is what happens when analytical reporting is based on a transaction processing system.
Online Transaction Processing (OLTP) is similar to picking out a single book. Analytical reporting more closely resembles counting the books. In an OLTP system, a transaction processes a single record or a small set of records, clearly indexed and identifiable by keys. The optimal database structure for this is 3rd Normal Form, where redundant data is eliminated and stored in separate entities with defined relationships. Reporting an analytical measure from 3rd Normal Form is akin to counting specific books across an entire house: clearly an inefficient approach.
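To make the contrast concrete, here is a minimal sketch of a normalized book catalog; the table and column names are hypothetical, not from any particular system. Notice that even a simple count forces a join across entities, and every report repeats that work.

    -- A hypothetical 3rd Normal Form layout: each entity in its own table.
    CREATE TABLE category (category_id INT PRIMARY KEY, category_name VARCHAR(50));
    CREATE TABLE author   (author_id   INT PRIMARY KEY, author_name   VARCHAR(100));
    CREATE TABLE book     (book_id     INT PRIMARY KEY,
                           title       VARCHAR(200),
                           author_id   INT REFERENCES author(author_id),
                           category_id INT REFERENCES category(category_id));

    -- Counting fiction books means joining across the normalized entities.
    SELECT COUNT(*)
    FROM book b
    JOIN category c ON c.category_id = b.category_id
    WHERE c.category_name = 'Fiction';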
The star schema design is a proven structure for optimizing analytical reporting. There is one central fact table connected to several dimensions. The dimensions have relationships only to the fact table, not to each other. The design resembles a multi-pointed star, hence the name. Data is extracted, transformed, and loaded (ETL) from the source transactional system into a separate star schema.
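As an illustration, a star schema for the book example might look like the sketch below; the names are again hypothetical. The fact table holds the measure plus foreign keys to the dimensions, and each dimension is deliberately denormalized into one flat, descriptive table.

    -- Dimensions: flat, descriptive, related only to the fact table.
    CREATE TABLE dim_book (book_key      INT PRIMARY KEY,
                           title         VARCHAR(200),
                           author_name   VARCHAR(100),
                           category_name VARCHAR(50));  -- e.g. Fiction, Non-Fiction
    CREATE TABLE dim_date (date_key      INT PRIMARY KEY,
                           calendar_date DATE,
                           week_of_year  INT,
                           month_name    VARCHAR(20));

    -- The central fact table: one measure, keyed to its dimensions.
    CREATE TABLE fact_book_inventory (
        book_key   INT REFERENCES dim_book(book_key),
        date_key   INT REFERENCES dim_date(date_key),
        book_count INT);  -- the measure: books on hand for that book and date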
In our case, the count of books is the fact; the book itself, with its title, author, and category (fiction or non-fiction), is a dimension. Since all facts are time-contextual, a time dimension is mandatory in a star schema. Querying, slicing, dicing, and analyzing data in a star schema is intuitive and unambiguous. Measures, such as the number of fiction books as of last week or as of last month, can be retrieved with a simple query. Database vendors have incorporated star schema tuning methods and tools into their database engines. Scalability is no longer an issue: a 100-million-row fact table is not as intimidating as it was a few years ago.
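Against the sketch above, "how many fiction books did I have as of last week" is indeed one short query; the date literal here is a placeholder.

    SELECT SUM(f.book_count)
    FROM fact_book_inventory f
    JOIN dim_book d ON d.book_key = f.book_key
    JOIN dim_date t ON t.date_key = f.date_key
    WHERE d.category_name = 'Fiction'
      AND t.calendar_date = DATE '2024-06-30';  -- "as of last week"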
Sophisticated Business Intelligence (BI) tools favor star schemas. The metadata modeling, and therefore the SQL these tools generate, is more predictable and efficient. Numerous reporting queries that would be executed repeatedly against a normalized database are instead executed once, in an ETL step, to load the star schema. The performance gain is significant.
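Sketched against the hypothetical tables above, the idea looks like this: the joins across the normalized entities run once, in the dimension load, and each snapshot load adds fact rows. Every subsequent report then reads the pre-joined, pre-stored data.

    -- Dimension load: the author/category joins run once here, not in every report.
    INSERT INTO dim_book (book_key, title, author_name, category_name)
    SELECT b.book_id, b.title, a.author_name, c.category_name
    FROM book b
    JOIN author a   ON a.author_id   = b.author_id
    JOIN category c ON c.category_id = b.category_id;

    -- Fact load: one snapshot row per book per load date.
    -- 20240630 is a hypothetical date surrogate key, assumed present in dim_date.
    INSERT INTO fact_book_inventory (book_key, date_key, book_count)
    SELECT b.book_id, 20240630, 1
    FROM book b;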
Beyond the technical advantages, star schemas promote information sharing within an organization, which is a huge benefit. Since the measures are stored rather than calculated on the fly, reconciliation and rollback are simpler undertakings. Sharing the measures across different functions or applications lets users analyze harmonized, consistent information. The star schema approach also decouples the reporting solution from the data sources: changes such as the addition or consolidation of sources can be incorporated in the ETL procedure with minimal changes to the star schema itself.
There is a cost to creating and maintaining a star schema, and there are challenges in the design process. Stored procedures or other ETL steps need to be developed and scheduled as batch loads. Additional storage and resources are required to sustain the database hosting the star schemas. The benefits outlined above, however, outweigh the costs in the long run. Ultimately, a reporting solution based on star schemas is the first step toward enterprise-wide information democracy.