Star Queries in Oracle8
An Oracle Technical White Paper
A data warehouse should be designed to satisfy the strategic needs of a corporation. The capabilities of strategic planners and analysts should not be constrained by the functionality or design of their data warehouses. Oracle8 provides a wide breadth of data warehousing features which allows Oracle's customers to build the application that most closely suits their business needs.
This paper describes one feature of the Oracle8 universal data server that addresses a common business need in a data warehouse: high performance for queries on a star schema. This paper defines star schemas and star queries, describes the optimal query processing approaches for star queries, and explains how Oracle8 has been enhanced for star queries.
Why use a Star Schema?
Almost all of the data in a data warehouse originated from an operational system. However, even if the data exists in a relational schema in an operational system, a transaction-oriented schema is not necessarily appropriate for a data-warehousing application. A typical operation in on-line transaction process (OLTP) is a "transaction." In most applications, a transaction consists of the retrieval, insertion, or update of only a handful of records. For example, a new purchase order requires information about the customer, and the products being ordered. Operational systems are typically optimized to allow the application to quickly locate these individual records.
The typical data warehousing operation involves a much larger number of records. The end-user of a data warehouse is not interested in a single purchase order. Instead, the end-user will want to examine all of the purchase orders for a given week, or for a given region. Data warehouses are used for a variety of purposes, from simple reporting to more complex analysis and sophisticated forecasting. But each of these data warehousing functions share the same characteristic of analyzing large amounts of data.
A highly normalized schema offers superior performance and efficient storage for OLTP schemas; and star schema provides similar benefits for data warehouses.
What is a Star Schema?
A "star schema" is a natural data representation for many data warehousing applications. A star schema contains one very large table (known as the FACT table), and multiple smaller tables (called DIMENSION tables).
For example, a star schema in a grocery store environment could contain the following tables and attributes (this example is from Chapter 2 of Ralph Kimball's The Data Warehouse Toolkit).
Example star schema
The SALES table contains information about the sales for each product at each store on a given day (e.g., the date of the transaction and the amount of total sales). The SALES table is very large, since a grocery-store chain could easily have millions of such entries per day.
In contrast, the TIME, PRODUCT, STORE, and PROMOTION tables are very small relative to
In this schema, SALES is the FACT table; product sales are the basic units of analysis for this data warehouse. The other tables (TIME, PRODUCT, STORE, and PROMOTION) are the DIMENSION tables. These DIMENSION tables provide more details of the sales transactions, describing information about which products were sold, when and where the sales occurred, and what promotions might have influenced the sale.
This star schema is an intuitive representation of multidimensional datacube in a relational environment. Each DIMENSION table represents a business dimension while the FACT table contains the contents of the datacube, which in this case is the sales information for each product. For example, an end user may want to examine sales by store by week, where STORE and TIME are dimensions of the query.
Because star schemas can intuitively model the business issues facing a data warehouse; and because star schemas provide excellent query performance for business analysis in some relational database environments, star schemas are becoming increasingly common.
STAR QUERY PROCESSING: TECHNOLOGY EVOLUTION
A data warehouse implementor could build a star schema in any relational database. After all, a star schema is simply a collection of tables, and the queries on a star schema are standard SQL. However, it is not enough for a relational database to simply store the data in a star schema. The relational database must also provide efficient query access. In particular, it must have a method for supporting the efficient execution of star queries.
Many relational databases lack the optimized execution methods described above. Version 6 of the Oracle Server, which first went into production in 1988, lacked such methods. Nearly ten years later, at this white papers writing in spring of 1997, several major relational database products continue to lack star query support.
Although any relational database can process a star query and provide the correct results, the performance for star queries can be abysmal when using database systems that do not support an efficient access method for star queries. This is because many of these relational databases have simple 'query optimizers' that do not understand how to best process a star query. The query optimizer is the 'brain' of the query-processing engine. For each query, the query optimizer determines the indexes and join methods that will be used to process the query. A simple query optimizer may be more than sufficient for an OLTP application, but because star queries are considerably different than queries that are typically found in OLTP applications, the first-generation query optimizers often choose very poor strategies for star queries.
These first-generation solutions commonly choose to join the fact table to a single dimension tables as the first step of query processing. The second step would be to join the intermediate result from the first step to the second dimension table. This process continues until all tables have been joined. The shortcoming of this approach is that the database is processing large amounts of data from the fact table multiple times (once for each join of a dimension table). Because the fact table is always the largest table, this execution strategy can be very expensive.
With the release of Oracle7 in 1992, Oracle introduced an improved method for executing star queries. Oracle's then-new query optimizer, the cost-based optimizer, was much better suited for the demands of complex data warehousing queries.
Oracles cost-based optimizer not only has information about the query and the available indexes and join methods, but it also has information about the number of rows and blocks in each referenced table, the size of its indexes, and the distribution of data in each individual column.
When presented with a star query, the cost-based optimizer recognizes that the fact table is very large and expensive to access relative to the cost of accessing the much smaller dimension tables, and it considers this cost when choosing the best execution strategy for the star query.
For many star queries, the best Oracle7 execution strategy is to initially join all of the smaller dimension tables together using Cartesian product joins, and then join the fact table in the last phase of the query execution. A Cartesian product join is used for joining two unrelated tables. The SALES table and the PRODUCT table are related (by the product_key column), but the PRODUCT table and the STORE table are not related, so these two dimension tables must be joined using a Cartesian product join. This Cartesian-product-based strategy is often optimal because the expensive fact-table access occurs only once.
However, Oracle7s cost-based optimizer does not blindly apply this star query optimization to every query on a star schema. The cost-based optimizer always evaluates many possible approaches for executing a given query. In some cases, the star query approach described above is not applicable. For example, the dimension tables may be large, so that building a Cartesian product of the dimension tables may be an expensive operation. In these cases, the cost-based optimizer recognizes that the star query approach is not the best solution for this query, and will choose an alternative execution strategy.
In each release of Oracle7, the query optimizer ability has been enhanced to execute star queries. In Release 7.2, the performance of Cartesian product joins was improved through more intelligent caching techniques. In Release 7.3, the optimizer became more adept at explicitly identifying star queries, particularly in star schemas with large numbers of dimension tables.
This second-generation technique represented a substantial improvement over first-generation databases. By intelligently choosing to access the fact table's data exactly once, the Oracle7 cost-based optimizer consistently chooses star query strategies that are two orders of magnitude faster than first-generation solutions.
The second-generation solutions primarily entailed improvements to the query optimizer.
That is, Oracle7 chose the best possible method for evaluating a star query using its
existing join methods and indexing techniques. A further area for improved star query
performance is to utilize new methods for indexing
A join index is one possible solution for processing star queries. Although there are many variations of join indexes, these indexes all share two common characteristics: the index structure spans more than one table, and the index is used to improve the performance of relational joins.
The concept of a join index can be easily illustrated by a simple example involving two tables: EMP, containing employee records, and DEPT, containing descriptions about each department. These two table share a foreign-key/primary-key relationship on the DEPTNO key.
The following query might be used to find all of the employees in a given department:
This query could be evaluated using a join index. For example, there could be a join index on the dept.department_name column and the emp table. The index structure would list, for each possible value of dept.department_name, the rows of the corresponding employees in the emp table.
In a star schema, a join index can be created across a single fact table and all of its
dimension tables. A join index eliminates the need for generation of the Cartesian product
of the dimension tables during a query. Instead of computing all of the possible
combinations of the dimension tables rows, as in the Cartesian product approach, a
join index provides immediate access to only the relevant rows of the fact table.
Although the join index addresses some of the performance deficiencies of the Cartesian product method, it is not without limitations.
Complexity of Join Indexes
A join index is a complex structure, and is necessarily more expensive to build and
maintain than a
In some implementations, it is impossible to update any of the indexed columns of a join index. Although a data warehouse is theoretically static, every data warehouse DBA knows that occasional updates are a reality imposed by business requirements. For example, a data warehouse may be updated every quarter to reflect a re-organization of company's sales territory definitions. If a join-index prevents the updating of these tables, then even minor modifications to the data may require rebuilding the entire index or even unloading and reloading the entire star schema.
Multi-Column Join Index
Join indexes are not flexible. Because join indexes contain multiple columns, they have several shortcomings.
A data warehouse could attempt to address both of these issues by utilizing multiple, multi-key join indexes over different combinations of dimension tables. However, multiple indexes are often unacceptably costly both in terms of space utilization and machine resources for maintaining multiple indexes.
Join Indexes are not a Total Solution
Join indexes are not the best method for every possible star query. Indexes are used to efficiently identify the specific rows of interest. It is true that most queries will only examine a small percentage of the rows in the fact table, and in these cases, a join-index will be effective. But other queries may examine the entire fact tables (for example, what are the total sales for all products, regions, and time periods by product, region, and month?). A join index is not the best method for queries that span most of the fact table, and alternative techniques should be available for these queries. Some database products utilize join indexes, but have a very simplistic query optimizer that blindly applies the join-index method to virtually every database query. A better solution is to provide multiple algorithms for evaluating star queries.
A join index is also not well-adapted for hybridized star schema. A join index may span a fact table, and all of its immediate dimension tables. However, a join index does not directly address the performance issues of a snowflake schema -- a more complex variation of a star schema in which each dimension may be a collection of related tables rather than a single table.
Oracle8s star query algorithm represents a significant improvement over previous star query algorithms. Rather than relying on a join index, Oracle8s algorithm utilizes single-table bitmap indexes, and integrates this bitmap technology seamlessly into a new star query join method.
This new algorithm provides excellent performance, while utilizing less storage (bitmap
STAR QUERY PROCESSING IN ORACLE8
In order to use the Oracle8 star query algorithm, a data warehouse administrator would build bitmap indexes on each of the foreign-key columns of the fact table. In the example schema, the administrator would build single-column bitmap indexes on sales(time_key), sales(promotion_key), sales(product_key), and sales(store_key).
The data warehouse administrator may choose to build additional indexes in this schema. Any column that commonly appears in the "where-clause" of an end-users SQL query is a candidate for an index, particularly those columns in the larger dimension tables, such as product. In this example schema, the administrator may choose to build B-tree indexes on product.sku_number and product.sku_description, and bitmap indexes on each of the other attributes of product.
The store, time, and promotion dimension tables are very small. Since these small tables will often be cached or, even if uncached, will only require one or two I/O operations to scan the entire table, it is unnecessary to index the tables.
How it Works: An Example
Oracle8 processes a star query using two basic phases. The first phase retrieves exactly the necessary rows from the fact table. Because this retrieval utilizes bitmap indexes, it is very efficient.
The second phase joins this "result set" from the fact table to the dimension tables. Below is an example of how an end-user may query this data warehouse:
"What were the sales and profits for the grocery department of stores in the west and southwest sales districts over the last three quarters?"
This is a simple star query. The SQL generated by an end-user tool could look like:
As described above, Oracle8 will process this query in two phases. In the first phase, Oracle8 will use the bitmap indexes on the foreign-key columns of the fact table to identify and retrieve the only the necessary rows from the fact table.
That is, Oracle8 will retrieve the rows from the fact table using essentially the following query:
Oracle8s new patent-pending algorithm is often called a "star transformation," since the original star query has been transformed into this subquery representation.
This method of accessing the fact table leverages the strengths of Oracles bitmap indexes. Intuitively, bitmap indexes provide a set-based processing scheme within a relational database. Oracle has implemented very fast methods for doing set operations such as AND (an intersection in standard set-based terminology), OR (a set-based union), MINUS, and COUNT.
In this star query, a bitmap index on store_key is used to identify the set of all rows in the fact table corresponding to sales in the West sales district. This set is represented as a bitmap (a string of 1's and 0's that indicates which rows of the fact table are members of the set).
A similar bitmap is retrieved for the fact-table rows corresponding to the sale in the Southwest sales district. The bitmap OR operation is used to combine this set of Southwest sales with the set of West sales.
Additional set operations will be done for the time dimension and the product dimension. At this point in the star query processing, there are three bitmaps: each bitmap corresponds to a separate dimension table, and each bitmap represents the set of rows of the fact table that satisfy that individual dimensions constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND operation. This final bitmap represents the set of rows in the fact table that satisfy all of the constraints on the dimension table; this is the "answer set," the exact set of rows from the fact table needed to evaluate the query. Note that none of the actual data in the fact table has been accessed; all of these operations rely solely on the bitmap indexes and the dimension tables. Because of the bitmap indexes patented, compressed data representations, the bitmap set-based operations are extremely efficient.
Once the answer set is identified, the bitmap is used to access the actual data from the sales table. Only those rows that are required for the end-users query are retrieved from the fact table.
The second phase of this query is to join these rows from the fact table to the dimension tables. Oracle8 will use the most efficient method for accessing and joining the dimension tables. Many dimension are very small, and table scans are typically the most efficient access method for these dimension tables. For large dimension tables, table scans may not be the most efficient access method. In the example above, a bitmap index on product.department may be used to quickly identify all of those products in the grocery department. Oracle8s cost-based optimizer will automatically determine which access method is most appropriate for a given dimension table, based upon the cost-based optimizers knowledge about the sizes and data distributions of each dimension table.
The specific join method (as well as indexing method) for each dimension tables will likewise be intelligently determined by the cost-based optimizer. Hash joins, a sophisticated join method added to Oracle in Release 7.3, is often the most efficient algorithm for joining the dimension tables. The final answer is returned to the user once all of the dimension tables have been joined. The query technique of retrieving only the matching rows from one table and then joining to another table is commonly known as a "semi-join."
One particularly notable trait of the example query is that not every dimension table is referenced. The promotion dimension is entirely omitted. In Oracle7, as well as other database products, this star query would be executed using a multi-column index on the fact table. The typical multi-column index would contain all four dimension columns. For this particular query, the multi-column index would suboptimal, since it contains an entirely extraneous column, promotion_key. The data warehouse administrator could, of course, create a different multi-column index that only contains the fact-table columns store_key, product_key, and time_key. However, space considerations would prevent an administrator from building all possible combinations of multi-column indexes. Thus, there will always be some star queries that perform poorly in a multi-column approach, because these queries lack the optimal index.
With the Oracle8 approach, star queries always use the most appropriate indexes. Since the indexes on the fact table are single-column bitmap indexes, the optimizer will only use indexes for those columns that are specified in the query. No extraneous indexed columns will be used in the Oracle8 star query algorithm.
Another interesting characteristic of this query is that no columns are selected from the product table. That is, we are using the product table to restrict sales to the Grocery department, but none of the columns from the product table appear in the SELECT clause of the SQL query. In this case, Oracles cost-based optimizer will recognize that it is unnecessary to join the product table. Oracle applies the constraint "Product.department = Grocery" by using bitmap index on the product_key column of the sales table. However, Oracle will not join the product table in the second phase of the query, since no more data needs to be retrieved from the product table.
Even this simple example query demonstrates the robustness of Oracle8s star query algorithm. This example illustrated how Oracle efficiently handles cases where not all dimensions are accessed and cases where not all dimensions are selected.
Complex Star Queries
The Oracle8 algorithm can handle other complex cases, such as an "inside-out" star query. Many star queries are "outside-in," that is, the end user places constraints on the dimension tables (the outside of the star schema) and retrieves data from the fact table. The above example, in which the end user was looking for the total sales for a given department, time period, and region, is a typical "outside-in" query. However, an end user may also want to do an "inside-out" query. For example, suppose that an analyst wanted to know all of the products that were sold to more than 10 customers on a single day in a given promotion. Then, the analysts query could look like:
The first phase of processing this query is to identify all of the relevant rows of the sales table. This phase will not only take into account the constraints on the time_key and the promotion_key, but will also consider the constraint on the customer_count column. Morever, the data warehouse administrator could choose to create a bitmap index on the customer_count column, and Oracle would utilize this bitmap index in conjunction with the bitmap indexes on time_key and promotion_key. This allows for extremely efficient access to the fact table for "inside-out" queries.
Complex Star Schemas
Oracle8s algorithm also supports different varieties of star schema designs. For example, some implementors may have a collapsed dimension table. That is, a logical dimension is stored as one or two columns in the fact table, rather than as a separate dimension table. In the above schema, the administrator could choose not to include a time dimension table. Instead, all queries could reference the time_key columns of the Sales table, rather than the columns of the Time dimension table. The Oracle8 algorithm would perform equally well without an explicit time dimension table, since all of the date constraints are applied to the time_key column of the sales table rather than the time dimension table, and a bitmap index on the time_key column of the fact table could continue to be utilized.
The Oracle8 algorithm can also efficiently handle snowflaked dimensions. Suppose that the product dimension was no longer a single table, but two tables:
Simple snowflake schema
The example query on this modified schema would need to be modified to include both the product and the product_department tables:
Oracle will recognize this snowflake schema, and it will treat each snowflaked dimension as a single logical dimension. That is, when accessing the sales table, Oracle will effectively be doing:
Oracles sophisticated query transformation is designed to accommodate a snowflake schema by identifying all of the tables that constitute a single logical dimension.
Intelligent Query Optimizer
For every query, Oracles cost-based optimizer will carefully choose the most
When the cost-based optimizer determines that the star-query algorithm is the best approach for a given query, the optimizer must still intelligently choose the best way to apply the star query algorithm. As mentioned previously, the optimizer will choose the most efficient methods for accessing and joining each dimension table. Additionally, the optimizer will also weigh the relative benefits of including each dimension table in the first phase of query processing. If a given dimension table does not sufficiently constrain the fact table, then the optimizer may determine that it is more efficient to defer that dimension table's access until the second phase of the algorithm. Unlike a join-index approach, the Oracle8 star query algorithm is flexible so that every component of the star query algorithm can be tuned, and the cost-based optimizer intelligently and transparently applies these additional optimizations.
The cost-based optimizer has been considerably enhanced within Oracle8. In Oracle7, the cost-based optimizer analyzes a star-query by considering all of the possible orders for joining the dimension tables together; this characteristic limits the Oracle7 optimizer's effectiveness for star queries with large numbers of dimension tables. The Oracle8 optimizer does not depend upon generating many possible join orderings. Since the Oracle8 star-query algorithm joins all of the dimension tables to the fact table in a single phase (the first phase of the star query algorithm), the optimizer does not need to consider all possible join-orderings; the optimizer only needs to consider the set of dimension tables that are associated with a given fact table. This approach simplifies the optimizer's algorithm for generating star query strategies and consequentially allows that the optimizer to generate more efficient query strategies.
The cost-based optimizer will also recognize star queries on schema that contain multiple fact tables. An example of a multiple-fact-table schema is a system used to track both inventory and sales. There might be separate fact tables for inventory shipments and for sales, although these two fact table may shared several dimension table (such as product). In such a system, it would not be uncommon for a single query to examine multiple fact tables. For these queries, the optimizer be analyzing mulitple "magic sets" (where each "magic set" consists of a fact table and its associated dimension tables), and will choose to apply the Oracle8 star query optimization to such a set in the query.
Oracles implementation of cost-based optimizer has been a key contribution to its query-processing improvements. Not only does the cost-based optimizer vastly improve query performance for the complex queries typically found in data warehouses, but the cost-based optimizer is an important component of Oracles extensible infrastructure. New query techniques will continue to be added quickly and seemlessly, since the cost-based optimizer provides a framework for appropriately applying each new technique.
All aspects of this Oracle8 star query algorithm can be parallelized. This includes the
Any star query in Oracle8 can be executed in parallel (this parallelization has been implemented on both partitioned tables and non-partitioned tables). As with all parallelized queries within Oracle, the query optimizer dynamically determines the appropriate degree of parallelism.
BENEFITS OF NEW STAR QUERY ALGORITHM
The Oracle8 star query algorithm provides:
The Oracle8 star-query algorithm represents the next generation in star query
processing. The Oracle8 algorithm is truly advanced technology. While join-index solutions
for star queries were implemented in commercial products over five years ago, the concept
of applying a semi-join to star queries was only
"Star Query Processing in the Oracle7 Server," Oracle White Paper, March 1996.
"Bitmap Indexes in the Oracle Server," Oracle White Paper, October 1995.
"Oracle8 Server Tuning," Oracle documentation, June 1997.
"Oracle8 Server Concepts," Oracle documentation, June 1997.
Kimball, Ralph, The Data Warehouse Toolkit, John Wiley & Sons, 1996
O'Neil, P. E., Graefe, G., "Multiple-table joins through bitmapped join indexes," SIGMOD Record, Vol. 24, No. 3, September 1995.
Copyright Á Oracle Corporation 1997
This document is provided for informational purposes only, and the information herein is subject to change without notice. Please report any errors herein to Oracle Corporation. Oracle Corporation does not provide any warranties covering and specifically disclaims any liability in connection with this document.
Oracle is a registered trademark and Enabling the Information Age, PL/SQL, Oracle 8, and Oracle7 are trademarks of Oracle Corporation.
All other company and product names mentioned are used for identification purposes only and may be trademarks of their respective owners.