You are here: Home > Knowledge Refreshers



KR 5: THE MYSTERIOUS OPTIMIZER


  • DB2 uses “physical data independence” i.e. as programmers we needn’t worry about how data is being stored physically. We only need to worry about what data we want to access and not about how we are going to get it. 
  • The optimizer is the heart of DB2. It determines how to access data by parsing through our SQL to find out what columns and tables it has to access. Then it checks up in the DB2 catalog for more information on those columns and tables. 

                                           
  • So, what does the DB2 catalog have? It is just like our normal shopping catalogs and it contains all the details you’ll need to know about the tables in the database. It contains: 

· Current status of the table (number of rows, pages etc.) 

· Current status of the tablespace 

· Current status of the index (info about number of leaf pages, levels, whether there is any usable index for this query etc.) 

· Column information 

  • Why is the optimizer a mystery? IBM has kept the working algorithm of the optimizer a secret. But there does exist some info about it (perhaps some of it leaked out!). 
  • Anyway, an optimizer is cost-based. This means that the optimizer will try to minimize the cost. Basically, when we write an SQL there are bound to be many ways of accessing the data. The optimizer works out the cost for the different possibilities and chooses the least cost. 

What does cost comprise of? - TO BE CONTINUED!!! 

(I don’t want to make knowledge refreshers too lengthy – it should be short and sweet). 



KR-6: OPTIMIZER CONT

 

In the last edition we saw that the optimizer is cost-based. So, what are the costs involved in an SQL query? 

The following factors influence the overall cost: 

  • CPU cost 
  • I/O cost 
  • Catalog stats 
  • SQL statement 

CPU cost is the time that would be needed by the processor to process the information. 

I/O (input/output) costs- if we could store everything in main memory then we wouldn’t have i/o costs. But main memory is costly and we can’t have GBs of main memory. So all our tablespaces and index pages would be stored in secondary storage devices (DASDs in mainframes and hard disks if it’s our PCs). Reading information from these devices is slow (slow compared to using main memory) and we refer to this operation as I/O operation. The more data you have to read the more I/O you have to perform. Generally we have something called a buffer (which is nearly as good as main memory) to help in reducing I/O (it is a temporary storage area - when we want to read one particular page from a secondary device the computer might pull out the next 3 pages as well and store it in the buffer – in case you need to later access any of those 3 pages then the computer needn’t go again to the I/O device to get the data. Of course, if you want to read some other page then it’ll have to do an I/O operation again). 

Coming back to I/O costs; the formula to calculate this depends on the catalog statistics, bufferpool size, cost of work files (work files are needed for sorting data, for storing intermediate results etc.). By applying all these factors the optimizer will arrive at a filter factor. The lower the filter factor, the lower is the cost and hence the greater is the efficiency of the query. (you could say that filter factor would roughly indicate the number of rows that DB2 will have to fetch. The lesser the better). 

· Generally if we use more predicates in our queries then the efficiency will be more. (Predicates are the expressions which we use in the WHERE clauses). Predicates would help in filtering out rows and thus it would help in reducing the amount of I/O operations needed to get the data from the table. 

· Cardinality: Cardinality refers to the number of distinct values stored in a particular column. The cardinality of the index is usually stored in the DB2 catalog (and this is used in calculating the filter factor). Generally for an index column you’ll want the cardinality to be high (i.e. unique index columns are better than duplicate index columns – we saw this in part-II). 

Access path- There are many options available to the optimizer to access table data and it has to choose the best one. Most importantly it should be able to decide when to use an index. Indexes are not used: 

  • If no index exists on the table for the columns you want to access. 
  • A type 2 index is needed (when you specify UR isolation) but only type 1 index is present. 
  • When optimizer feels that it can perform better without using the index. 

In all these cases the optimizer will decide to do a sequential scan of the table (i.e. read through each and every page of the tablespace). 

But aren’t indexes created for improving performance? Why should the optimizer ignore them? 

Ex: your tablespace might only have one page (i.e. your table is really small). To access this without using an index you’ll only need to perform 1 I/O. But if you use an index to search, you will need to perform at least one more I/O (because you have to fetch the index page from the storage device). If there are non-leaf pages then you’ll need even more I/Os.



KR-7: Tablescans + SQL


We’ve seen that in case the optimizer doesn’t use the index then it would have to go through a table scan. 

The following diagram depicts the basic steps involved in a TableScan: 

                                               


We learnt about DM, BM in training – hope you remember J …..anyway, these 3 form one major component of DB2. DM- data manager and BM is buffer manager. 
Does a tablescan have to go through all the pages of the tablespace? Can’t it skip some? 

There are 3 types of tablescans (depending on the type of tablespace used): 

Simple Tablespace Scan – scans all the pages of the tablespace because each page can have rows from multiple tables (a reason why if we use simple tablespaces we wouldn’t want to have multiple tables). 

Segmented Tablespace Scan – Read only the segment pages which contain rows of that particular table. 

Partitioned Tablespace Scan – in this case each partition stores only one table. So in this scan optimizer will have to read through the partition. 

A little SQL: 

The WHERE clause and HAVING clause are functionally similar. The difference is that WHERE operates on detail rows while HAVING operates on aggregated rows. An SQL having both of these clauses would have WHERE followed by GROUP BY and finally HAVING (where will select a bunch of rows, group by will aggregate the rows and having will filter out from the aggregated rows). 


KR-8 Indexed access


In this edition we’ll take a look at indexed-access (but before that a little about seq. prefetch). 

Sequential Pre-fetch: 

  • It’s the process by which DB2 fills up the buffer. The DM will ask the BM to do a sequential prefetch. This is done when the optimizer decides that it will need to do sequential processing. For example: in our tablescan, the optimizer knows that it’ll need to read all the pages and so it’ll ask for a sequential prefetch. The advantage of this is that in a single I/O operation DB2 will pull out many pages into the buffer (instead of doing repeated I/Os to get those pages). 
    Generally used in tables scans and also when scanning through index pages (leaf pages). 
  • So, now we know about table scanning (i.e. scanning without using the index). Next method is to use the index. 

Indexed-Access Methods: 

The SQL we use should satisfy 2 conditions: 

1.) At least one of the predicates should be indexable. Let’s say we have a column for curr_date which is indexed in the table. If we write an SQL as: 

select * from xxxx where curr_date = :ref-date + 10 

DB2 might not use the index. (Predicate is the expression that you use within the where clause and if we use arithmetic stuff within this then the index cannot be used – there are some other conditions also which we shall see later). 

2.) One of the columns must exist in an available index. 

The simplest form of indexed-access is the direct index lookup access. 

In this case if our table has an index with 3 columns then our SQL should also use the 3 columns in the predicate. In this case, DB2 would perform the following steps: 

1. Compare the value in the predicate to the values in the root (of the index page). 

2. Jump to the non-leaf pages (if any) 

3. Next locate the leaf pages containing the required index value. 

4. Jump to the leaf page and obtain the record ID. 

5. With the record ID get the row from the table. 

What if we use only 2 columns while our index has 3 columns? DB2 cannot go for a direct lookup. It would have to choose between 2 other types of index scans (matching index scan or non-matching index scan).


Go back to the main contents page