KRs 112-113: Optimizer costs

You are here: Home > Knowledge Refreshers

KRC++

KR editions 112 to 113

KR-112 (OPTIMIZER COSTS)

In the last edition we saw that the optimizer is cost-based. So, what are the costs involved in an SQL query? The following factors influence the overall cost calculated by the optimizer for an SQL statement:

CPU cost
I/O cost
The Catalog stats and the SQL statement

CPU cost is the time that would be required by the processor to process the information.

I/O (input/output) costs- if we could store everything in main memory then we wouldn’t have I/O costs. But main memory is costly and we can’t have GBs of main memory. So all our tablespaces and index pages would be stored in secondary storage devices (DASDs in mainframes and hard disks if it’s our PCs). Reading information from these devices is slow (slow compared to main memory) and we refer to this operation of reading or writing to secondary devices as I/O operation. The more data you have to read the more I/O you have to perform.

Generally we have something called a buffer (which is nearly as good as main memory) which helps reduce I/O (it is a temporary storage area; when we want to read one particular page from a secondary device the computer might pull out the next 3 pages as well and store it in the buffer – in case we later want to access any of those 3 pages then the computer needn’t go again to the I/O device to get the data. Of course, if we want to read some other page then it’ll have to do an I/O operation again).

Coming back to I/O costs; the formula to calculate this depends on the catalog statistics, table size, cost of work files (work files are temporary files needed for sorting data, for storing intermediate results etc). By applying all these factors the optimizer will arrive at a filter factor. The lower the filter factor, the lower is the cost and hence greater is the efficiency of the query (you could say that filter factor would roughly indicate the number of rows that DB2 will have to fetch. The lesser the better).

Generally if we use more predicates in our queries then the query will be more efficient (Predicates are the expressions we specify in the WHERE clause). Predicates help in filtering out rows and thus they help in reducing the amount of I/O operations needed to get data from the table.
A new term- Cardinality: Cardinality refers to the number of distinct values stored in a particular column. The cardinality of the index is usually stored in the DB2 catalog (and this is used in calculating the filter factor). Generally for an index column you’ll want the cardinality to be high (unique index columns are better than index columns with duplicates).
That’s all for this edition...suggestions welcome

KR-113 (Q&A + OPTIMIZER)

We had a couple of questions based on yesterday's KR;

Amit:

So, SQL optimizer would consider cost of I/O in determining whether a particular index will be used or not? Adding extra RAM - or making DASD access faster might change the way a querry is performed - is it true?

Improving the hardware efficiency will only reduce the time taken to physically retrieve data from the DASDs.
If the optimizer determines that 'n' number of pages have to be read, then the system has to access 'n' number of pages. So if hardware efficiency were improved the time taken to read these pages would be faster. The optimizer uses the catalog statistics for calculations (where we would have table size, number of rows, colums etc.) and it wouldn't know about the underlying hardware access.
So, the number of I/Os to be performed/ the amount of data that has to be retrieved as calculated by the optimizer will influence the I/O costs (it relies on the data from the DB2 catalog tables for this purpose). The efficiency of the hardware won't affect the optimizer.

Elan:

Cardinality just says number of distinct values..you told that it should be high ..Say if column AGE has values 2,2,3,3,4,4,5,5,6,6 Cardinality is 5 ..right? If it has 2,3,4,5,6 Cardinality is again 5.. So how do we relate cardinality with unique values in the column?

You're right; unique columns will be the same; the concept is just that you can use it with the total number of rows in the table...in your case:
1.) cardinality = 5 and number of rows = 10
2.) cardinality = 5 and number of rows = 5

(number of rows is also present in the catalog stats); so for an index you'd like it to be 5 and 5 than 5 and 10.....

If anyone has any other views/answers you can also share them with the group.

Continuing with our KR (KR-6 reloaded):

Access path- There are many options available to the optimizer to access table data and it has to choose the best one. Most importantly it should be able to decide when to use an index. Indexes are not used:

If no index exists on the table for the columns you want to access.
When optimizer feels that it can perform better without using the index.

In all these cases the optimizer will decide to do a sequential scan of the table (i.e. read through each and every page of the tablespace). But aren’t indexes created for improving performance? Why should the optimizer ignore them?

Ex: our tablespace might only have one page (i.e. our table is really small). To access this without using an index we’ll only need to perform 1 I/O. But if we use an index to search, we will need to perform at least one more I/O (because we have to fetch the index page from the storage device and then access the table data). If there are non-leaf pages then we’ll need even more I/Os.

That's all for this edition.....suggestions welcome,

Go back to the main contents page