You are here: Home > Knowledge Refreshers

KR editions 107 to 111


  • KR-107 (INDEX - I)

    We left of our discussion in EXPLAIN and matching, non-matching indexed scans...

    Before continuing with that we'll have to clear up some basics of indexes and the index structure (this is a modified version of KR-1):
  • An index is a sorted (ordered) set of pointers (pointing to the actual rows of the table). 
  • It is a separate object (it is not part of the table data). When we refer to data pages we refer to the physically stored tables. 
  • For any query we write to access data, DB2 (the optimizer of DB2) decides on how to access the data (thro a table scan/ index scan or a combination of both).
  • We can’t tell DB2 to choose a particular path but we can try to write our queries such that we use the index of the table (so that index scan is possible).
  • In index scan DB2 will read the index pages (which would contain fewer number of columns compared to the table and which would thus be smaller in size), retrieve information about where the corresponding row is present (i.e. which data pages it is present in) and retrieve those rows (this'll be clear when we see the structure of indexes).
  • Table scan means reading thro the data pages of the table (ex: if we have 20 rows in our table then DB2 has to read thro each and every row – this is fast for small tables). Here DB2 won't access the data through the index- it directly goes to the data pages.
  • The DB2 optimizer chooses the access path (i.e. how to access the data) based on the indexes present on that table.
  • Why can’t we have lots of indexes? Then we’ll have a lot of overhead – we’ll need additional space to store the new index, whenever we insert or delete rows in the table the index has to be updated (if you have lots of indexes then DB2 has to do a lot of background work), slower loading of tables and degradation in performance).

 

In the next edition we'll take a look at the index structure and how it helps access table data.




KR-108 (INDEX - II)

Let's take a look at what exactly an index contains (remember that it is a separate object - it's not part of the table). Let's say we have a table containing telephone number, customer name and payment amount. Assume that we've created an index on the telephone number. The telephone number is a unique field (there's an advantage of choosing a unique field on an index- explained later).

 

  • We would create a table and tell DB2 to create an index on the column telephone number. DB2 will automatically create the structure you see on the left side of the image.
  • An index is sorted (which is why the telephone number is stored in a sorted order in the index structure). But when we add records to the table we won't be entering data in a sorted manner (first we may enter tel. no. 100, then 200 and then 150). Each time we create a record, DB2 will update the index associated with the table as well. So when we insert a record with 150 as telephone number, DB2 will have to create a new position for this record in its index structure (in the correct place i.e. between 100 and 200 and also store the pointer to the row for 150).

 

Index scan and table scan:

  • An index is most effective if it matches the way you access the data. In our example, if we use a query to retrieve a row based on a telephone number it would be really fast. Why? It usually takes less time to search for a particular item in a sorted list than in an unsorted list (our index is sorted). But if you want to search for something within a very few items, then a linear search (i.e. checking the value of each and every element) might be faster compared to sorted searching.
  • In DB2 this linear searching is similar to a table scan (cycle thro all the records till you find a match). The sorted searching is equivalent to an index scan. The DB2 optimizer will decide which route is better based on the predicates we've provided in our SQL.

What's the advantage of having an index on a unique column?

  • Let's say that telephone numbers are not unique and let's assume that we have hundred records with the telephone number as 150 (and assume that the index is still on telephone number only). Now, in our index structure we'll have 100 rows (for the tel. no. 150) located between tel. nos. 100 and 200. If we had a query: SELECT * FROM CUST_TABLE WHERE TEL_NO=150 AND CUST_NAME='HARRY'. If DB2 uses our index structure it may be able to jump on a record with telephone number as 150 but it won't know whether the record is the one with cust_name as harry. To verify the cust_name it has to access the table. The problem is that it may need to access 100 rows (the worst case scenario) because in the index structure we don't have any sorting within the same telephone number. DB2 has no idea of knowing whether the first record of tel. no. 150 is harry or whether the last record in tel. no. 150 is harry. 
  • This gives us the answer to another question on indexes and performance...if you have many indexes, DB2 will need to update lots of indexes with the inserted record (and it will have to keep them sorted)- this adds to its overhead burden.
    Well, that's all for this edition and this week (we’ll delve into details about specifics of index structures in DB2 in the next edition)...





    KR-109 (INDEX - III)


    Basically, an index is similar to a table (the only difference being that here we’ll have 2 columns – one for the key and the other for the record location/ record ID).

  • DB2 doesn’t put the entire index in one place; instead it splits the index up into smaller parts (this would again help in faster search through the index and improve on efficiency). 
  • The index might be broken down as shown in the figure:
  • Every index will have a root page (this is level 1 of the index). The root page will contain links to the next level of pages. These pages will in turn point to another set of pages.
  • Leaf pages (the last level) are the pages where the key and record ID is stored. 
    Non-leaf pages are just another set of pointers within the index pages.
  • Each level below the root page is given a level number. In our figure we have 3 levels. 
  • What is the highest key mentioned in the figure? Suppose our key field in the table had values from 1 to 1000 (and if we had our index structure similar to that in the figure); then "highest key of page A" might be 500 while "highest key of page B" will be 1000. (It just means that key values between 1 and 500 are in Page A and the rest are in B). Similar logic applies to the non-leaf pages A and B (the idea is to break a large index into smaller pieces).
    The process would be: DB2 will first look at the root page; then it will try to find out where the given key is present in the Non leaf page A or B (using the highest key value present in the root page); then from the non-leaf page it would jump into the leaf page and retrieve the record ID (with this it can access the table data). Using the record ID DB2 can access the record in the table and retrieve the values specified in our SQL. 

Can an index have just 1 level? Yes. An index (depending on size) could have a root page which will contain the key and record IDs.

 


 

KR-110 (INDEX - IV)

Continuing with the index structure:

  • There are 2 types of index page structures available in DB2. Uptil version 4.1 DB2 only had TYPE-1 index. But from 4.1 onwards they introduced a TYPE-II index.
  • The main difference is that in TYPE-I row-level locking is not possible. If you lock a row in the tablespace, you’ll have to lock at least one entire index page.
  • In short, row-level locking wasn’t possible (which meant that you had more chances of timeouts and suspensions when there were simultaneous queries on the table). In TYPE-II indexes, row-level locking is possible (thus concurrency is increased, timeouts are decreased). 

We saw how indexes are stored, but what about tables themselves?

  • Each DB2 table is stored in a tablespace. A tablespace is a physical structure which usually consists of VSAM data sets.
  • All tablespaces are divided into pages (and each page will contain the rows of the table).
  • There are 3 types of tablespaces which one can create: 
    Segmented TS 
    Partitioned TS 
    Simple TS 
  • Segmented TS are divided into segments which are further divided into pages.
  • Partitioned TS are divided into partitions which are further divided into pages.
  • In a segmented TS every segment will contain rows pertaining to a single table (but 2 segments can have rows from 2 tables).

In partitioned TS all the partitions will contain rows of a single table



KR-111 (INDEX - V)

This is edition number NELSON (111)! A look at the mysterious optimizer (KR-5 reloaded)...


What is the optimizer in DB2?

  • DB2 uses “physical data independence” i.e. as programmers we needn’t worry about how data is being stored physically. We only need to worry about what data we want to access and not about how we are going to get it (for example: we don't need to know how to read through the vsam files where data is stored etc).
  • The optimizer is the heart of DB2. It determines how to access data by parsing through our SQL to find out what columns and tables it has to access. Then it checks up the DB2 catalog for more information on those columns and tables. As we have already seen there are index pages and there are table pages- it has to decide what it will use to satisfy our SQL.

    So, what does the DB2 catalog have? It is just like our normal shopping catalogs (except that here it is a bunch of tables) and it contains all the details about the tables in the database. It contains:
  • Current status of the table (number of rows, pages etc.)
  • Current status of the tablespace
  • Current status of the index (info about number of leaf pages, levels, whether there is any usable index for this query etc.)
  • Column information and many more things


Why is the optimizer a mystery? 
IBM has kept the working algorithm of the optimizer a secret. But there does exist some info about what it does backstage and how it works. 
An optimizer is cost-based. This means that the optimizer will try to minimize the cost. Basically, when we write an SQL there are bound to be many ways of accessing the data. The optimizer works out the cost for the different possibilities and chooses the path which it evaluates as being the least costly.



 

Go back to the main contents page