KRs 1-4: Indexes in general and DB2 in particular

You are here: Home > Knowledge Refreshers

KRC++

KR-1 (the very first KR): Index part I

Since we work a lot on DB2 tables just felt it would be good to refresh our knowledge on indexes….

An index is a sorted (ordered) set of pointers (pointing to the actual rows of the table).
It is a separate object (it is not part of the table data).
For any query we write to access data, DB2 (the optimizer of DB2) decides on how to access the data (thro a table scan/ index scan or a combination of both). We can’t tell DB2 to choose a particular path but we can try to write our queries such that we use the index of the table (so that index scan is possible).
Table scan means reading thro the data pages of the table directly (ex: if you have 20 rows then you have read thro each and every row – this is fast for small tables).
Optimizer chooses the access path (i.e. how to access the data) based on the indexes present on that table.
Why can’t we have lots of indexes? Then we’ll have a lot of overhead – we’ll need additional space to store the new index, whenever we insert or delete rows in the table the index has to be updated (if you have lots of indexes then DB2 has to do a lot of work), slower loading of tables and degradation in performance).
More on indexes tomorrow.

KR-2: Indexes part II

In this edition we’ll continue with the discussion on indexes...

Let's take a look at what exactly an index contains (remember that it is a separate object - it's not part of the table).

Assume that we've created an index on the telephone number. The telephone number is a unique field (there's an advantage of choosing a unique field on an index- explained later).

What we actually do is we create a table and tell DB2 to create an index on the column telephone number. DB2 will automatically create the structure you see on the left side of the image.

An index is sorted (which is why the telephone number is stored in a sorted order in the index structure). When we add records to the table we won't be entering data in a sorted manner (first we may enter tel. no. 100, then 200 and then 150).
Each time we create a record, DB2 will update the index associated with the table as well. So when we insert a record with 150 as telephone number, DB2 will have to create a new position for this record in its index structure (in the correct place i.e. between 100 and 200 and also store the pointer to the row for 150).
An index is most effective if it matches the way you access the data. In our example, if we use a query to retrieve a row based on a telephone number it would be really fast. Why? It usually takes less time to search for a particular item in a sorted list than in an unsorted list (our index is sorted). But if you want to search for something within a very few items, then a linear search (i.e. checking the value of each and every element) might be faster compared to sorted searching.
In DB2 this linear searching is similar to a table scan (cycle thro all the records till you find a match). The sorted searching is equivalent to an index scan. The DB2 optimizer will decide which route is better.

What's the advantage of having an index on a unique column?

Let's say that telephone numbers are not unique and let's assume that we have hundred records with the telephone number as 150 (and lets assume that the index is still on telephone number only). Now, in our index structure we'll have 100 rows (for the tel. no. 150) located between tel. nos. 100 and 200. Let's say we write a query : SELECT * FROM CUST_TEL WHERE TEL_NO=150 AND CUST_NAME='GLENN'. If DB2 uses our index structure it may be able to jump on a record with telephone number as 150 but it won't know whether the record is the one with cust_name as glenn. To verify the cust_name it has to access the table. The problem is that it may need to access 100 rows (the worst case scenario) because in the index strucutre we don't have any sorting within the same telephone number. DB2 has no idea of knowing whether the first record of tel. no. 150 is glenn or whether the last record in tel. no. 150 is glenn.

This gives us the answer to another question on indexes and performance...if you have many indexes, DB2 will need to update lots of indexes with the inserted record (and it will have to keep them sorted)- this adds to overhead processing.

Well, that's all for this edition (I’ll delve into details about specifics of index structures in DB2 in the next edition).

KR-3: Indexes Part III

We discussed the general structure of an index. Now we’ll take a look at how DB2 really maintains an index (this part may not be directly useful for programming purposes but it’ll help in understanding some concepts of DB2).

Basically, an index is similar to a table (the only difference being that here we’ll have 2 columns – one for the key and the other for the record location/ record ID).
DB2 doesn’t put the entire index in one place; instead it splits the index up into smaller parts (this would again help in faster searching through the index and improve on efficiency).
Every index will have a root page (this is level 1 of the index). The root page will contain links to the next level of pages. These pages will in turn point to another set of pages.
Leaf pages are the pages where the key and record ID is stored. Non-leaf pages are just another set of pointers within the index pages. Thus if a particular key value is present in Index Page 1, then DB2 will first look at the root page; then it will find that the given key is present in the Non leaf page A (using the highest key value present in the root page); then from the non-leaf page it would jump into the leaf page 1 and retrieve the record ID (with this it can access the table data).
Each level below the root page is given a level number. In our figure we have 3 levels.

Can an index have just 1 level? Yes. An index (depending on size) could have a root page which will contain the key and record IDs.
That’s all for this edition…..

KR 4: Continuing with the index structure

There are 2 types of index page structures available in DB2. Uptil version 4.1 DB2 only had TYPE-1 index. But from 4.1 onwards they’ve introduced a TYPE-II index.
The main difference is that in TYPE-I row-level locking is not possible. If you lock a row in the tablespace, you’ll have to lock at least one entire index page (i.e. the one which points to that particular row). In short, row-level locking wasn’t possible (which meant that you had more chances of timeouts and suspensions when there were simultaneous queries on the table).
In TYPE-II indexes, row-level locking is possible (thus concurrency is increased, timeouts are decreased).
(Future versions of DB2 will not support TYPE-I indexes).

We saw about how indexes are stored, but what about tables themselves?

Each DB2 table is stored in a tablespace. A tablespace is a physical structure which usually consists of VSAM data sets.

All tablespaces are divided into pages (and each page will contain the rows of the table).
There are 3 types of tablespaces which you can create:

Segmented TS
Partitioned TS
Simple TS

Segmented TS are divided into segments which are further divided into pages.
Partitioned TS are divided into partitions which are further divided into pages.

In a segmented TS every segment will contain rows pertaining to a single table (but 2 segments can have rows of 2 tables).
In partitioned TS all the partitions will contain rows of a single table.

In the next edition we’ll unravel the mystery of the optimizer (and subsequent to that we’ll discuss some tips to improve on SQL).

Go back to the main contents page