Indexing data in DBMS
Indexing in the database can be similar to an index in front of a book that shows the title and its page number, DB stores these in B-Tree or LSM Tree structures to search through them as fast as possible.
Indexing improves the data lookup speed but not always. An index is the fastest when it requires lesser random disk access. We know that the data is stored on disk and therefore, Fragmentation of data is more important in performance. If the required data is scattered in different sections of a disk, it will be slower.
Just like in a library if the needed books are in 5 different sections, even if we have an index, it is slower to walk around.
I’d classify Indexing into two categories.
1. Clustered Indexing
Arrangement of each row data in disk, in an intelligent way based on a primary key order (It’s basically like an ordered file).
- To avail of this, the DB engine should be able to understand the ordering of primary keys.
- Index oriented table in Oracle is an example of such primary indexing.
- Using UUID or some unordered primary key can affect this indexing performance.
2. Map Indexing
A Database index with a mapping for a key to the disk location of its corresponding row data.
- Mostly B-tree, LSM tree structures are used to store this information since it’s faster to search.
- Frequently used related data can also be added to this index.
It is common to index the primary key by default (as a map index). And the indexes we create manually on databases are Map Indexes.
There are some interesting interactions related to the structure of these indexes.
- When B-tree is used to index, it performs some self-balancing upon adding more data.
- This results in an update on the disk sector.
- SSDs have shell life on sectors and these updates affect the life of an SSD.
- LSM trees don’t have such an update mode, so it works well on high insert volume.
To understand how the Indexing speeds up the process, we have to understand some types of scans DB does (scanning means to lookup data from disk)
Types of Scan
1. Index only scan
Every value needed by the query is already available on the index itself, So only the index is scanned.
For Example, In a book we can just look up the title of a chapter from the index.. we don’t need to jump to the page number
In a Database, Index created with CREATE INDEX index_name_idx ON table_name(col) include(col2);
and queried with SELECT col2 FROM table_name WHERE col = {val}
2. Index scan
A primary identifier for some tuple will be available in the Index and some further information needs to be fetched from another location corresponding to the identifier.
For example, In a book we look for the title and its corresponding page number, we have to jump to the page to see the 1st paragraph of the topic
In a Database, Index created with CREATE INDEX index_name_idx ON table_name(id);
and queried with SELECT col2 FROM table WHERE id = {val}
.
3. Sequential scan
Sequentially scan all pointers of all pages of the table. An index is generally not involved in this operation…
For example, In a book, we simply decide to ignore the Index and just turn each page one by one… And It’s useful if we are going to read everything in the right?
In a Database, This is the way it works when an index is not available or not relevant when doing Operations like SELECT *
4. Bitmap Scan
Scan index page once and generates a bitmap for the heap and then fetch data from the heap.
For example, In a book, we need to read chapters 4,9 and 14 so we just take another paper and write down the page numbers for these chapters and then jump to the page after looking at the paper we wrote. That piece of paper can be considered as a bitmap.
Please note that Bitmap is generated when working with a large number of rows, Index scan is used when the number of rows that satisfy the condition is lower.
In a Database, Index created with CREATE INDEX index_name_idx ON table_name(col);
(considering col is a non key value) and queried with SELECT col2 FROM table WHERE col = {val}
often results in bitmap scans when total number of rows where col = val is higher.
- Index only is the fastest, But we can’t really put all the data in the index, since it is expensive and will get inefficient ( consider a book with all its “content” stored at “index” Is it an index anymore ?)
- Index scan will be used when there is a less number of rows to work with that satisfies our query condition, If the number of rows is higher A bitmap is generated and scanned.
- The database is smart enough to estimate the cost for each kind of scan and decide what scan to use. In Postgres, this is done by a dedicated planner. We can inspect the plan with
EXPLAIN
before a query.