A recent question on StackOverflow brought up a topic that is not well covered in the official HANA documentation: the index structures used in HANA’s row-store.
The term CPB tree shows up in several locations in the HANA documentation (e.g. here) but there is no good explanation of what it means and what the HANA admin, developer, or user should do with it.
I answered that question but thought it might be worthwhile to keep a version of the answer on this blog, too.
So, here we go:
Where to begin here?
B-trees are very intensely studied and developed data structures, so pointing to a single document that explains all aspects relevant to this question and SAP HANA is a bit difficult.
Maybe it helps to unpack the term first:
This basically means, the B-tree index and leaf nodes do not contain the full strings for keys. Instead, the parts of the key-strings that are common among the keys (the prefixes) are stored separately. The leaf and index nodes then only contain
- the pointer to the prefix
- a kind of “delta” that contains the remaining key (this is where the partial key from the pkB-tree comes in)
- and a pointer to the data record (row id)
So, now we know that HANA uses compressed B-tree indexes (for row-store tables and for data that can be expressed as strings).
Why is this important for an in-memory database like HANA?
In short: memory transfer effort between RAM and CPU. The smaller the index structure, the more of it can fit into the CPU caches. To traverse (go through) the index, fewer back-and-forth movements of data have to be performed.
It’s a huge performance advantage.
This is complemented with specific “cache-conscious” index protocols (how the index structure is used by the HANA kernel) that try to minimize the RAM-CPU data transfers. The OLFIT protocol (optimistic, latch-free index traversal) is an important example of such protocols in HANA.
All this is an overly simplified explanation and I hope that it helps to make more sense nevertheless.
If you want to “dive deeper” and start reading academic papers around that topic then Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems by Prof. Sang K. Cha. et al.
This is the same Sang K. Cha that created P*Time, an in-memory (row-store) DBMS in the early 2000s.
This P*Time has been, rather well-known, acquired by SAP (like so many other DBMS software products companies… Sybase… MaxDB… OrientDB…) and the technology has been used as a research base for what would become SAP HANA.
Nowadays, there is only a small part of P*Time still in SAP HANA and it is mostly reduced to the concepts and algorithms and not so much expressed in actual P*Time code.
What does that mean for the HANA user?
All in all, for the user of HANA (developer, admin, data consumer) the specifics of this index implementation hardly matter as none of them can interact with the index structure directly.
What matters is that this index takes modern server systems (many cores, large CPU caches, lots of RAM) and extracts great performance from them, while still allowing for “high-speed” transactions.
Surprisingly enough, SAP chose to provide options in the CREATE|ALTER INDEX commands and declaration files (e.g, .hdbtable). Fortunately though, HANA picks the best fitting index type based on the column types to be indexed.
Further reading, links, and resources
- Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems, Sang K. Cha et al.
- What is Optimistic Latch-Free Index Traversal in SAP HANA Architecture?
- P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload, Sang K. Cha et al.
- Sang Kyun Cha publications
- Sang Kyun Cha homepage
- Making B+- trees cache conscious in main memory, Jun Rao et al.
There you go, now you know!