My thoughts and experiments.

© 2023. Dmitry Dolgov All rights reserved.

Evolution of tree data structures for indexing: more exciting than it sounds

0. How to read me?

I have to admit, my research blog posts are getting longer and longer. From one side I find it genuinely encouraging, because if one gets so much information just by scratching the topic, imagine what’s hidden beneath the surface! One university professor once said “what could be interesting in databases?”, and it turns out freaking a lot! On the other side it certainly poses problems for potential readers. To overcome them I would suggest an interesting approach: print this blog post out, or open it on your tablet/e-reader, where you can make notes with a pencil or markers. Now while reading it try to spot ideas particularly exciting for you and mark them. Along the way there would be definitely some obscure parts or questions, write them on the sides as well. You can experiment with the diagrams, changing or extending them, or just drawing funny faces. But do not read everything at once, have no fear of putting it aside for a while, and read in chunks that are convenient for you. Some parts could be skipped as the text is build out of relatively independent topics. The table of contents can help and guide you. Having said that we’re ready to embark on the journey.

1. Introduction

Whenever we speak about indexes, especially in PostgreSQL context, there is a lot to talk about: B-tree, Hash, GiST, SP-GiST, GIN, BRIN, RUM. But what if I tell you that even the first item in this list alone hiding astonishing number of interesting details and years of research? In this blog post I’ll try to prove this statement, and we will be concerned mostly with B-tree as a data structure.

Query optimizer and chess?

Do not be afraid, this short blog post is about databases and does not contain any unreasonable metaphysical references. In fact, it’s a result of a journey through couple of whitepapers and books with an unexpected intersection of two rather different fields. I will try to describe everything step by step, so that we can see if chess has anything in common with database query optimizer.

PostgreSQL at low level: stay curious!

0. How to read me?

Yes, I know, it’s a long text, and it was my conscious decision to write it in this way. But fear not! Imagine that you read a book, take a look at the introduction and first few interesting sections, think about it and then find time to read further. I hope I’ve left enough references, so if you don’t get some ideas you’ll be able to read more information about interesting parts. Or you can even skip some sections, since they are relatively independent. This table of contents will guide you:

1. Introduction

In mathematics it is too easy to concentrate very hard on one specific problem. If your methods are not good enough to handle the problem, you will eventually grind to a halt, baffled and defeated. Often the key to further progress is to stand back, forget about the special problem, and see if you can spot any general features of the surrounding area which might be of use.

Ian Stewart, Concepts of Modern Mathematics.

It’s not a secret that databases are damn complicated systems. And they tend to run on top of even more complicated stacks of software. Nowadays you will not surprise anyone (or at least not that much) by running your database on a Kubernetes cluster or inside a virtual machine. It’s probably still questionable whether it’s good and appropriate, but this approach is something we have to face — sometimes it’s at least convenient, sometimes it allows to be more resource efficient and sometimes it’s the only available infrastructure in a company.

Jsonb: few more stories about the performance

As such, there’s really no “standard” benchmark that will inform you about the best technology to use for your application. Only your requirements, your data, and your infrastructure can tell you what you need to know.

For already some time I can’t stop doing interesting/useful/weird (one at the time) benchmarks to reveal some details on how to apply document-oriented approach in the world of relational databases. Finally, I decided that I have a critical mass of those details to share in the form of blog post. So welcome to The Benchmark Club, where we’re going to discuss what it takes to create a fair performance comparison of different databases. As you may guess, the first rule of The Benchmark Club is to never share a reproducible benchmarks. But we identify ourselves as a badass engineers, so we’re going to break this rule today.