Demand the impossible: rigorous database benchmarking

29 Dec 2023

0. Table of contents

Introduction

Common vs particular
Known vs unknown
Database model
Smooth transition away from hand waving

PostgreSQL

Misconfiguration
Choose the right scale
Time dimension
It’s not the database
Load generator

Statistics

Even more smooth transition to statistics
A century old problem
Big problem in CS
How many runs do we need?
Randomized testing
Time average or ensemble average?

Summary
References

1. Introduction

Everyone knows benchmarking is hard (and writing about benchmarking is twice as hard), but have you ever asked “why”? What makes benchmarking, and performance evaluation in general, so error-prone, so complicated to get right and so easy to screw up? These are not new questions, but there seems to be no definitive answer, and for databases things are even grimmer. Still, we can speculate, hoping that our speculation helps us learn something along the way. I don’t think you will read anything new below; in fact, many of the things I’m going to talk about are rather obvious. But the process of bringing everything together and thinking about the topic is valuable in itself.

There are at least a few reasons why it’s so easy to fail when trying to understand the performance of a database system, and they usually have something to do with an inherent duality:

It’s necessary to combine domain-specific expertise with general analytical expertise.
One has to take into account both known and unknown factors.
Establishing a comprehensive mental model of a database is surprisingly hard and could be counter-intuitive at times.

Running fast and slow: experiments with BPF programs performance

30 Dec 2022

1. Introduction

My own personal blind spot regarding the BPF subsystem in the Linux kernel has always been program performance and overall introspection. Or, to put it more specifically, I wasn’t sure whether there is any difference between how we reason about the performance of an abstract program and that of a BPF program. Could we use the same techniques and approaches?

You may wonder why even bother when BPF programs are so small and fast. Generally speaking you would be right, but there are cases when BPF programs are no longer small and are placed on the hot execution path, e.g. in a security system monitoring syscalls. In such situations even a small overhead is multiplied and accumulates drastically, so it makes sense to fully understand the system’s performance to avoid nasty surprises.

It seems many other people would also like to know more about this topic, so I want to share the results of my investigation.

Introduction
Current state of things

BPF Instruction Set
Batching of map operations
Bloom filter map
Task local storage
BPF program pack allocator
BPF 2 BPF

How to analyze BPF performance?

Talking to the compiler
Aggregated counters
Manual instrumentation
Top-down approach
Profiling of BPF programs

Modeling of BPF programs

PSquare: practical quantiles

04 Oct 2021

0. Motivation

Recently I’ve started to notice an interesting pattern. When you take something you thought was simple and look deep inside it with a magnifying glass, it usually opens up a whole new world of fascinating discoveries. It could be one of the principles of the universe, or just me overreacting to simple things. In any case, I wanted to share one such example in this blog post, and I hope it will bring the same joy to the readers as it did to me. Let’s talk about quantiles!

How many engineers does it take to make subscripting work?

03 Mar 2021

Are you tired of this syntax in PostgreSQL?

SELECT jsonb_column->'key' FROM table;
UPDATE table SET jsonb_column =
            jsonb_set(jsonb_column, '{"key"}', '"value"');

The select part is actually fine. But for updates, especially complex ones, it can be pretty verbose and far from ergonomic. What would you say to this syntax instead?

SELECT jsonb_column['key'] FROM table;
UPDATE table SET jsonb_column['key'] = '"value"';
