
How to convert your data to jsonb?

“How to start” is always a difficult question, and jsonb isn’t an exception. Here are a few notes about converting different types of data into jsonb, which someone may find useful.

Basically there are three possible cases of data conversion:

  • Convert data from inside PostgreSQL
  • Convert data from another database
  • Convert plain data from outside the database

From inside PostgreSQL

First of all we shouldn’t forget we can build data in jsonb format manually:

select '{"id": 1, "data": "aaa"}'::jsonb;
          jsonb           
--------------------------
 {"id": 1, "data": "aaa"}
select jsonb_build_object('id', 1, 'data', 'aaa');
    jsonb_build_object    
--------------------------
 {"id": 1, "data": "aaa"}

If we already have some relational data, we can easily perform a one-to-one conversion for both complex and simple data types:

select to_jsonb(timestamp '2016-06-05');
       to_jsonb        
-----------------------
 "2016-06-05T00:00:00"
select to_jsonb(ARRAY[1, 2, 3]);
 to_jsonb  
-----------
 [1, 2, 3]
select to_jsonb('id=>1, data=>"aaa"'::hstore);
          to_jsonb          
----------------------------
 {"id": "1", "data": "aaa"}

Don’t forget that jsonb is just valid textual json underneath, so all values will be converted to number, string, boolean or null.

And if we want to produce a really complex and well-structured jsonb document from a large amount of relational data, jsonb_agg is our friend. This function aggregates a recordset into a json array, where each record becomes an object of the form column_name: record_value:

select jsonb_agg(query) from (
    select id, data
    from jsonb_table
) query;
                      jsonb_agg                       
------------------------------------------------------
 [{"id": 1, "data": "aaa"}, {"id": 2, "data": "bbb"}]

From another database

Again, there are two options for importing data from another database:

  • Import it directly in json format as plain data (see the following section)
  • Import it as relational data and then convert it from inside PostgreSQL as in the previous section

And in any case you should create all the indices and make sure they’re correct. Let’s see a few examples:

MongoDB

We can easily create a json dump of a MongoDB database and then load it with minimal modifications:

$ mongoexport                       \
    --db database_name              \
    --collection collection_name    \
    --jsonArray                     \
    --out dump.json

But you should be aware of specific data types, since BSON isn’t 100% compatible with textual json. To be more precise, I’m talking about data_binary, data_date, data_timestamp, data_regex, data_oid and so on (see the documentation). E.g. when you create a dump of a collection with a data_date field, you’ll get something like this:

"created_at": {
    "$date": 1445510017229
}

and you may decide to move this value one level up or keep this structure.
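
If you decide to move it one level up, here is a minimal sketch of such a cleanup after the import, assuming (hypothetically) that the dump was loaded into a table mongo_import with a jsonb column doc; jsonb_set is available since PostgreSQL 9.5:

update mongo_import
set doc = jsonb_set(
    doc,
    '{created_at}',
    -- replace {"$date": <milliseconds>} with a plain timestamp value
    to_jsonb(to_timestamp((doc #>> '{created_at,$date}')::bigint / 1000.0))
)
where doc #> '{created_at,$date}' is not null;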

There is also another interesting option, which is related to ToroDB.

ToroDB is an open source project that turns your RDBMS into a MongoDB-compatible server, supporting the MongoDB query API and MongoDB’s replication, but storing your data into a reliable and trusted ACID database.

So it’s like NoSQL on top of an RDBMS. You can set up ToroDB as a hidden read-only replica of a MongoDB replica set. Then, when you’re ready, you can examine the ToroDB data structure and convert it into jsonb as in the previous section.

Speaking of indices: it’s possible to cover a good amount of queries using a GIN index on a jsonb column, but since it’s available only for a small set of operators, you should probably add separate indices for range queries.
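
A minimal sketch with hypothetical names (a table docs with a jsonb column body): the GIN index covers containment and existence operators (@>, ?, ?| and ?&), while a separate btree expression index handles range queries on a specific key.

create index docs_body_gin_idx on docs using gin (body);
-- for range queries like (body ->> 'price')::numeric < 100
create index docs_body_price_idx on docs (((body ->> 'price')::numeric));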

MySQL

The JSON data type format in MySQL is pretty close to PostgreSQL’s; we can even use mysqldump to convert one into the other:

$ mysqldump                                     \
    --compact                                   \
    --compatible=postgresql                     \
    database_name                               \
    table_name | sed -e 's/\\\"/"/g' > dump.sql

$ cat ./dump.sql
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE "table_name" (
  "data" json DEFAULT NULL
);
/*!40101 SET character_set_client = @saved_cs_client */;
INSERT INTO "table_name" VALUES ('{"aaa": 1, "bbb": 2}'),('{"aaa": 3, "bbb": 4}'),('{"aaa": 5, "bbb": 6}');
$ psql < dump.sql

Just be careful about double-quote escaping and that’s it. Note also that the dump creates a json column rather than jsonb, so we may want to convert it after the import.
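
A minimal sketch of that conversion, using the table and column names from the dump above:

alter table table_name
    alter column data type jsonb using data::jsonb;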

From plain data

And finally we have the option to import plain json data into PostgreSQL. But imagine a situation when we need to process not-so-well-formatted data. Since jsonb must strictly follow the json format, what can we do in that case?

It depends on how badly our document is broken. If the document structure is preserved, but there are some issues with formatting (single quotes instead of double quotes or no quotes at all, missing commas and so on), it’s possible to fix it using (oh my gosh) node.js, or more precisely the json5 extension and the corresponding library:

// format.js
//
// Read loosely formatted json documents (one per line) from stdin,
// parse them with JSON5 and write strict json to stdout.

var JSON5 = require('json5');
var stdin = process.stdin;
var stdout = process.stdout;

var inputChunks = [];

stdin.resume();
stdin.setEncoding('utf8');

stdin.on('data', function (chunk) {
    inputChunks.push(chunk);
});

stdin.on('end', function () {
    // join with an empty separator: the default separator is a comma,
    // which would corrupt the data
    var inputData = inputChunks.join('');

    var outputData = inputData
        .split('\n')
        .filter(function(line) {
            // skip empty lines
            return Boolean(line);
        })
        .map(function(line) {
            // JSON5 tolerates single quotes, unquoted keys, trailing commas etc.
            var parsed = JSON5.parse(line);
            return JSON.stringify(parsed);
        }).join('\n');

    stdout.write(outputData);
});
$ cat data.json | node format.js > data_formatted.json
=# COPY table_name(jsonb_column_name) from 'data_formatted.json';

But if the document structure is broken too, nothing can help; we need to fix it manually using one of the json linters.

The Knife

I really don’t know how I lived without pandoc before. It’s an amazing tool that saved me from the terrible pain of latex -> MS Word conversion. If you write many scientific documents in latex format and are faced with the demand to convert them into MS Word, you can understand me. It’s like a tiresome point-and-click game, which can consume an unbelievable amount of time.

Pandoc can do it, and I’m not kidding. All you need to do is create a source file in markdown format and copy all the latex formulas. In this post you can find several details.

What do we need to install pandoc? I’d advise stack:

$ stack install pandoc pandoc-citeproc pandoc-crossref

pandoc-citeproc and pandoc-crossref will be used to generate a nice bibliography and references.

Now we can do something like this:

$$\frac{\partial \vec{u}}{\partial t} + (\vec{u} \cdot \nabla) \vec{u} = - \frac{1}{\rho} \nabla p + \nabla \sigma + \vec{f}$$ {#eq:navier_stokes:motion}

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \vec{u}) = 0$$ {#eq:navier_stokes:continuity}

and get a nice result (e.g. in Word Online):

Here you can see the label eq:navier_stokes:motion for the corresponding equation. The general template for references is:

some_content {#type:label}

label will be used here [@type:label]

More documentation about types and usage is here. You can enable pandoc-crossref using the command line argument --filter:

$ pandoc --filter pandoc-crossref -o result.docx source.md 

Note that if we use a begin/end environment or “\\” for multiple equations, pandoc can’t generate an MS Word formula. Also, you shouldn’t forget about an empty line between equations.

To use a bibliography we can create a regular bib file and enable pandoc-citeproc like this:

$ pandoc --filter pandoc-crossref --bibliography report.bib -o result.docx source.md

The default citation style in Pandoc is Chicago author-date, but you can change it. E.g. if you’re writing an academic paper in Russian, I bet you would like to use the GOST standard. You can get a corresponding csl file here and use it like this:

$ pandoc --filter pandoc-crossref --bibliography report.bib --csl=gost.csl -o result.docx source.md

The last thing is the language. If you’re writing in Russian, you can suffer from encoding issues; to avoid them you need to specify the language:

$ pandoc --filter pandoc-crossref --bibliography report.bib --csl=gost.csl -o result.docx source.md -V lang:russian

That’s it, you will get a well-formed MS Word file with nice formulas.

Compare incomparable: PostgreSQL vs Mysql vs Mongodb

As such, there’s really no “standard” benchmark that will inform you about the best technology to use for your application. Only your requirements, your data, and your infrastructure can tell you what you need to know.

NoSql is everywhere and we can’t escape from it (although I can’t say we want to escape). Let’s leave the question of reasons outside this text, and just note one thing: this trend isn’t related only to new or existing NoSql solutions. It has another side, namely the schema-less data support in traditional relational databases. It’s amazing how many possibilities are hiding at the edge between the relational model and everything else. But of course there is a balance that you should find for your specific data. It can’t be easy, first of all because it requires comparing incomparable things, e.g. the performance of a NoSql solution and a traditional database. Here in this post I’ll make such an attempt and show a comparison of jsonb in PostgreSQL, json in Mysql and bson in Mongodb.

What the hell is going on here?

Breaking news:

  • PostgreSQL 9.4 - a new data type jsonb with slightly extended support in the upcoming release PostgreSQL 9.5
  • Mysql 5.7.7 - a new data type json

and several other examples (I’ll talk about them later). Of course these data types are supposed to be binary, which means great performance. Base functionality is equal across the implementations because it’s just obvious CRUD. And what is the oldest and almost primal desire in this situation? Right, performance benchmarks! PostgreSQL and Mysql were chosen because they have quite similar implementations of json support, Mongodb as a veteran of NoSql. The EnterpriseDB research is slightly outdated, but we can use it as a first step on the road of a thousand li. The final goal is not to display the performance in an artificial environment, but to give a neutral evaluation and to get feedback.

Some details and configurations

The pg_nosql_benchmark from EnterpriseDB suggests an obvious approach: first the required amount of records must be generated using different kinds of data and some random fluctuations. This data is saved into the database, and then we perform several kinds of queries over it. pg_nosql_benchmark doesn’t have any functionality for Mysql, so I had to implement it similarly to the PostgreSQL one. There is only one tricky thing with Mysql: it doesn’t support json indexing directly, so it’s required to create virtual columns and build indices on them.
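
A minimal sketch of that workaround with hypothetical names (a table json_tables with a json column data and queries filtering on the name key):

ALTER TABLE json_tables
    ADD COLUMN name VARCHAR(255)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(data, '$.name'))) VIRTUAL;

CREATE INDEX json_tables_name_idx ON json_tables (name);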

Speaking of details, there was one strange thing in pg_nosql_benchmark. I figured out that a few types of generated records were beyond the 4096 byte limit of the mongo shell, which means these records were just dropped. As a dirty hack we can perform the inserts from a js file (and btw, that file must be split into a series of chunks smaller than 2GB). Besides, there are some unnecessary time expenses related to the shell client, authentication and so on. To estimate and exclude them I had to perform a corresponding amount of “no-op” queries for all databases (but they’re actually pretty small).

After all modifications above I’ve performed measurements for the following cases:

  • PostgreSQL 9.5 beta1, gin
  • PostgreSQL 9.5 beta1, jsonb_path_ops
  • PostgreSQL 9.5 beta1, jsquery
  • Mysql 5.7.9
  • Mongodb 3.2.0, storage engine WiredTiger
  • Mongodb 3.2.0, storage engine MMAPv1

Each of them was tested on a separate m4.xlarge amazon instance with ubuntu 14.04 x64 and default configurations; all tests were performed with 1000000 records. And you shouldn’t forget about the installation instructions for jsquery: bison, flex, libpq-dev and postgresql-server-dev-9.5 must be installed. All results were saved in a json file, and we can visualize them easily using matplotlib (see here).

Besides that, there was a concern about durability. To take this into account I made a few specific configurations (imho some of them are realistic, but some are quite theoretical, because I don’t think anyone will use them for production systems):

  • Mongodb 3.2.0 journaled (writeConcern j: true)
  • Mongodb 3.2.0 fsync (transaction_sync=(enabled=true,method=fsync))
  • PostgreSQL 9.5 beta 1, no fsync (fsync=off)
  • Mysql 5.7.9, no fsync (innodb_flush_method=nosync)

Results

All charts are presented in seconds (if they relate to query execution time) or MB (if they relate to the size of a relation/index). Thus, for all charts the smaller value is the better one.

Select

Insert

Insert with configurations

Update

Update is another difference between my benchmarks and pg_nosql_benchmark. It can be seen that Mongodb is an obvious leader here, mostly, I guess, because of the PostgreSQL and Mysql restriction that to update one value you must override the entire field.
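
A minimal illustration of that restriction with hypothetical table, column and key names (not the literal benchmark query): even a targeted change like jsonb_set rewrites the whole jsonb value, while Mongodb’s $set targets a single field of the document.

update json_tables
set data = jsonb_set(data, '{name}', '"new name"')
where data @> '{"brand": "ACME"}';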

Update with configurations

As you can guess from the documentation and this answer, writeConcern j: true is the highest possible transaction durability level (on a single server), which should be equivalent to the fsync configuration. I’m not sure about durability, but fsync is definitely slower for update operations here.

Table/index size

I have a bad feeling about this

Performance measurement is a dangerous field, especially in this case. Everything described above is not a complete benchmark, it’s just a first step towards understanding the current situation. We’re now working on ycsb tests to make more thorough measurements, and if we get lucky we’ll compare the performance of cluster configurations too.

PgConf.Russia 2016

It looks like I’ll participate in PgConf.Russia this year, so if you’re interested in this subject - welcome.

Gentoo and Lenovo u430p - the sad story

Finally, I decided to replace my old laptop, and my choice fell on the Lenovo u430p. As I understand now, it was not a good idea in the case of Gentoo =) Actually, I was surprised how many nerves you can lose just adapting your hardware to your requirements. And here is a shortlist of what you shouldn’t forget if you want to do the same with less pain.

EFI loader

So you’ve successfully passed several steps from the Gentoo Handbook. One of the last is Grub2 installation and configuration. You’ve completed this, rebooted and… nothing happened, you see Windows 8 again.

The clue to this problem is the “Secure Boot” option, which is enabled by default in the BIOS.

iwlwifi

The next big disaster is the iwlwifi driver for the Intel Wireless 7260. Most wifi routers work in the mixed 11bgn mode, and 11n drives iwlwifi (and you along with it) mad: endless disconnections and terrible instability.

The only known solution is to cut out the 11n mode:

# /etc/modprobe.d/iwlwifi.conf

options iwlwifi 11n_disable=1

And you shouldn’t forget to compile iwlwifi as a kernel module (otherwise, obviously, this option will not be applied). The last step is the firmware installation. You can download iwlwifi-7260-9.ucode, place it in /lib/firmware, and configure the kernel to load this firmware:

Device Drivers ->
    Generic Driver Options ->
        [*] Include in-kernel firmware blobs in kernel binary
        (iwlwifi-7260-9.ucode) External firmware blobs to build into the kernel binary
        (/lib/firmware) Firmware blobs root directory

Btw, one more note: it looks like iwlwifi-7260-9.ucode works more stably than iwlwifi-7260-10.ucode.

Microphone

I don’t know why, but there was no working configuration for the microphone out of the box =) I installed alsa and pulseaudio (the latter for Skype, of course), and issues with audio capture were really unexpected for me.

Actually, I thought that everything was unmuted in alsamixer, but I was wrong:

$ amixer -c 1
...
Simple mixer control 'Capture',0 
Capabilities: cvolume cswitch 
Capture channels: Front Left - Front Right 
Limits: Capture 0 - 63 
Front Left: Capture 63 [100%] [30.00dB] [off] 
Front Right: Capture 63 [100%] [30.00dB] [off]
...

You can unmute the 'Capture',0 control with this command (-c 1 is the card number):

$ amixer -c 1 set Capture,0 80% 40% unmute cap

And btw, don’t forget about the web cam:

Device Drivers ->
    Multimedia support ->
        [*] Cameras/video grabbers support
        [*] Media USB Adapters ->
            <*> USB input event device support

Windows 8 loading

Ok, that’s good enough. But if you now try to boot Win8 from the grub2 menu, you’ll be surprised by the messages error: can't find command drivemap and error: invalid EFI file path.

To avoid this problem you should create a custom menu entry in grub.cfg with a chainloader for Win8:

#!/bin/sh
# /etc/grub.d/40_custom
exec tail -n +3 $0
# This file provides an easy way to add custom menu entries.  Simply type the
# menu entries you want to add after this comment.  Be careful not to change
# the 'exec tail' line above.
menuentry 'Windows 8 (UEFI loader)' --class windows --class os $menuentry_id_option 'osprober-chain-02E42074E4206BDB' {
	search --file --no-floppy --set=root /EFI/Microsoft/Boot/bootmgfw.efi
	chainloader (${root})/EFI/Microsoft/Boot/bootmgfw.efi
}

Don’t forget to update grub.cfg:

# grub2-mkconfig -o /boot/grub/grub.cfg

Touchpad

The default configuration for the touchpad is terrible…

You can improve it; just don’t forget that this model doesn’t have a hardware right mouse button (so you shouldn’t disable the software button). Here is my configuration:

# /etc/X11/xorg.conf.d/50-synaptics.conf

Section "InputClass"
    Identifier "touchpad"
    MatchIsTouchpad "on"
    MatchDevicePath "/dev/input/event*"
    Driver "synaptics"
 
    Option "TapButton1" "1"
    Option "TapButton2" "3"
    Option "TapButton3" "2"

    # accurate tap-to-click!
    Option "FingerLow" "50"
    Option "FingerHigh" "55"
 
    # prevents too many intentional clicks
    Option "PalmDetect" "0"
 
    # vertical and horizontal scrolling, use negative delta values for "natural" scrolling
    Option "VertTwoFingerScroll" "1"
    Option "VertScrollDelta" "75"
    Option "HorizTwoFingerScroll" "1"
    Option "HorizScrollDelta" "75"
 
    Option "MinSpeed" "1"
    Option "MaxSpeed" "2"
 
    Option "AccelerationProfile" "2"
    Option "ConstantDeceleration" "4"
EndSection

And one more note: if you’re using laptop-mode and a wireless keyboard from Logitech, it will probably be better to put this device into the blacklist to avoid annoying problems with an unexpectedly frozen keyboard:

$ lsusb
......
Bus 001 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
......
# /etc/laptop-mode/conf.d/runtime-pm.conf

AUTOSUSPEND_RUNTIME_DEVID_BLACKLIST="046d:c52b"

End

It was an interesting journey into the world of bugs, strange configurations and default options. I hope this shortlist can be useful, at least for me in the future =)