Stories by Benjamin Morel on Medium

To JOIN or not to JOIN

Benjamin Morel — Tue, 08 Jan 2019 23:08:15 GMT

The question crops up regularly on StackOverflow:

And many, many more. The question pretty much always boils down to: given a table A with a foreign key to table B, is it faster to perform a single query to load all A’s together with their B’s using a JOIN:

SELECT A.*, B.* FROM A JOIN B ON B.id = A.b_id;

Or to load the A’s alone, then query each B individually depending on the values returned by the first query:

SELECT * FROM A;
SELECT * FROM B WHERE id = …;
SELECT * FROM B WHERE id = …;
SELECT * FROM B WHERE id = …;
…

The latter approach is commonly referred to as the N+1 problem (we’ll see why below): you’re executing at most N + 1 queries, where N is the number of records returned from the first table.

Yet another way is to execute 2 queries, one to load the A’s, then another one to load all the B’s in a single query, depending on the result of the first query:

SELECT * FROM A;
SELECT * FROM B WHERE id IN(…, …, …);
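
The three strategies above can be sketched side by side. This is a minimal illustration, not the benchmark code: it uses Python with an in-memory SQLite database instead of PHP and MySQL, and the table contents and function names are made up for the example. Each function returns the loaded pairs together with the number of queries it issued.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database where every A row points to one B row."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE B (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE A (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES B(id));
        INSERT INTO B VALUES (1, 'x'), (2, 'y'), (3, 'z');
        INSERT INTO A VALUES (10, 1), (11, 2), (12, 3), (13, 1);
        """
    )
    return db


def load_with_join(db):
    """One query: each A row arrives together with its B row."""
    rows = db.execute(
        "SELECT A.id, B.id, B.name FROM A JOIN B ON B.id = A.b_id ORDER BY A.id"
    ).fetchall()
    return [(a_id, (b_id, name)) for a_id, b_id, name in rows], 1


def load_n_plus_1(db):
    """One query for the A's, then one query per A row to fetch its B."""
    a_rows = db.execute("SELECT id, b_id FROM A ORDER BY id").fetchall()
    queries = 1
    result = []
    for a_id, b_id in a_rows:
        b = db.execute("SELECT id, name FROM B WHERE id = ?", (b_id,)).fetchone()
        queries += 1
        result.append((a_id, b))
    return result, queries


def load_with_where_in(db):
    """Two queries: the A's, then all referenced B's at once."""
    a_rows = db.execute("SELECT id, b_id FROM A ORDER BY id").fetchall()
    ids = sorted({b_id for _, b_id in a_rows})
    placeholders = ", ".join("?" for _ in ids)
    b_by_id = {
        row[0]: row
        for row in db.execute(
            f"SELECT id, name FROM B WHERE id IN ({placeholders})", ids
        )
    }
    return [(a_id, b_by_id[b_id]) for a_id, b_id in a_rows], 2
```

With the 4 demo rows, the naive loop issues exactly 5 queries. A smarter loader that remembers the B’s it has already fetched would issue fewer, which is why N + 1 is an upper bound.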

Most of the answers you’ll find to the question “which one is faster?” will be along the lines of “JOIN will always be faster”, “it depends”, and “you should benchmark it”.

That’s exactly what we’re here for today.

The benchmark

I’m using a sample employee database readily available on GitHub. I’m loading entries from the salaries table, each salary record referencing a distinct record from the employees table:

https://medium.com/media/183290dd23c778170d2a8dc1f2f3809c/href

No surprise here, JOIN is indeed, by far, the fastest way, followed by the WHERE IN approach.

The N+1 query performance, on the other hand, drops drastically as soon as you’re selecting more than 1 record: we’re talking about a 10x decrease in performance when loading just 15 records!

The reason is that each query sent to the database incurs a fixed cost, even under ideal conditions such as these: a connection through a local socket, using real, non-emulated prepared statements.

When does it matter?

When you write your database queries by hand, you usually know what you’re doing, and should already be using JOINs as appropriate.

A common situation where the N+1 problem occurs, though, is when you use an ORM that abstracts these queries for you, such as Doctrine 2 for PHP. If you’re not careful, you may load a collection of entities and traverse associations (Salary -> Employee in this case), and the ORM will lazy-load each record as you access it, effectively issuing N+1 queries, or more if you have nested relationships.

Given the huge performance gap, you should be on the lookout for every N+1 your application may trigger:

always eager load associations that you know you’re going to traverse
keep an eye on your logs to ensure that you’re not running N+1 queries without knowing it!
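
One way to keep an eye on the logs is to look for the same statement shape repeated many times with different values. The sketch below assumes you already have the logged SQL statements as a list of strings; the function name and the threshold are made up for the example, and the literal-stripping is deliberately crude.

```python
import re
from collections import Counter


def repeated_query_shapes(logged_sql, threshold=5):
    """Return statement shapes that occur at least `threshold` times.

    Quoted strings and standalone numbers are replaced by '?', so that
    'WHERE id = 1' and 'WHERE id = 2' count as the same shape.
    """
    shapes = Counter()
    for sql in logged_sql:
        shape = re.sub(r"'[^']*'", "?", sql)      # string literals
        shape = re.sub(r"\b\d+\b", "?", shape)    # numeric literals
        shape = " ".join(shape.split())           # normalize whitespace
        shapes[shape] += 1
    return {shape: n for shape, n in shapes.items() if n >= threshold}
```

Fed with one SELECT on A followed by 15 single-row SELECTs on B, it reports only the B statement, with a count of 15. A shape that repeats once per row of an earlier result is a likely N+1.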

The exception to the rule

If a lot of your referenced records are the same (a lot of A’s pointing to the same B), then JOINing records will lead to a lot of duplicated data in the result set, while an N+1 or WHERE IN approach would load the referenced records only once.

How does JOIN stack up against multiple queries, in this case?

Here is a benchmark of the best case scenario, where all salaries point to the same employee:

https://medium.com/media/95c6a96ddc1ee4b64eeabea46beac034/href

(Note that I did not benchmark WHERE IN here, as for a single record, it will be the same as N+1.)

JOIN still wins here, but we see a trend as the curves seem to be close to bumping into each other. Let’s push the number of records further:

https://medium.com/media/1daa5de04799ed295b766af3d0ef436e/href

Ah, now you have it. N+1 is faster than JOIN. For this to happen, however, you need to select a lot of records in A that point to a ridiculously low number of distinct records in B. I’d argue that this is rarely, if ever, the case in any application, so you can safely always use JOIN and be happy!

This holds, of course, only if the foreign key is on A (-to-one relationship); if the foreign key is on B (-to-many relationship), you may end up with rows from A duplicated many times, and are probably better off with a WHERE IN approach.

The benchmarks above have been run on PHP 7 (PDO) and MySQL 8 on localhost, using prepared statements. Your mileage will vary if you use another programming language / connector / database / dataset, but I expect the overall trend to be very similar. Drop a comment if you find out that it’s not!

You can find the benchmark code in this Gist.

Removing the MySQL root password

Benjamin Morel — Thu, 07 Sep 2017 11:58:40 GMT

There you go again. You just spun up a virtual machine to do some testing, installed MySQL using your favourite package manager, started the server, and failed to connect:

$ mysql --user=root
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)

Since version 5.7, MySQL is secure-by-default:

a random root password is generated upon installation; you need to read this password from the server log
you have to change this password the first time you connect
you cannot use a blank password because of the validate_password plugin

This is all good security-wise. But if you’re just installing MySQL on a local VM for your own testing, this can become really annoying.

To remove the MySQL root password, just run the following script right after installing and starting the MySQL server:

On MySQL 5.7:

https://medium.com/media/ff791199c08f7ee599cfde8c0e1a3c50/href

On MySQL 8.0:

https://medium.com/media/cfa451f50d126b84edbceac6d7582e2f/href

Note: you must execute this script as root.

The script performs the following actions:

reads the temporary password from the log file
changes this password to another temporary password that passes the validate_password checks
uninstalls the validate_password plugin (or component in MySQL 8)
sets a blank password
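
The steps above can be sketched as follows. This is not the author’s script: it is an illustration in Python that only extracts the temporary password from the log text and builds the SQL statements to run, without connecting to any server. The function names and the interim password are hypothetical, and the interim password must not contain a single quote.

```python
import re

# MySQL writes a line ending in this text to its error log at installation.
TEMP_PASSWORD_RE = re.compile(
    r"temporary password is generated for root@localhost: (\S+)"
)


def read_temporary_password(log_text):
    """Return the most recent temporary root password found in the log text."""
    matches = TEMP_PASSWORD_RE.findall(log_text)
    if not matches:
        raise ValueError("no temporary root password found in the log")
    return matches[-1]


def password_reset_statements(interim_password, mysql_major_version):
    """Build the statements for steps 2 to 4 of the list above.

    validate_password is a plugin in MySQL 5.7 and a component in MySQL 8.
    """
    if mysql_major_version >= 8:
        uninstall = "UNINSTALL COMPONENT 'file://component_validate_password';"
    else:
        uninstall = "UNINSTALL PLUGIN validate_password;"
    return [
        # Step 2: a password strong enough to pass the validation checks.
        f"ALTER USER 'root'@'localhost' IDENTIFIED BY '{interim_password}';",
        # Step 3: remove the password validation.
        uninstall,
        # Step 4: set a blank password.
        "ALTER USER 'root'@'localhost' IDENTIFIED BY '';",
    ]
```

The statements would then be run as root, logging in with the temporary password for the first one.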

You can now connect without a password:

$ mysql --user=root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 5.7.19 MySQL Community Server (GPL)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Be careful: this leaves your MySQL installation unsecured. You should not use it for anything serious!

A secure alternative

If you’re mainly using MySQL from the command line, you can keep the root account protected by a password, while still avoiding the inconvenience of having to provide the password on the command line.

Just create a ~/.my.cnf file:

[client]
user = root
password = xxx

You can now just type mysql, and the MySQL client will automatically log in with these credentials.

Creating a Linux service with systemd

Benjamin Morel — Tue, 05 Sep 2017 10:34:21 GMT

Crafting your own services — Photo by Jeff Sheldon on Unsplash

While writing web applications, I often need to offload compute-heavy tasks to an asynchronous worker script, schedule tasks for later, or even write a daemon that listens to a socket to communicate with clients directly.

While there might sometimes be better tools for the job (always consider using existing software first, such as a task queue server), writing your own service can give you a level of flexibility you’ll never get when bound by the constraints of third-party software.

The cool thing is that it’s fairly easy to create a Linux service: use your favourite programming language to write a long-running program, and turn it into a service using systemd.

The program

Let’s create a small server using PHP. I can see your eyebrows rising, but it works surprisingly well. We’ll listen to UDP port 10000, and return any message received with a ROT13 transformation:

https://medium.com/media/c5940ba07d3c126c852e8530332258fc/href https://medium.com/media/674e2c2adc4160cb6bf575a674c8dc1c/href

Let’s start it:

$ php server.php

And test it in another terminal:

$ nc -u 127.0.0.1 10000
Hello, world!
Uryyb, jbeyq!

Cool, it works. Now we want this script to run at all times, be restarted in case of a failure (unexpected exit), and even survive server restarts. That’s where systemd comes into play.
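
Any language that can run a long-lived process will do. For comparison, here is a minimal sketch of the same idea in Python rather than PHP: a UDP server that answers each datagram with its ROT13 transformation. The function names, the 4096-byte buffer and the 127.0.0.1 default are assumptions made for the example.

```python
import socket

_LOWER = b"abcdefghijklmnopqrstuvwxyz"
_UPPER = _LOWER.upper()
# Translation table that rotates ASCII letters by 13 and leaves other bytes alone.
_ROT13 = bytes.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


def rot13(data):
    """Apply ROT13 to a bytes object."""
    return data.translate(_ROT13)


def open_server_socket(host="127.0.0.1", port=10000):
    """Create a UDP socket bound to the given address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def serve(sock, max_messages=None):
    """Answer each datagram with its ROT13 version.

    Runs forever unless max_messages is given (handy for testing).
    """
    handled = 0
    while max_messages is None or handled < max_messages:
        data, client = sock.recvfrom(4096)
        sock.sendto(rot13(data), client)
        handled += 1
```

Calling serve(open_server_socket()) runs the server until the process is stopped, which is exactly the kind of long-running program systemd can supervise.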

Turning it into a service

Let’s create a file called /etc/systemd/system/rot13.service:

[Unit]
Description=ROT13 demo service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=centos
ExecStart=/usr/bin/env php /path/to/server.php

[Install]
WantedBy=multi-user.target

You’ll need to:

set your actual username after User=
set the proper path to your script in ExecStart=
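
If you create several such services, the two substitutions above are easy to automate. The following sketch fills them into the unit file shown earlier; the function name is hypothetical, and the check for an absolute path is an extra precaution of this example, not something the article requires.

```python
UNIT_TEMPLATE = """\
[Unit]
Description={description}
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User={user}
ExecStart=/usr/bin/env php {script_path}

[Install]
WantedBy=multi-user.target
"""


def render_unit(user, script_path, description="ROT13 demo service"):
    """Return the unit file text with User= and ExecStart= filled in."""
    if not script_path.startswith("/"):
        raise ValueError("use an absolute path to the script")
    return UNIT_TEMPLATE.format(
        description=description, user=user, script_path=script_path
    )
```

The returned text is what you would save as /etc/systemd/system/rot13.service.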

That’s it. We can now start the service:

$ systemctl start rot13

And automatically get it to start on boot:

$ systemctl enable rot13

Going further

Now that your service (hopefully) works, it may be important to dive a bit deeper into the configuration options, and ensure that it will always work as you expect it to.

Starting in the right order

You may have wondered what the After= directive does. It tells systemd to start your service only after the listed unit, so here your service starts once network.target has been reached. If your program expects the MySQL server to be up and running, you should add:

After=mysqld.service

Restarting on exit

By default, systemd does not restart your service if the program exits for whatever reason. This is usually not what you want for a service that must be always available, so we’re instructing it to always restart on exit:

Restart=always

You could also use on-failure to only restart if the exit status is not 0.

By default, systemd attempts a restart after 100 ms. You can specify the number of seconds to wait before attempting a restart, using:

RestartSec=1

Avoiding the trap: the start limit

I personally fell into this one more than once. By default, when you configure Restart=always as we did, systemd gives up restarting your service if it fails to start more than 5 times within a 10-second interval. Forever.

There are two [Unit] configuration options responsible for this:

StartLimitBurst=5
StartLimitIntervalSec=10

The RestartSec directive also has an impact on the outcome: if you set it to restart after 3 seconds, then you can never reach 5 failed retries within 10 seconds.

The simple fix that always works is to set StartLimitIntervalSec=0. This way, systemd will attempt to restart your service forever.

It’s a good idea to set RestartSec to at least 1 second though, to avoid putting too much stress on your server when things start going wrong.

As an alternative, you can leave the default settings, and ask systemd to reboot your server if the start limit is reached, using StartLimitAction=reboot.
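
The interaction between RestartSec and the start limit can be checked with a small simulation. This is a simplified model, not systemd’s actual code: it assumes the program fails instantly on every start, and counts starts in a fixed window that begins with the first start and resets once the interval has elapsed. The function name is made up for the example.

```python
def hits_start_limit(restart_sec, burst=5, interval_sec=10.0, max_starts=1000):
    """Return True if a service that always fails immediately is given up on.

    restart_sec models RestartSec=, burst models StartLimitBurst= and
    interval_sec models StartLimitIntervalSec= (0 disables the limit).
    """
    if interval_sec == 0:
        return False
    window_begin = None
    count = 0
    now = 0.0
    for _ in range(max_starts):
        if window_begin is None or now - window_begin > interval_sec:
            # The previous window has expired: start counting again.
            window_begin = now
            count = 1
        elif count < burst:
            count += 1
        else:
            # One start too many within the window.
            return True
        now += restart_sec
    return False
```

In this model, the default 100 ms delay and a 1 second delay both hit the limit, a 3 second delay never does, and an interval of 0 never does whatever the delay.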

Is that really it?

That’s all it takes to create a Linux service with systemd: writing a small configuration file that references your long-running program.

Systemd has been the default init system in RHEL/CentOS, Fedora, Ubuntu, Debian and others for several years now, so chances are that your server is ready to host your homebrew services!

High-speed inserts with MySQL

Benjamin Morel — Mon, 04 Sep 2017 18:03:25 GMT

Get the dolphin up to speed — Photo by JIMMY ZHANG on Unsplash

When you need to bulk-insert many millions of records into a MySQL database, you soon realize that sending INSERT statements one by one is not a viable solution.

The MySQL documentation has some INSERT optimization tips that are worth reading to start with.

I will try to summarize here the two main techniques to efficiently load data into a MySQL database.

LOAD DATA INFILE

If you’re looking for raw performance, this is indubitably your solution of choice. LOAD DATA INFILE is a highly optimized, MySQL-specific statement that directly inserts data into a table from a CSV / TSV file.

There are two ways to use LOAD DATA INFILE. You can copy the data file to the server's data directory (typically /var/lib/mysql-files/) and run:

LOAD DATA INFILE '/path/to/products.csv' INTO TABLE products;

This is quite cumbersome as it requires you to have access to the server’s filesystem, set the proper permissions, etc.

The good news is, you can also store the data file on the client side, and use the LOCAL keyword:

LOAD DATA LOCAL INFILE '/path/to/products.csv' INTO TABLE products;

In this case, the file is read from the client’s filesystem, transparently copied to the server’s temp directory, and imported from there. All in all, it’s almost as fast as loading from the server’s filesystem directly. You do need to ensure that this option is enabled on your server, though.

There are many options to LOAD DATA INFILE, mostly related to how your data file is structured (field delimiter, enclosure, etc.). Have a look at the documentation to see them all.

While LOAD DATA INFILE is your best option performance-wise, it requires you to have your data ready as delimiter-separated text files. If you don’t have such files, you’ll need to spend additional resources to create them, and will likely add a level of complexity to your application. Fortunately, there’s an alternative.
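
If you do have to generate the file yourself, the format is simple. The sketch below produces lines in the default LOAD DATA format: fields separated by tabs, lines terminated by a newline, backslash as the escape character, and \N for NULL. The function names are made up for the example, and every non-None value is simply converted with str(), which will not suit every column type.

```python
def escape_field(value):
    """Format one value for a tab-separated LOAD DATA file."""
    if value is None:
        return "\\N"  # NULL marker
    text = str(value)
    # Escape the backslash first, so the escapes added below are not doubled.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_tsv_line(row):
    """Format one record as a single newline-terminated line."""
    return "\t".join(escape_field(value) for value in row) + "\n"
```

Writing to_tsv_line(row) for each record to a file opened with newline="" gives a file that the statements above can load without any extra options.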

Extended inserts

A typical SQL INSERT statement looks like:

INSERT INTO user (id, name) VALUES (1, 'Ben');

An extended INSERT groups several records into a single query:

INSERT INTO user (id, name) VALUES (1, 'Ben'), (2, 'Bob');

The key here is to find the optimal number of inserts per query to send. There is no one-size-fits-all number, so you need to benchmark a sample of your data to find out the value that yields the maximum performance, or the best tradeoff in terms of memory usage / performance.

To get the most out of extended inserts, it is also advised to:

use prepared statements
run the statements in a transaction
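
Put together, these recommendations look roughly like the sketch below. It is not BulkInserter: it is a small Python illustration that uses SQLite so it can run anywhere, and the function names are made up. Table and column names are inserted into the SQL as-is, so they must come from trusted code, never from user input.

```python
import sqlite3
from itertools import islice


def build_insert_sql(table, columns, row_count):
    """Build one extended INSERT with a '?' placeholder for every value."""
    row_placeholders = "(" + ", ".join("?" for _ in columns) + ")"
    return "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_placeholders] * row_count)
    )


def bulk_insert(db, table, columns, rows, per_query=100):
    """Insert rows in groups of per_query, all in one transaction.

    Returns the number of INSERT statements sent.
    """
    iterator = iter(rows)
    statements = 0
    with db:  # commits on success, rolls back on error
        while True:
            chunk = list(islice(iterator, per_query))
            if not chunk:
                break
            sql = build_insert_sql(table, columns, len(chunk))
            params = [value for row in chunk for value in row]
            db.execute(sql, params)
            statements += 1
    return statements
```

The per_query argument is the number to benchmark: 250 rows with per_query=100 are sent as 3 statements instead of 250.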

The benchmark

I’m inserting 1.2 million rows, 6 columns of mixed types, ~26 bytes per row on average. I tested two common configurations:

Client and server on the same machine, communicating through a UNIX socket
Client and server on separate machines, on a very low latency (< 0.1 ms) Gigabit network

As a basis for comparison, I copied the table using INSERT … SELECT, yielding a performance of 313,000 inserts / second.

LOAD DATA INFILE

To my surprise, LOAD DATA INFILE proves faster than a table copy:

LOAD DATA INFILE: 377,000 inserts / second
LOAD DATA LOCAL INFILE over the network: 322,000 inserts / second

The difference between the two numbers seems to be directly related to the time it takes to transfer the data from the client to the server: the data file is 53 MiB in size, and the timing difference between the two benchmarks is 543 ms, which would represent a transfer speed of 780 Mbps, close to the Gigabit speed.
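
The arithmetic behind these figures can be reproduced in a few lines. The inputs are the numbers quoted in this article; the variable names are made up. Note that 53 × 8 / 0.543 gives about 781 when counted in binary megabits, and about 819 when counted in decimal megabits; either way the conclusion is the same.

```python
ROWS = 1_200_000
FILE_SIZE_MIB = 53
TIME_DIFF_S = 0.543  # measured difference quoted in the article


def seconds_for(rows, inserts_per_second):
    """Time needed to insert `rows` at the given rate."""
    return rows / inserts_per_second


# Difference derived from the two rounded throughput figures (about 544 ms).
derived_diff_s = seconds_for(ROWS, 322_000) - seconds_for(ROWS, 377_000)

file_bits = FILE_SIZE_MIB * 1024 * 1024 * 8
mibit_per_s = file_bits / (1024 * 1024) / TIME_DIFF_S  # binary megabits
mbit_per_s = file_bits / 1_000_000 / TIME_DIFF_S       # decimal megabits
```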

This means that, in all likelihood, the MySQL server does not start processing the file until it is fully transferred: your insert speed is therefore directly related to the bandwidth between the client and the server, which is important to take into account if they are not located on the same machine.

Extended inserts

I measured the insert speed using BulkInserter, a PHP class that is part of an open-source library that I wrote, with up to 10,000 inserts per query:

As we can see, the insert speed rises quickly as the number of inserts per query increases. We got a 6× increase in performance on localhost and a 17× increase over the network, compared to the sequential INSERT speed:

40,000 → 247,000 inserts / second on localhost
12,000 → 201,000 inserts / second over the network

It takes around 1,000 inserts per query to reach the maximum throughput in both cases, but 40 inserts per query are enough to achieve 90% of this throughput on localhost, which could be a good tradeoff here. It’s also important to note that after a peak, the performance actually decreases as you throw in more inserts per query.

The benefit of extended inserts is higher over the network, because sequential insert speed becomes a function of your latency:

max sequential inserts per second ~= 1000 / ping in milliseconds

The higher the latency between the client and the server, the more you’ll benefit from using extended inserts.
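
As a worked example of the rule of thumb above (the function names are made up, and the rule ignores the time the server spends executing each statement):

```python
def max_sequential_inserts_per_second(ping_ms):
    """Upper bound on one-at-a-time inserts for a given round-trip time."""
    if ping_ms <= 0:
        raise ValueError("ping_ms must be positive")
    return 1000.0 / ping_ms


def implied_round_trip_ms(inserts_per_second):
    """Round-trip time that would cap sequential inserts at this rate."""
    return 1000.0 / inserts_per_second
```

A 0.5 ms ping caps you at 2,000 sequential inserts per second, and a 20 ms ping at 50. Conversely, the 12,000 inserts per second measured over the network corresponds to a round trip of about 0.083 ms, in line with the < 0.1 ms latency of the test network.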

Conclusion

As expected, LOAD DATA INFILE is the preferred solution when looking for raw performance on a single connection. It requires you to prepare a properly formatted file, so if you have to generate this file first, and/or transfer it to the database server, be sure to take that into account when measuring insert speed.

Extended inserts, on the other hand, do not require a temporary text file, and can give you around 65% of the LOAD DATA INFILE throughput, which is a very reasonable insert speed. It’s interesting to note that it doesn’t matter whether you’re on localhost or over the network: grouping several inserts in a single query always yields better performance.

If you decide to go with extended inserts, be sure to test your environment with a sample of your real-life data and a few different inserts-per-query configurations before deciding upon which value works best for you.

Be careful when increasing the number of inserts per query, as it may require you to:

allocate more memory on the client side
increase the max_allowed_packet setting on the MySQL server

As a final note, it’s worth mentioning that according to Percona, you can achieve even better performance using concurrent connections, partitioning, and multiple buffer pools. See this post on their blog for more information.

The benchmarks have been run on a bare metal server running CentOS 7 and MySQL 5.7, with a Xeon E3 @ 3.8 GHz, 32 GB RAM and NVMe SSD drives. The MySQL benchmark table uses the InnoDB storage engine.

The benchmark source code can be found in this gist. The benchmark result graph is available on plot.ly.