Postgresql Replication, Load Balancing, Backups (and you!)

(Posted: Friday, March 4, 2022)

Postgresql can be daunting and confusing. It’s a FANTASTIC tool to store your data, but it can require a lot of heavy lifting in sysops/management that it doesn’t necessarily provide out of the box. BUT WHO CAN BLAME THEM, it’s an open source RDBMS with a focus on being a bit modular and open, and letting users/developers decide what they want to get out of it.

That said, it’s still daunting and confusing, and it doesn’t help that for whatever reason Google’s search results for their official documentation still point to Postgresql 9 (the current version is 14 btw). So I’m writing this to make some notes/links for myself so I remember what the heck I learned and how I got here. First, some terminology to get through super quick.

Terminology

Write-Ahead-Logs or WALs

WALs keep a record of the constant changes in your db, and they get used for a few things, mainly replication and backups. For backups, you typically want to archive the WALs after postgres has verified it’s done with them; they allow an easy ‘replay’ of events that have happened in your database. The same goes for replication: they’re fantastic at replaying the DB events of your master over to your secondaries/replicas.
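If you just want the vanilla archiving piece without any fancy tooling, it boils down to a couple of postgresql.conf settings. A rough sketch (the /mnt/walarchive path is a made-up example, point it at wherever you actually keep archives):

wal_level = replica
archive_mode = on
# postgres runs this once per finished WAL segment; %p is the segment's path, %f its file name
archive_command = 'test ! -f /mnt/walarchive/%f && cp %p /mnt/walarchive/%f'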

Replication Slots

Replication slots are typically defined in your master db. You can see your current replication slots with select * from pg_catalog.pg_replication_slots. Typically this keeps track of how out of sync your replicas are, and what WAL chunk they are on. You’ll be able to tell pretty quickly if a replica is down by the active column, as well as the xmin/restart_lsn being out of sync.
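If you want a rough idea of how much WAL a slot is holding onto, something like this on the primary does the trick (pg_current_wal_lsn/pg_wal_lsn_diff are the PG 10+ names; older versions call these pg_xlog_*):

select slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_catalog.pg_replication_slots;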

Tablespaces

Typically a tablespace is where uh, the TABLES (or indexes, views, etc) are taking SPACE in your storage. This lets you create multiple tablespaces on different drives if you want to, and move elements around freely. You generally would want to do this for performance reasons. For example: You could keep tables or indexes you use more often on a fast NVME drive, vs more block/cold storage stuff on spinning/SATA drives, or do something more wacky like have a tablespace on an NFS/SMB share.
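Moving things around is just a couple of statements. A quick sketch (the path, table, and index names here are made up, swap in your own):

-- the directory has to exist already and be owned by the postgres user
create tablespace fast_nvme location '/mnt/nvme/pg_tblspc';
alter table big_busy_table set tablespace fast_nvme;
alter index big_busy_table_pkey set tablespace fast_nvme;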

The Basics

Related reading:

Postgresql provides an apt repository for Ubuntu to get the latest and greatest versions directly from them. For Postgresql 12, I’ve been using this one-liner to download/install/start Postgresql all at once (though you should really refer to the above guide if it’s changed since then).

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - && sudo apt-get update && sudo apt-get -y install postgresql-12 && sudo systemctl stop postgresql && sudo ln -s /usr/lib/postgresql/12/bin/* /usr/sbin/ && sudo systemctl start postgresql

You can then just login as the postgres user on Ubuntu and set up your configs in /etc/postgresql, or run psql and get direct access to add your own users for remote access. Make sure you take a look at the pg_hba.conf to give your IP range/users proper access or postgresql might reject your connections.
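As a sketch, the pg_hba.conf entries for a remote app user and the replication user used in the next section could look something like this (the 10.0.0.0/24 subnet and usernames are just placeholders for your own network/users):

# TYPE  DATABASE     USER         ADDRESS        METHOD
host    all          myappuser    10.0.0.0/24    md5
host    replication  replication  10.0.0.0/24    md5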

Replication

Related reading:

Let’s assume you have two machines (virtual or not) and you have a postgresql install on each, and some kind of replication user in between them and pg_hba.conf is set up so they can both see/talk to each other. To start the replication process, I can run this command on the secondary as the postgres user (fill in your master IP as appropriate):

/usr/lib/postgresql/12/bin/pg_basebackup -h {YOUR MASTER IP} -D /var/lib/postgresql/12/main -U replication -P -Xs -S secondary --create-slot -R --checkpoint=fast &&\
/usr/lib/postgresql/12/bin/pg_ctl -D /etc/postgresql/12/main -l /var/log/postgresql/postgresql-12.log -w start

This will start a pg_basebackup process that will stream all the master’s contents to the secondary (including absolute tablespace paths), under the replication slot (which it will create itself). The tablespace paths are a big one, so you should make absolutely sure you have your mounts etc set up BEFORE you run this command, or you might get a bloated root drive really fast. Not that I’ve ever done that…

Also worth noting that the 2nd part of that line starts postgresql after the base backup finishes. You want to do this right away, or else your master will just be hoarding WALs for the replication slot, waiting for the replica to come up so it can send them.
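Once the secondary is up, it’s worth a quick sanity check that streaming is actually happening; one query on the primary, one on the secondary:

-- on the primary: one row per connected replica with its WAL positions
select client_addr, state, sent_lsn, replay_lsn from pg_stat_replication;

-- on the secondary: returns 't' while it's running as a replica
select pg_is_in_recovery();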

Hot Standby

Related reading:

Adding hot_standby=true to our secondary’s config opens the secondary up for read-only connections/queries, meaning we can start using it for queries we don’t want to bother the primary with. The primary could be doing a lot of writing/CPU cycles, while secondaries in comparison use way fewer resources just sitting with data. There’s a catch though, well not really a ‘catch’ but more of a heads up. Since the secondary is constantly replaying changes from the primary, a long-running query can conflict with that replay and hold it up, leaving the data it sees increasingly out of date. So for read-only queries we have to set a time limit, so queries can’t take “forever” on the secondary and WAL replay always gets a chance to catch up.

You can set this limit with max_standby_archive_delay and/or max_standby_streaming_delay; I like to keep it pretty long, like 10-15 mins. It also serves as a great way to keep bad queries at bay, since the query will get cancelled after your limit whether you like it or not.
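On the secondary that works out to just a few postgresql.conf lines; a sketch using the 15 minute limit mentioned above (tune to taste):

hot_standby = on
max_standby_streaming_delay = 15min
max_standby_archive_delay = 15min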

Load Balancing

Related reading:

If we have more than one server, we might as well use all of them the best we can to get the best performance per dollar. Postgresql doesn’t seem to have any kind of built-in solution for this, so we have to find one/make one ourselves. I like to do this pretty minimally using a combination of HAProxy and xinetd: HAProxy being a well known proxy/load balancer that is very easy to configure, and xinetd as a super small HTTP server on our postgresql servers to tell HAProxy whether they are up/down, or primary/secondary, or whatever you want based on an HTTP response code.

(Protip: Once you get comfortable with xinetd, I bet you can think of a lot of creative ways to output for other services in your infra)

xinetd

Starting with xinetd, make a file in /etc/xinetd.d/pgsqlchk with the following contents:

service pgsqlchk  
{  
        flags           = REUSE  
        socket_type     = stream  
        port            = 23267  
        wait            = no  
        user            = nobody  
        server          = /opt/pgsqlchk  
        log_on_failure  += USERID  
        disable         = no  
        only_from       = 0.0.0.0/0  
        per_source      = UNLIMITED  
}  

This defines a service called pgsqlchk that will run on port 23267 and will output the stdout of /opt/pgsqlchk. Super simple. Now let’s make /opt/pgsqlchk as a simple bash script.

#!/bin/bash

# Connection settings used by the psql check below
PGSQL_HOST="localhost"
PGSQL_PORT="5432"
PGSQL_DATABASE="postgres"
PGSQL_USERNAME="postgres"
export PGPASSWORD="passwd"

# -t/-A strip the header and alignment so we get a bare 't' or 'f' back
VALUE=$(/usr/bin/psql -h "$PGSQL_HOST" -p "$PGSQL_PORT" -d "$PGSQL_DATABASE" -U "$PGSQL_USERNAME" -tA -c "select pg_is_in_recovery()" 2> /dev/null)

if [ "$VALUE" == "t" ]
then
    # In recovery = this server is a standby/replica
    /bin/echo -e "HTTP/1.1 206 OK\r\n"
    /bin/echo -e "Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo "Standby"
    /bin/echo -e "\r\n"
elif [ "$VALUE" == "f" ]
then
    # Not in recovery = this server is the primary
    /bin/echo -e "HTTP/1.1 200 OK\r\n"
    /bin/echo -e "Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo "Primary"
    /bin/echo -e "\r\n"
else
    # psql failed entirely, report the database as down
    /bin/echo -e "HTTP/1.1 503 Service Unavailable\r\n"
    /bin/echo -e "Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo "DB Down"
    /bin/echo -e "\r\n"
fi

xinetd will run this script when a client hits port 23267 and send its stdout back as the response. The script logs into postgres, checks if the server is “in recovery” (recovery aka a secondary), and depending on whether the result is t or f returns HTTP 206 or 200; anything else falls through to HTTP 503, which indicates the server is down. We can use this little service with HAProxy so it knows which server is which. Make sure you start the xinetd service too!
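After dropping the script in place it’s worth poking it by hand before wiring up HAProxy; roughly (assuming xinetd is already installed and the port/path from above):

sudo chmod +x /opt/pgsqlchk
sudo systemctl restart xinetd
# on the primary this should print the 200/Primary response; on a replica, 206/Standby
curl -i http://localhost:23267/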

HAProxy

All we need to do now is point HAProxy to our IPs and xinetd service and let those connections fly. The config can be as minimal or as crazy as you want, but let’s do a simple config (or see all the available options here). Let’s assume you have two postgresql servers that are ready to take 100 connections each concurrently, a primary and secondary.

global
    maxconn 200

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen live_replicas
    bind *:6005
    mode tcp
    option httpchk
    http-check expect status 206
    balance roundrobin
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server {A POSTGRESQL SERVER} {A POSTGRESQL SERVER}:5432 maxconn 100 check port 23267
    server {A POSTGRESQL SERVER} {A POSTGRESQL SERVER}:5432 maxconn 100 check port 23267

We’re going to bind to port 6005 for our connections, using option httpchk against the xinetd service we set up, and http-check expect status 206 is the line that tells HAProxy to only keep servers answering as replicas in the pool. Finally we define each of our server lines with our IPs/hostnames/ports, and what port they need to check to look for that sweet sweet HTTP 206. You can check the state of things yourself by hitting port 7000 in a browser to see if everything is green/gravy.

When done, just start pointing your postgresql clients to your HAProxyIP:6005 and you’ll land on one of the replica postgresql instances.
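If you also want a port that always lands on the primary (for writes), the same trick works in reverse; a hypothetical extra block expecting the 200 the check script returns for the primary (port 6004 is just an example):

listen primary_writes
    bind *:6004
    mode tcp
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server {A POSTGRESQL SERVER} {A POSTGRESQL SERVER}:5432 maxconn 100 check port 23267
    server {A POSTGRESQL SERVER} {A POSTGRESQL SERVER}:5432 maxconn 100 check port 23267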

Backups

Related reading:

When you get to REALLY large databases, like >1TB in size, the included pg_basebackup or pg_dump tools don’t really cut it. pg_dump only does a plain SQL data export (indexes etc have to be rebuilt on restore) and the recovery is awful, and pg_basebackup can’t do parallel threads and has some WAL jank when backing up from the primary (tldr; it blocked vacuuming once and almost put us in an unrecoverable state and now I don’t trust it anymore).

I’ve started using pgBackRest and it seems to have solved all my woes/problems, at least for now.

Setting up a stanza is really easy too. A stanza is pgBackRest’s term for the backup configuration of a particular cluster: where the backups go, retention, and so on, and you can have multiple different stanzas with different options/configs for each. So you could have a stanza backing up to a local repo, or one pointed at Amazon S3/some cloud service with a different retention, or something wackier. It doesn’t feel out of reach/control and gives you a lot of flexibility.

After creating a stanza (by following their guide), a config can be as easy as:

[global]
repo1-path=/mnt/somenfshare/pgbackups/
repo1-retention-full=2
repo1-retention-diff=3
start-fast=y
process-max=15

[global:archive-push]
compress-level=3

[justatest]
pg1-path=/var/lib/postgresql/12/main

I set up a justatest stanza with a full backup retention of 2 and a diff backup retention of 3. When the number of full backups goes over 2, the oldest full backup (and its diff backups) gets expired/deleted automatically (along with the WALs we’ve been archiving for it). We’re also compressing the WALs we push to the archive at level 3, and letting backups use up to 15 processes at once.
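For the WAL archiving half of that to actually work, the primary’s postgresql.conf has to hand finished WAL segments over to pgBackRest; roughly (stanza name matching the config above):

archive_mode = on
archive_command = 'pgbackrest --stanza=justatest archive-push %p'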

Running a full backup manually is as simple as pgbackrest --stanza=justatest --type=full --log-level-console=info backup. Making a diff backup is as easy as changing type to diff too, which makes it really easy to script, but generally this should all be part of a cron task (like mentioned in their docs).
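As a sketch, a crontab for the postgres user could look something like this (the times/days are arbitrary, pick your own quiet hours):

# full backup sunday morning, diffs the rest of the week
30 6 * * 0   pgbackrest --stanza=justatest --type=full backup
30 6 * * 1-6 pgbackrest --stanza=justatest --type=diff backup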

On a ~10TB database, pg_basebackup would take about 4 days to complete; pgBackRest takes literally just under 12hrs, which is a MASSIVE improvement, AN EIGHT TIMES SPEED INCREASE.

Things that can/will go wrong

WALs bloat in pg_wal and/or explode in size

This typically happens if one of your replicas is down, or the database has a lot of changes all at once and hasn’t had time to expire/clear them out. This can fill a storage device pretty quickly if you aren’t ready for it, and crash your database/server if not handled correctly. For a replica being down/out of commission, you probably want to DROP the replication slot, so the primary database stops saving the WALs for it.
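A quick way to see how much space the WALs are actually taking is to ask postgres directly; a rough check (pg_ls_waldir is PG 10+ and needs superuser or the pg_monitor role):

select count(*) as wal_files, pg_size_pretty(sum(size)) as total_size
from pg_ls_waldir();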

If you need to, the command is select pg_drop_replication_slot('{YOUR SLOT NAME HERE}');. You’ll of course have to re-replicate your data since the primary won’t be saving WALs for it anymore, but it will also save your primary from a storage disaster :D

Related reading: