Kubernetes DAP follower PSQL cannot be started

When starting followers in Kubernetes, they appear to authenticate to the seed service successfully, but the database then fails to start. Here is the beginning of the follower’s startup logs, before the Chef script runs:

Starting follower services...
Joined session keyring: 1006511828
*** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
*** Running /etc/my_init.d/01-clear-run.sh...
*** Running /etc/my_init.d/10_local_hosts.rb...
*** Running /etc/my_init.d/10_syslog-ng.init...
[2020-02-24T17:18:56.325349] WARNING: With use-dns(no), dns-cache() will be forced to 'no' too!;
[2020-02-24T17:18:56.367547] Error establishing SQL connection; type='pgsql', host='', port='5433', username='syslog-ng', database='audit', error='could not connect to server: No such file or directory\x0a Is the server running locally and accepting\x0a connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5433"?\x0a'
<45>1 2020-02-24T17:18:56.326+00:00 dap-follower-7f985798f5-7ccsc syslog-ng 19 - [meta sequenceId="1"] syslog-ng starting up; version='3.21.1'
*** Running /etc/my_init.d/dhgen.sh...
*** Booting runit daemon...
*** Runit started as PID 30
+ exec conjur-plugin-logger etcd
<78>1 2020-02-24T17:18:57.000+00:00 dap-follower-7f985798f5-7ccsc cron 45 - [meta sequenceId="2"] (CRON) INFO (pidfile fd = 3)
<78>1 2020-02-24T17:18:57.000+00:00 dap-follower-7f985798f5-7ccsc cron 45 - [meta sequenceId="3"] (CRON) INFO (Running @reboot jobs)
Unpacking seed...
<43>1 2020-02-24T17:19:56.421+00:00 dap-follower-7f985798f5-7ccsc syslog-ng 19 - [meta sequenceId="1"] Error establishing SQL connection; type='pgsql', host='', port='5433', username='syslog-ng', database='audit', error='could not connect to server: No such file or directory\x0a Is the server running locally and accepting\x0a connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5433"?\x0a'
tar: Removing leading `/' from member names
rm: cannot remove '/tmp/seedfile/follower-seed.tar': Read-only file system
W, [2020-02-24T17:20:00.327525 #86] WARN -- : command rm "/tmp/seedfile/follower-seed.tar" failed
Seed file was successfully unpacked.

After the Chef script runs, the database fails to start, and the same error repeats indefinitely:

Error: /usr/lib/postgresql/9.4/bin/pg_ctl /usr/lib/postgresql/9.4/bin/pg_ctl start -D /var/lib/postgresql/9.4/main -l /var/log/postgresql/postgresql-9.4-main.log -w -s -o -c config_file="/etc/postgresql/9.4/main/postgresql.conf" exited with status 1: 
pg_ctl: could not start server
Examine the log output.
Removed stale pid file.
Error: /usr/lib/postgresql/9.4/bin/pg_ctl /usr/lib/postgresql/9.4/bin/pg_ctl start -D /var/lib/postgresql/9.4/main -l /var/log/postgresql/postgresql-9.4-main.log -w -s -o -c config_file="/etc/postgresql/9.4/main/postgresql.conf" exited with status 1: 
pg_ctl: could not start server
Examine the log output.
Removed stale pid file.

Here are the contents of postgresql.conf on the pod:

data_directory = '/var/lib/postgresql/9.4/main'
datestyle = 'iso, mdy'
external_pid_file = '/var/run/postgresql/9.4-main.pid'
hba_file = '/etc/postgresql/9.4/main/pg_hba.conf'
hot_standby = on
ident_file = '/etc/postgresql/9.4/main/pg_ident.conf'
listen_addresses = '0.0.0.0'
log_destination = 'syslog'
log_line_prefix = ''
max_connections = 100
max_wal_senders = 16
node.default_text_search_config = 'pg_catalog.english'
port = 5432
shared_buffers = 32818047kB
ssl = on
ssl_ca_file = '/opt/conjur/etc/ssl/ca.pem'
ssl_cert_file = '/opt/conjur/etc/ssl/conjur.pem'
ssl_key_file = '/opt/conjur/etc/ssl/conjur.key'
unix_socket_directories = '/var/run/postgresql'
wal_keep_segments = 16
wal_level = 'hot_standby'
effective_cache_size = 98454141kB
work_mem = 1312721kB
maintenance_work_mem = 2097152kB
checkpoint_segments = 64
checkpoint_completion_target = 0.9
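
For scale, the memory-related settings above can be converted to GiB with a short sketch. The values are copied from the file; the helper name `setting_to_gib` is hypothetical, and the conversion assumes PostgreSQL's convention that kB, MB, and GB are multiples of 1024. It only handles plain integer values with one of those units.

```python
import re

# PostgreSQL memory units are multiples of 1024; values here are in KiB.
_UNITS_IN_KIB = {"kB": 1, "MB": 1024, "GB": 1024 ** 2}


def setting_to_gib(value):
    """Convert a setting such as '32818047kB' to GiB (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)\s*(kB|MB|GB)", value.strip())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS_IN_KIB[unit] / 1024 ** 2


# Memory settings copied from the postgresql.conf shown above.
settings = {
    "shared_buffers": "32818047kB",
    "effective_cache_size": "98454141kB",
    "work_mem": "1312721kB",
    "maintenance_work_mem": "2097152kB",
}

for name, value in settings.items():
    print(f"{name}: {setting_to_gib(value):.2f} GiB")
```

Comparing these figures with the memory actually available to the follower container is one way to judge whether the generated configuration fits the pod.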

Unfortunately, the PostgreSQL log at /var/log/postgresql/postgresql-9.4-main.log doesn’t help; it only repeats the pg_ctl message:

pg_ctl: could not start server
Examine the log output.
pg_ctl: could not start server
Examine the log output.

Here is what the DAP master node logged during this period:

<13>1 2020-02-24T17:18:54.109+00:00 6fe30da24fc4 nginx - - [meta sequenceId="20"] 10.2.5.8 "POST /authn-k8s/ut-dev/inject_client_cert HTTP/1.1" 200 5 "-" "Go-http-client/1.1" 0.223 0.223
<13>1 2020-02-24T17:18:54.109+00:00 6fe30da24fc4 nginx - - [meta sequenceId="21"] 10.2.5.8 "POST /authn-k8s/ut-dev/neovest/host%2Fconjur%2Fauthn-k8s%2Fut-dev%2Fapps%2Fdap%2Fservice_account%2Fdap-follower/authenticate HTTP/1.1" 200 660 "-" "Go-http-client/1.1" 0.127 0.127
<13>1 2020-02-24T17:18:54.109+00:00 6fe30da24fc4 nginx - - [meta sequenceId="22"] 10.2.5.8 "POST /configuration/neovest/seed/follower HTTP/1.1" 200 17408 "-" "Wget/1.20.3 (linux-musl)" 0.056 0.055

Thank you in advance for any insight.

Hi Josh,

The most frequent cause of a follower hanging on startup like this is insufficient resources allocated to the container or pod. Because each follower runs a full PostgreSQL database, our resource requirements are much higher than what is typically provisioned for a pod. The lowest we’ve seen run reliably is 2 CPU cores and 4GB of memory per follower container in the pod. Our minimum system requirements are 4 CPU cores and 8GB. Most Kubernetes and OCP clusters we’ve seen have a default resource limit policy that caps each container at 250-500m CPU and 256-512MB of memory. I’d check that first and see whether reserving more resources for the containers in the pod fixes the issue.
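
As a rough illustration of that check, here is a minimal Python sketch that compares a container's CPU and memory limits against the 2 core / 4GB floor mentioned above. The function names are hypothetical, "4GB" is assumed to mean 4 GiB, and the parser handles only a subset of Kubernetes quantity suffixes (no exponent notation), so treat it as a sketch and not a full implementation.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; exponent forms like "1e3" are not handled.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(1) / 1000,  # milli-units, e.g. "250m" CPU
}

# Floor taken from the reply above; reading "4GB" as 4 GiB is an assumption.
MIN_CPU_CORES = Decimal(2)
MIN_MEMORY_BYTES = Decimal(4) * Decimal(2) ** 30


def parse_quantity(text):
    """Parse a quantity such as '250m', '512Mi' or '2' into a Decimal."""
    text = text.strip()
    # Try two-letter suffixes first so "Mi" is not mistaken for another unit.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def shortfalls(cpu_limit, memory_limit):
    """Return a list of messages for limits below the floor (hypothetical helper)."""
    problems = []
    if parse_quantity(cpu_limit) < MIN_CPU_CORES:
        problems.append(f"CPU limit {cpu_limit} is below {MIN_CPU_CORES} cores")
    if parse_quantity(memory_limit) < MIN_MEMORY_BYTES:
        problems.append(f"memory limit {memory_limit} is below 4Gi")
    return problems


# Example: limits in the range of a typical default policy.
for message in shortfalls("500m", "512Mi"):
    print(message)
```

The limit strings would come from the pod spec or from the namespace's limit policy; how you fetch them depends on your tooling and is left out here.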

If that doesn’t solve it, then these sorts of issues can be very difficult to troubleshoot over Discourse and I’d recommend opening a support case if you haven’t already.

Regards,
Nate


@nathan.whipple Thank you so much for your thoughtful response; it was indeed a resource limit within the namespace that had prevented the followers from initializing. I have removed those limits for now and everything is working great. I don’t know how long it would have taken me to discover that on my own, so I am incredibly grateful!
