Evoke configure follower hangs (journal logs don't say much)

Greetings,

After resolving the original issue (the Authenticator pod failing to authenticate), where the seed config wasn’t being downloaded from the Conjur master node into the OpenShift pods, I have now manually generated the seed file and tried to configure the follower pods.

Unfortunately, I am not able to complete the manual follower configuration: generating the seed file, copying it to the conjur-appliance pods, and running the unpack and configure follower commands.

evoke seed follower $ConjurFollowerAddress > follower-seed.tar

Archive::Tar::PosixHeader has been renamed to Archive::Tar::Minitar::PosixHeader

$ oc cp follower-seed.tar conjur-follower-78967f5f7-qj7fv:/opt/follower-seed.tar

$ oc exec conjur-follower-78967f5f7-qj7fv evoke unpack seed /opt/follower-seed.tar
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.

$ oc exec conjur-follower-78967f5f7-qj7fv evoke configure follower

Hangs…

Thinking it could be a network-related issue, I checked from the pods whether I can reach the master’s required ports:

netcat -v $ConjurMasterAddress 443

Connection to $ConjurMasterAddress 443 port [tcp/https] succeeded!
^C

netcat -v $ConjurMasterAddress 5432

Connection to $ConjurMasterAddress 5432 port [tcp/postgresql] succeeded!
^C

netcat -v $ConjurMasterAddress 1999

Connection to $ConjurMasterAddress 1999 port [tcp/*] succeeded!
^C

As a test, I intentionally blocked port 5432 with a security group rule, and the configure follower command at least timed out. With the port open again, the command just hangs for hours with no response.

Is there any debug/trace log I can enable or check? The pod/journal logs don’t show anything useful so far.

Thanks in advance.

After a very long time (most likely a TCP timeout), it finally failed:

psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I am still looking for the Postgres logs to understand the cause of the issue, but no progress so far on pinpointing where exactly the “evoke configure follower” command is hanging.
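One place I plan to check, assuming the appliance follows the usual Debian-style Postgres layout (that log path is an assumption on my part):

$ oc exec conjur-follower-78967f5f7-qj7fv -- ls -l /var/log/postgresql/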

Hi,

The follower container doesn’t have enough resources assigned to it. In OCP environments, I’ve found that most clusters have a default policy defined at the cluster level that sets resource limits of roughly 500m CPU and 256MB of memory, which isn’t enough for the follower’s Postgres database to start up. The “server closed the connection unexpectedly” error is just a side effect of Postgres slowly trying to replicate the database with no memory or CPU to help it along, until the connection finally gets interrupted by something.

I’d recommend raising your resource limits so that each follower container gets 2 full CPUs and 4GB of memory. I typically deploy a pod with 4 cores and 8GB of memory available so I can deploy a pair of followers. Obviously, if you expect to need more followers, increase the amount of memory/CPU available to the pod accordingly.
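For illustration, a resources stanza along those lines could look like this in the follower container spec (a sketch — the values follow the sizing above, and you should adapt them to your own manifest):

      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi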

Regards,
Nathan

Hi @nathan.whipple, thanks for answering.

I couldn’t find any explicit limit or quota defined:

$ oc get limits --all-namespaces
No resources found.

$ oc get quota --all-namespaces
No resources found.

I have nevertheless increased the requests in the pod definition, and there are now enough resources available in the pod, but the follower configuration is still hanging.
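For reference, this can be done either by editing the deployment YAML or imperatively, e.g. (a sketch — it assumes the deployment and container are both named conjur-follower):

$ oc set resources deployment conjur-follower -c conjur-follower --requests=cpu=2,memory=4Gi --limits=cpu=2,memory=4Gi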

I still haven’t found any log or flag I can enable to follow the details of the configure command. Is that possible?

$ oc exec conjur-follower-78967f5f7-jlss6 'nproc'
2

$ oc exec conjur-follower-78967f5f7-jlss6 'free'
              total        used        free      shared  buff/cache   available
Mem:        8161840     2160752     4068084       19708     1933004     5780464
Swap:             0           0           0


$ oc cp follower-seed.tar conjur-follower-78967f5f7-jlss6:/opt/follower-seed.tar

$ oc exec conjur-follower-78967f5f7-jlss6 evoke unpack seed /opt/follower-seed.tar
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.

$ oc exec conjur-follower-78967f5f7-jlss6 evoke configure follower
...hanging...

I then noticed that the 8_configure_followers.sh script hardcodes the seed file copy destination to /tmp:

configure_follower() {
  local pod_name=$1

  KEYS_COMMAND=""

  printf "Configuring follower %s...\n" $pod_name

  copy_file_to_container $FOLLOWER_SEED "/tmp/follower-seed.tar" "$pod_name"

While the deployment (openshift/conjur-follower.yaml) sets the SEEDFILE_DIR environment variable to /tmp/seedfile:

      - name: SEEDFILE_DIR
        value: /tmp/seedfile

So the /tmp/seedfile/start-follower.sh script was looking for the seed file in the wrong directory:

if [[ -f "$SEEDFILE_DIR/follower-seed.tar" ]]; then

I updated the deployment’s environment variable to /tmp, and now the script executes during the pod’s startup:

      - name: SEEDFILE_DIR
        value: /tmp/
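(The same change can also be applied without editing the manifest — a sketch, assuming the follower is a Deployment named conjur-follower:)

$ oc set env deployment/conjur-follower SEEDFILE_DIR=/tmp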

Now at least I am able to see the follower’s configuration steps in the journal/container logs:

+ exec conjur-plugin-logger etcd
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf cron[45]: (CRON) INFO (pidfile fd = 3)
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf cron[45]: (CRON) INFO (Running @reboot jobs)
Unpacking seed...
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  WEBrick 1.4.2
2019-11-13T03:50:25.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:25.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  WEBrick::HTTPServer#start: pid=51 port=5610
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  WEBrick 1.4.2
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  WEBrick::HTTPServer#start: pid=47 port=5612
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  WEBrick 1.4.2
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  WEBrick::HTTPServer#start: pid=54 port=5611
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.
Configuring follower...
2019-11-13 03:50:29.714 UTC [184] LOG:  database system was shut down at 2019-11-13 03:50:29 UTC

I can now see some replication processes running in the container, but the evoke configure processes never finish, so the readiness probe is still failing on port 443.

root 46 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-seed
root 47 0.2 0.6 208820 55264 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup seed.ru -s webrick -p 5612 -o 127.0.0.1
root 48 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t etcd-proxy
root 49 0.0 0.0 26360 1396 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t etcd
root 50 0.0 0.0 26360 1404 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-health
conjur 51 0.2 0.6 200208 54672 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup health.ru -s webrick -p 5610 -o 127.0.0.1
root 52 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-cluster
root 53 0.0 0.0 26360 1372 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-info
conjur 54 0.2 0.6 200296 54560 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup info.ru -s webrick -p 5611 -o 127.0.0.1
root 59 0.0 0.0 21640 3440 ? S 03:50 0:00 /bin/bash ./run
root 62 0.0 0.0 4400 828 ? S 03:50 0:00 runsv audit
root 63 0.0 0.0 4400 788 ? S 03:50 0:00 runsv main
postgres 75 0.0 0.2 276440 21904 ? S 03:50 0:00 /usr/lib/postgresql/9.4/bin/postgres -c config_file=/etc/postgresql/9.4/audit/postgresql.conf
> root 144 0.0 0.0 18376 3120 ? S 03:50 0:00 /bin/bash /usr/local/bin/evoke configure follower
> root 146 0.1 0.5 217356 47344 ? Sl 03:50 0:01 /opt/conjur/evoke/bin/evoke configure follower
root 153 0.0 0.0 4628 836 ? S 03:50 0:00 sh -c psql "user=* replication=yes host=conj**
root 154 0.0 0.0 88408 6184 ? S 03:50 0:00 /usr/lib/postgresql/9.4/bin/psql user=* repl**
postgres 185 0.0 0.0 276440 3548 ? Ss 03:50 0:00 postgres: checkpointer process
postgres 186 0.0 0.0 276440 5012 ? Ss 03:50 0:00 postgres: writer process
postgres 187 0.0 0.0 276440 3548 ? Ss 03:50 0:00 postgres: wal writer process
postgres 188 0.0 0.0 276844 5756 ? Ss 03:50 0:00 postgres: autovacuum launcher process
postgres 189 0.0 0.0 131456 2484 ? Ss 03:50 0:00 postgres: stats collector process
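To check whether the replication connection (PIDs 153/154 above) is actually making progress, one option is to query pg_stat_replication on the master — a sketch, assuming a local superuser psql session on the master appliance:

$ psql -U postgres -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"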

netcat -v localhost 443

netcat: connect to localhost port 443 (tcp) failed: Connection refused
netcat: connect to localhost port 443 (tcp) failed: Connection refused
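If the readiness probe targets the appliance’s /health endpoint on port 443 (as I believe the stock manifests do), the same check can be run by hand — with nginx down it fails just like the netcat test:

$ oc exec conjur-follower-78967f5f7-jlss6 -- curl -sk https://localhost/health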

From one of the running pods:

/usr/bin/sv status /etc/service/*

down: /etc/service/cluster: 354s; run: log: (pid 50) 354s
down: /etc/service/conjur: 354s
run: /etc/service/cron: (pid 44) 354s
down: /etc/service/etcd: 354s; run: log: (pid 51) 354s
down: /etc/service/etcd-proxy: 354s; run: log: (pid 47) 354s
down: /etc/service/evoke: 354s
run: /etc/service/health: (pid 55) 354s; run: log: (pid 46) 354s
run: /etc/service/info: (pid 54) 354s; run: log: (pid 53) 354s
down: /etc/service/nginx: 354s
run: /etc/service/pg: (pid 45) 354s
run: /etc/service/seed: (pid 49) 354s; run: log: (pid 48) 354s
down: /etc/service/sshd: 354s
down: /etc/service/syslog-forwarder: 354s
run: /etc/service/unlocker: (pid 52) 354s
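Since each runit service’s log companion pipes to /usr/bin/logger (i.e. syslog), and the startup script now runs in the container’s foreground, one way to hunt for errors from the services stuck in “down” is to filter the container logs (pod name as used earlier):

$ oc logs conjur-follower-78967f5f7-jlss6 | grep -iE 'error|fatal|fail'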

Unfortunately, I wasn’t able to fix this one yet. I have commented out some unnecessary steps (like pushing the Docker image again), and I am able to run the “start” script from https://github.com/cyberark/kubernetes-conjur-deploy; the pod stays running fine, but without completing “evoke configure follower”. I will hopefully find some time to troubleshoot again, but any inputs are welcome :wink:

Hey Jose,

Any luck with this yet?

Hi Jake,

Thanks for checking in. Unfortunately no; there aren’t many debug/trace logs I could find that would help make progress with the troubleshooting.

Similar to: Configure standby failing

Cheers.

Hey,

I wanted to check whether you have found a solution for this.

Regards

Hi @hbindra, sorry for the delay. No, unfortunately I was not able to find a solution and stopped working on it. Curious to know if anyone else has faced a similar issue and how they resolved it.

Thanks.