Evoke configure follower hangs (journal logs don't say much)

Greetings,

After resolving the original issue (the Authenticator pod failing to authenticate), where the seed config wasn’t being downloaded from the Conjur master node into the OpenShift pods, I have now manually generated the seed file and tried to configure the follower pods.

Unfortunately, I am not able to complete the manual follower configuration: generating the seed file, copying it to the conjur-appliance pods, and running the unpack and configure follower commands.

evoke seed follower $ConjurFollowerAddress > follower-seed.tar

Archive::Tar::PosixHeader has been renamed to Archive::Tar::Minitar::PosixHeader

$ oc cp follower-seed.tar conjur-follower-78967f5f7-qj7fv:/opt/follower-seed.tar

$ oc exec conjur-follower-78967f5f7-qj7fv evoke unpack seed /opt/follower-seed.tar
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.

$ oc exec conjur-follower-78967f5f7-qj7fv evoke configure follower

Hangs…

Thinking it could be a network-related issue, I checked from the pods whether I can reach the master’s required ports:

netcat -v $ConjurMasterAddress 443

Connection to $ConjurMasterAddress 443 port [tcp/https] succeeded!
^C

netcat -v $ConjurMasterAddress 5432

Connection to $ConjurMasterAddress 5432 port [tcp/postgresql] succeeded!
^C

netcat -v $ConjurMasterAddress 1999

Connection to $ConjurMasterAddress 1999 port [tcp/*] succeeded!
^C

As a test, I intentionally blocked port 5432 with a security group rule, and the configure follower command at least timed out. With the port open again, the command just hangs for hours with no response.

Is there any debug/trace log I can enable or check? The pod/journal logs don’t show anything useful so far.

Thanks in advance.

After a very long time (most likely a TCP timeout), it finally failed:

psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I am still looking for the Postgres logs to understand the cause of the issue, but no progress so far on pinpointing where exactly the “evoke configure follower” command is hanging.
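One place I plan to check, assuming the appliance follows the usual Debian-style Postgres layout (that log path is an assumption on my part):

$ oc exec conjur-follower-78967f5f7-qj7fv -- ls -l /var/log/postgresql/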

Hi,

The follower container doesn’t have enough resources assigned to it. In OCP environments, I’ve found that most clusters have a default policy defined at the cluster level that sets resource limits of roughly 500m CPU and 256MB of memory, which isn’t enough for the follower’s Postgres database to start up. The “server closed the connection unexpectedly” error is just a side effect of Postgres slowly trying to replicate the database with no memory or CPU to help it along, until the connection finally gets interrupted by something.

I’d recommend raising your resource limits so that each follower container gets 2 full CPUs and 4GB of memory. I typically deploy a pod with 4 cores and 8GB of memory available so I can deploy a pair of followers. Obviously, if you expect to need more followers, increase the amount of memory/CPU available to the pod accordingly.
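For illustration, a resources stanza along those lines could look like this in the follower container spec (a sketch — the values follow the sizing above, and you should adapt them to your own manifest):

      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi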

Regards,
Nathan

Hi @nathan.whipple, thanks for answering.

I couldn’t find any explicit limit or quota defined:

$ oc get limits --all-namespaces
No resources found.

$ oc get quota --all-namespaces
No resources found.

I have nevertheless increased the requests in the pod definition, and there are now enough resources available in the pod, but the follower configuration is still hanging.
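For reference, this can be done either by editing the deployment YAML or imperatively, e.g. (a sketch — it assumes the deployment and container are both named conjur-follower):

$ oc set resources deployment conjur-follower -c conjur-follower --requests=cpu=2,memory=4Gi --limits=cpu=2,memory=4Gi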

I still haven’t found any log or flag I can enable to follow the details of the configure command. Is that possible?

$ oc exec conjur-follower-78967f5f7-jlss6 'nproc'
2

$ oc exec conjur-follower-78967f5f7-jlss6 'free'
              total        used        free      shared  buff/cache   available
Mem:        8161840     2160752     4068084       19708     1933004     5780464
Swap:             0           0           0


$ oc cp follower-seed.tar conjur-follower-78967f5f7-jlss6:/opt/follower-seed.tar

$ oc exec conjur-follower-78967f5f7-jlss6 evoke unpack seed /opt/follower-seed.tar
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.

$ oc exec conjur-follower-78967f5f7-jlss6 evoke configure follower
...hanging...

I then noticed that the 8_configure_followers.sh script hardcodes the seed file copy destination to /tmp:

configure_follower() {
  local pod_name=$1

  KEYS_COMMAND=""

  printf "Configuring follower %s...\n" $pod_name

  copy_file_to_container $FOLLOWER_SEED "/tmp/follower-seed.tar" "$pod_name"

While the deployment (openshift/conjur-follower.yaml) sets the SEEDFILE_DIR environment variable to /tmp/seedfile:

      - name: SEEDFILE_DIR
        value: /tmp/seedfile

So the /tmp/seedfile/start-follower.sh script was looking for the seed file in the wrong directory:

if [[ -f "$SEEDFILE_DIR/follower-seed.tar" ]]; then

I updated the deployment’s environment variable to /tmp, and now the script executes during the pod’s startup:

      - name: SEEDFILE_DIR
        value: /tmp/
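(The same change can also be applied without editing the manifest — a sketch, assuming the follower is a Deployment named conjur-follower:)

$ oc set env deployment/conjur-follower SEEDFILE_DIR=/tmp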

Now at least I am able to see the follower’s configuration steps in the journal/container logs:

+ exec conjur-plugin-logger etcd
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf cron[45]: (CRON) INFO (pidfile fd = 3)
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf cron[45]: (CRON) INFO (Running @reboot jobs)
Unpacking seed...
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  WEBrick 1.4.2
2019-11-13T03:50:25.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:25.000+00:00 conjur-follower-797555578f-6b4vf evoke-health: [2019-11-13 03:50:25] INFO  WEBrick::HTTPServer#start: pid=51 port=5610
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  WEBrick 1.4.2
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-seed: [2019-11-13 03:50:26] INFO  WEBrick::HTTPServer#start: pid=47 port=5612
2019-11-13T03:50:21.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  WEBrick 1.4.2
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  ruby 2.5.5 (2019-03-15) [x86_64-linux-gnu]
2019-11-13T03:50:26.000+00:00 conjur-follower-797555578f-6b4vf evoke-info: [2019-11-13 03:50:26] INFO  WEBrick::HTTPServer#start: pid=54 port=5611
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.
Configuring follower...
2019-11-13 03:50:29.714 UTC [184] LOG:  database system was shut down at 2019-11-13 03:50:29 UTC

I can now see some replication processes running in the container, but the evoke configure processes never finish, so the readiness probe is still failing on port 443.

root 46 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-seed
root 47 0.2 0.6 208820 55264 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup seed.ru -s webrick -p 5612 -o 127.0.0.1
root 48 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t etcd-proxy
root 49 0.0 0.0 26360 1396 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t etcd
root 50 0.0 0.0 26360 1404 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-health
conjur 51 0.2 0.6 200208 54672 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup health.ru -s webrick -p 5610 -o 127.0.0.1
root 52 0.0 0.0 26360 1456 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-cluster
root 53 0.0 0.0 26360 1372 ? S 03:50 0:00 /usr/bin/logger -p local0 info -t evoke-info
conjur 54 0.2 0.6 200296 54560 ? Sl 03:50 0:02 ruby2.5 /opt/conjur/evoke/vendor/bundle/ruby/2.5.0/bin/rackup info.ru -s webrick -p 5611 -o 127.0.0.1
root 59 0.0 0.0 21640 3440 ? S 03:50 0:00 /bin/bash ./run
root 62 0.0 0.0 4400 828 ? S 03:50 0:00 runsv audit
root 63 0.0 0.0 4400 788 ? S 03:50 0:00 runsv main
postgres 75 0.0 0.2 276440 21904 ? S 03:50 0:00 /usr/lib/postgresql/9.4/bin/postgres -c config_file=/etc/postgresql/9.4/audit/postgresql.conf
> root 144 0.0 0.0 18376 3120 ? S 03:50 0:00 /bin/bash /usr/local/bin/evoke configure follower
> root 146 0.1 0.5 217356 47344 ? Sl 03:50 0:01 /opt/conjur/evoke/bin/evoke configure follower
root 153 0.0 0.0 4628 836 ? S 03:50 0:00 sh -c psql "user=* replication=yes host=conj**
root 154 0.0 0.0 88408 6184 ? S 03:50 0:00 /usr/lib/postgresql/9.4/bin/psql user=* repl**
postgres 185 0.0 0.0 276440 3548 ? Ss 03:50 0:00 postgres: checkpointer process
postgres 186 0.0 0.0 276440 5012 ? Ss 03:50 0:00 postgres: writer process
postgres 187 0.0 0.0 276440 3548 ? Ss 03:50 0:00 postgres: wal writer process
postgres 188 0.0 0.0 276844 5756 ? Ss 03:50 0:00 postgres: autovacuum launcher process
postgres 189 0.0 0.0 131456 2484 ? Ss 03:50 0:00 postgres: stats collector process
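To check whether the replication connection (PIDs 153/154 above) is actually making progress, one option is to query pg_stat_replication on the master — a sketch, assuming a local superuser psql session on the master appliance:

$ psql -U postgres -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"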

netcat -v localhost 443

netcat: connect to localhost port 443 (tcp) failed: Connection refused
netcat: connect to localhost port 443 (tcp) failed: Connection refused
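If the readiness probe targets the appliance’s /health endpoint on port 443 (as I believe the stock manifests do), the same check can be run by hand — with nginx down it fails just like the netcat test:

$ oc exec conjur-follower-78967f5f7-jlss6 -- curl -sk https://localhost/health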

From one of the running pods:

/usr/bin/sv status /etc/service/*

down: /etc/service/cluster: 354s; run: log: (pid 50) 354s
down: /etc/service/conjur: 354s
run: /etc/service/cron: (pid 44) 354s
down: /etc/service/etcd: 354s; run: log: (pid 51) 354s
down: /etc/service/etcd-proxy: 354s; run: log: (pid 47) 354s
down: /etc/service/evoke: 354s
run: /etc/service/health: (pid 55) 354s; run: log: (pid 46) 354s
run: /etc/service/info: (pid 54) 354s; run: log: (pid 53) 354s
down: /etc/service/nginx: 354s
run: /etc/service/pg: (pid 45) 354s
run: /etc/service/seed: (pid 49) 354s; run: log: (pid 48) 354s
down: /etc/service/sshd: 354s
down: /etc/service/syslog-forwarder: 354s
run: /etc/service/unlocker: (pid 52) 354s
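Since each runit service’s log companion pipes to /usr/bin/logger (i.e. syslog), and the startup script now runs in the container’s foreground, one way to hunt for errors from the services stuck in “down” is to filter the container logs (pod name as used earlier):

$ oc logs conjur-follower-78967f5f7-jlss6 | grep -iE 'error|fatal|fail'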

Unfortunately, I wasn’t able to fix this one yet. I have commented out some unnecessary steps (like pushing the Docker image again), and I am able to run the “start” script from https://github.com/cyberark/kubernetes-conjur-deploy; the pod stays running fine, but without completing “evoke configure follower”. I will hopefully find some time to troubleshoot again, but any inputs are welcome :wink:

Hey Jose,

Any luck with this yet?

Hi Jake,

Thanks for checking in. Unfortunately no; there aren’t many debug/trace logs I could find that would help make progress with the troubleshooting.

Similar to: Configure standby failing

Cheers.

Hey,

I wanted to check whether you have found a solution for this.

Regards

Hi @hbindra, sorry for the delay. No, unfortunately I was not able to find a solution and stopped working on it. Curious to know if anyone else has faced a similar issue and how they resolved it.

Thanks.