After resolving the original issue (the Authenticator pod failing to authenticate), where the seed config wasn't being downloaded into the OpenShift pods from the Conjur master node, I have now manually generated the seed file and tried to configure the follower pods.
Unfortunately, I am not able to complete the manual follower configuration: generating the seed file, copying it to the conjur-appliance pods, and running the unpack and configure follower commands.
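For reference, here is a sketch of the first two steps as I'm running them (the pod name and follower DNS name are from my environment, and the seed-generation syntax is my reading of the evoke docs). On the master:

$ evoke seed follower conjur-follower.dtt-iam.xyz > follower-seed.tar

Then copying the seed into the follower pod:

$ oc cp follower-seed.tar conjur-follower-78967f5f7-qj7fv:/opt/follower-seed.tar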
$ oc exec conjur-follower-78967f5f7-qj7fv -- evoke unpack seed /opt/follower-seed.tar
tar: Removing leading `/' from member names
Seed file was successfully unpacked.
Run 'evoke configure follower' to configure this machine.
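Then the configure step, which is the part that never completes:

$ oc exec conjur-follower-78967f5f7-qj7fv -- evoke configure follower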
Thinking it could be a network-related issue, I checked from the pods whether I can reach the master's required ports:
netcat -v $ConjurMasterAddress 443
Connection to $ConjurMasterAddress 443 port [tcp/https] succeeded!
^C
netcat -v $ConjurMasterAddress 5432
Connection to $ConjurMasterAddress 5432 port [tcp/postgresql] succeeded!
^C
netcat -v $ConjurMasterAddress 1999
Connection to $ConjurMasterAddress 1999 port [tcp/*] succeeded!
^C
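As an additional check beyond raw TCP reachability, the master's health endpoint can be queried directly (assuming the standard appliance /health route; -k because the follower pod doesn't trust the master certificate yet):

$ curl -k https://$ConjurMasterAddress/health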
As a test, I intentionally blocked port 5432 in a security group rule; in that case the configure follower command at least timed out. With the port open again, though, the command just hangs for hours with no response.
Is there any debug/trace log I can enable or check? The pod/journal logs don't show anything useful so far.
After a very long time (most likely a TCP timeout), it finally failed:
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I am still looking for the Postgres log to understand the cause of the issue, but so far no progress on pinpointing where exactly the "evoke configure follower" command is hanging.
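In case it helps anyone retracing this, these are the locations I've been checking inside the container; the paths are assumptions based on a standard Ubuntu-style appliance layout, not confirmed:

$ oc exec conjur-follower-78967f5f7-qj7fv -- ls /var/log/postgresql
$ oc exec conjur-follower-78967f5f7-qj7fv -- ls /var/log/conjur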
The follower container doesn't have enough resources assigned to it. In OCP environments, I've found that most clusters have a default policy defined at the cluster level that sets resource limits of roughly 500m CPU and 256Mi of memory. That isn't enough for the follower's PG database to start up. The "server closed the connection unexpectedly" error is just a side effect of PG slowly trying to replicate the database with no memory or CPU to help it along, until the connection finally gets interrupted by something. I'd recommend raising the resource limits for the pod so that each follower container gets 2 full CPUs and 4GB of memory. I typically deploy a pod with 4 cores and 8GB of memory available so I can deploy a pair of followers. Obviously, if you expect to need more followers, increase the amount of memory/CPU available to the pod accordingly.
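As a concrete starting point, assuming the follower runs as a deployment named conjur-follower (adjust the resource name and kind to match your manifest), the limits can be raised with something like:

$ oc set resources deployment conjur-follower --requests=cpu=2,memory=4Gi --limits=cpu=2,memory=4Gi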
I am now able to see some replication processes running in the container, but the evoke configure process is not finishing, so the readiness probe is still failing on port 443.
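For anyone wanting to reproduce the check, this is roughly how I'm inspecting the container (the pod name is from my environment; the local /health route is an assumption based on the appliance docs):

$ oc exec conjur-follower-78967f5f7-qj7fv -- ps aux | grep -E 'evoke|postgres'
$ oc exec conjur-follower-78967f5f7-qj7fv -- curl -sk https://localhost/health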
Unfortunately, I wasn't able to fix this one yet. I have commented out some unnecessary steps (such as pushing the Docker image again) and I am able to run the "start" script from https://github.com/cyberark/kubernetes-conjur-deploy; the pod stays running fine, but without completing "evoke configure follower". I will hopefully find some time to troubleshoot again, but any input is welcome.
Thanks for checking. Unfortunately no, there aren't many debug/trace logs I could find that would help make progress with the troubleshooting.
Hi @hbindra, sorry for the delay. No, unfortunately I was not able to determine the solution and stopped working on it. I'm curious to know whether anyone else has faced a similar issue and how they resolved it.