We have a multi-master DAP cluster with one active master and two standbys. Several followers are running within Kubernetes clusters. Each master has its own hostname:
An alias record, “ca-dap-dev.neovest.com”, resolves to the hostname of the active master. The intent is that standbys and followers, as well as users, can always use “ca-dap-dev.neovest.com” regardless of which of the 3 masters is currently active. At installation time, the alias pointed to “ut-ca-master1.neovest.com”. Now, after upgrading the whole cluster to conjur-appliance 11.4.0, it has been changed to point to the currently active master: “ut-ca-master2.neovest.com”.
While this works great from a user perspective (I can hit “ca-dap-dev.neovest.com” in the browser and access the UI serviced by master2), the followers and standbys appear to be directly referencing “ut-ca-master1.neovest.com”, which has made it impossible to configure master1 as a standby and is causing failures within the followers.
From master2, I have copied the seed to master1 according to the documentation for configuring an HA cluster:
[cyberark@ut-ca-master2 ~]$ docker exec dap evoke seed standby ut-ca-master1.neovest.com | ssh email@example.com "docker exec -i dap evoke unpack seed -"
On master1, I then attempted to configure it as a standby. This does not work, as master1 tries to replicate from itself instead of using the load balancer hostname to find the active master:
[cyberark@ut-ca-master1 ~]$ docker exec dap evoke configure standby
psql: could not connect to server: No route to host
    Is the server running on host "ut-ca-master1.neovest.com" (10.2.0.156) and accepting
    TCP/IP connections on port 5432?
The followers that were already running in Kubernetes have suddenly started to fail, as they seem to try to connect to master1 directly:
$ kubectl logs -f dap-follower-7649c48fd8-hrzpq
<131>1 2020-07-18T16:23:17.000+00:00 dap-follower-7649c48fd8-hrzpq postgres 21989 - [meta sequenceId="97"] [3-1] FATAL: could not connect to the primary server: could not connect to server: Connection refused
<131>1 2020-07-18T16:23:17.000+00:00 dap-follower-7649c48fd8-hrzpq postgres 21989 - [meta sequenceId="98"] [3-2] Is the server running on host "ut-ca-master1.neovest.com" (10.2.0.156) and accepting
<131>1 2020-07-18T16:23:17.000+00:00 dap-follower-7649c48fd8-hrzpq postgres 21989 - [meta sequenceId="99"] [3-3] TCP/IP connections on port 5432?
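For anyone trying to reproduce this, a quick way to confirm which upstream host a follower is actually targeting is to pull the hostnames out of its log output. This is only a minimal sketch that assumes the PostgreSQL error format shown above; the function name upstream_hosts is my own and is not part of DAP or kubectl:

```shell
# Hypothetical helper (not a DAP or kubectl command): read log text on stdin
# and print each distinct host named in a
# 'Is the server running on host "..."' line.
upstream_hosts() {
  grep -o 'running on host "[^"]*"' | cut -d'"' -f2 | sort -u
}

# Usage, with the pod name from the log above:
#   kubectl logs dap-follower-7649c48fd8-hrzpq | upstream_hosts
```

If the alias were being honored, I would expect this to list the alias or the currently active master rather than master1.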
This is strange, as the deployment manifest explicitly uses “ca-dap-dev.neovest.com”:
initContainers:
- env:
  - name: CONJUR_SEED_FILE_URL
    value: https://ca-dap-dev.neovest.com/configuration/neovest/seed/follower
Seeking help on how to configure standbys and followers to leverage the load balancer hostname to remedy these issues. Thanks in advance for any guidance. @jgarabedian