Conjur Standby Nodes rebuild issue

Hello Everyone,

Recently our servers were restarted for Linux security patching. After the restart, one of the standby containers stopped working.
Upon investigation, we observed that the pg service was not running, even though all of the PG settings were as expected.

We rebuilt the container on the standby node; below are our observations.

Point to note: we added a fourth server to the existing 3 servers to cover the data centre recovery scenario. This server takes the backups but is not part of the cluster.

  1. Standby node has been rebuilt.
  2. On the master, generated the seed file for the standby node
  3. On the standby node, updated the pg settings (max_connections and wal_senders), initiated evoke configure standby
  4. On the standby node, ran the cluster enrolment command (before this, the node was removed from the cluster information on the master); the rough command sequence is sketched after this list.
  5. After this, the PG settings rolled back to the defaults, so we re-applied the changes and restarted the pg service.
  6. In /opt/conjur/etc/conjur.conf we found that the node used for emergency (DR) standby had been added; as a result we are getting the error below, due to inconsistent configurations across the cluster.
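
For reference, the command sequence on our side was roughly the following (the container name conjur-appliance and the hostnames are placeholders, and the exact evoke cluster enroll flags depend on the appliance version, so check evoke cluster enroll --help):

# On the master: generate a seed for the rebuilt standby
docker exec conjur-appliance evoke seed standby standby2.example.com master.example.com > /tmp/standby-seed.tar

# Copy the seed file to the standby host and into the container, then unpack it
# and configure the node as a standby (in step 3 we had already updated the pg settings separately)
docker exec conjur-appliance evoke unpack seed /tmp/standby-seed.tar
docker exec conjur-appliance evoke configure standby

# Re-enrol the node in the auto-failover cluster (flags vary by version)
docker exec conjur-appliance evoke cluster enroll -n standby2.example.com -m master.example.com conjur-cluster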

We are unsure where the configuration below is fetched from; it is impacting the cluster as a whole.

root@ceb55c9c2d46:/opt/conjur/etc# cat conjur.conf
CONJUR_ACCOUNT=***
ENABLED=true
CONJUR_MASTER_HOST=:443
CLUSTER_NAME=

CLUSTER_MACHINE_NAME=***
CLUSTER_MACHINE_ADDRESS=**
ETCD_INITIAL_CLUSTER=additionalhost=http://additionalhost:2380,***
ETCD_INITIAL_CLUSTER_STATE=new
DATABASE_URL=postgres:///conjur

Container logs :

<134>1 2021-05-31T06:15:11.000+00:00 3bb3f52ec1d8 etcd - - [meta sequenceId="2095"] 2021-05-31 06:15:11.544575 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[d4e897d5e29dbf11]=c16fccdd7461ad8d, local=38604ee41b6337a5)

  1. On the second standby node, this was not the case.
  2. On the master node, below is the setting

root@6eb130b51c68:/opt/conjur/etc# cat conjur.conf
CONJUR_ACCOUNT=**
ENABLED=true
CLUSTER_NAME=***
CLUSTER_MACHINE_NAME=***
CLUSTER_MACHINE_ADDRESS=***
ETCD_INITIAL_CLUSTER=****
ETCD_INITIAL_CLUSTER_STATE=new
DATABASE_URL=postgres:///conjur

/etc/chef/solo.json has the additional host added into the cluster member list as well.

Thanks and regards,
Gautam.

@nathan.whipple

Hi Gautam,

This might be difficult to troubleshoot in this forum, so if you need a quicker resolution it might be best to open a support case.

To begin, the recommended way to modify the DB configuration is through a JSON file passed to the evoke configure command. Otherwise, evoke configure will overwrite the postgresql.conf file with the default values. You can also manually edit the postgresql.conf file of an already configured container, but we don't recommend that approach as it is more manual, harder to automate, and easier for things to get missed.

Second, if you modify the max_connections value on the master, then that value must be updated on all of the standbys and followers as well. That said, you should only need to change this from the default of 100 if you expect the number of downstream replicas (standbys + followers) to exceed ~80. If you don't expect that many downstream replicas, then just modify the max_wal_senders value.

I have a simple bash script I use for my lab environments and I create the JSON file using this snippet:

cat > /tmp/pg_config.json << EOF
{
        "postgresql": {
                "config": {
                        "max_wal_senders": 50
                }
        }
}
EOF

Then feed that to your evoke configure or evoke restore commands with the -j switch and the path to the file, e.g.:

docker exec conjur-appliance evoke configure standby -j /tmp/pg_config.json

The cluster details in conjur.conf are populated by the evoke cluster enroll command. In the scenario you describe, I don't believe that command is working correctly/completely, which is why some values are missing. Your DR standby that is not participating in auto-failover should not be part of the auto-failover policy. I haven't tested the behavior when re-enrolling a previously evicted node in a cluster where more members are listed in the policy than are registered. I'd follow the "happy path" here first and see if that sorts out the problem.
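
For reference, the auto-failover policy should list only the nodes that actually participate in auto-failover (the master and its auto-failover standbys); the DR/backup node stays out of it. A rough sketch of what that policy body typically looks like, using placeholder host IDs and cluster name (check the docs for your version for the exact format):

- !policy
  id: conjur/cluster/conjur-cluster
  body:
    - !layer
    - &hosts
      - !host master.example.com
      - !host standby1.example.com
      - !host standby2.example.com
    - !grant
      role: !layer
      member: *hosts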

Hopefully all that gets you sorted and on your way. Like I said, if this is urgent, please open a support case. If not, feel free to post back with an update and we’ll keep working at it.

Regards,
Nate

Hello @nathan.whipple

I’m replying here and not opening a new thread since my questions are closely related to the answer you gave to gautamkanithi.

As you said (and I've confirmed the same behaviour in the tests I did in my lab environment), changing the max_connections parameter is not a trivial task, as you also need to change that value on all standbys and followers.

Changing max_wal_senders, on the other hand, is much easier, as you can just change that value in /etc/postgresql/{postgresql_ver}/main/postgresql.conf in the master container and do a sv restart pg.
You should also do the same on the standbys, in case there is an auto-failover event and one of the standbys becomes the new master. You can avoid changing the followers though, which is a big win, especially if you have many Kubernetes clusters connected to DAP.
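
For example, this is roughly what I do in my lab from inside the master container (the target value is just from my environment, and if the parameter line is commented out in your postgresql.conf the sed pattern won't match, in which case edit the file by hand):

# bump max_wal_senders in the running appliance's postgresql.conf
sed -i 's/^max_wal_senders *=.*/max_wal_senders = 50/' /etc/postgresql/*/main/postgresql.conf

# restart the pg service so the new value takes effect
sv restart pg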

I've read the PostgreSQL documentation on max_wal_senders, and it says:

WAL sender processes count towards the total number of connections, so the parameter cannot be set higher than max_connections.

But it is not clear how much leeway I should leave between the two parameters. I did some tests in my lab and it seems the gap must surely be greater than 4.
What I get from your answer is that the leeway should be even bigger: ~20, is that correct?
Can you explain the rationale behind this number to me, or point me to some documentation (either from CyberArk DAP or PostgreSQL)?
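
For what it's worth, in my tests I checked the configured limits and the connections actually in use from inside the master container with something like this (the connection string is just what worked in my lab; details may differ in your environment):

# configured limits
psql postgres:///conjur -c "SELECT name, setting FROM pg_settings WHERE name IN ('max_connections', 'max_wal_senders');"

# connections currently in use (Conjur's internal services count here too)
psql postgres:///conjur -c "SELECT count(*) FROM pg_stat_activity;"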

Another question: since Conjur Secrets Manager (this is the new name, right?) is used in BIG enterprises, which usually have many clusters, wouldn't it be better for the Conjur Secrets Manager documentation to recommend a configuration that enables the maximum number of max_wal_senders from the get-go (which should be hard-capped by PostgreSQL at 200), to avoid an expensive reconfiguration in the future?

Something like:

{
    "postgresql" : {
        "config" : {
            "max_connections" : 300,
            "max_wal_senders" : 200
        }
    }
}

AFAIK there is no performance impact from raising max_wal_senders beyond the first one (but I may be wrong).

Thanks in advance,
Giorgio

Hi @giorgio,

As you've discovered, a delta of 4 is too small. The value of 20 I used in my examples is conservative, and it's directly related to internal services: things like the seed-generation endpoint, for example, temporarily consume a connection. I prefer leaving this at a conservative value to absorb fluctuations in internal connections and to leave some room for growth if/when additional services are created that connect to the database.
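
To make the arithmetic concrete (the numbers below are purely illustrative): with the default max_connections of 100 and roughly 20 connections kept aside for internal services, you have room for about 80 WAL senders. If you expected, say, 150 downstream replicas, you would raise both values while keeping that headroom, e.g.:

cat > /tmp/pg_config.json << EOF
{
        "postgresql": {
                "config": {
                        "max_connections": 170,
                        "max_wal_senders": 150
                }
        }
}
EOF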

I agree that our OOTB default should be a bit more in line with larger deployments than the default postgres value. I suggest opening an enhancement request for this. I'll also raise it with engineering, and with the documentation team regarding how to document changing it. Thanks for your feedback!

Regards,
Nathan