There is a reference architecture diagram, and in it all three DAP servers (the master and both standbys) are in the same data center.
Q: What will happen if that entire DC goes down? I don't see any DAP server (standby) in the 2nd DC (e.g., a DR DC). Please let me know what happens when the entire primary data center is down. Reference architecture: DAP Deployment Overview
Hey @nimal,
What will happen is that you will no longer be able to write changes to Conjur. That includes rotating secrets, loading policy, and writing audit data. However, the followers located in the 2nd DC will continue to serve application requests for secrets, buffering audit data while they wait for the Master to come back online or for a Standby to be promoted to Master.
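As a quick way to see this from DC2, you can poll the appliance health endpoint on the followers while the master is unreachable. This is only a minimal sketch: the hostnames and CA path are hypothetical, and the exact JSON fields returned by /health can vary by version, so treat the "ok" key as an assumption to verify against your appliance.

```python
import requests

# Hypothetical hostnames; replace with your DC2 follower addresses.
FOLLOWERS = ["follower1.dc2.example.com", "follower2.dc2.example.com"]

for host in FOLLOWERS:
    try:
        # The appliance exposes a /health endpoint; here we assume a JSON body
        # with an "ok" boolean -- verify the fields against your DAP version.
        resp = requests.get(f"https://{host}/health", timeout=5,
                            verify="/path/to/appliance-ca.pem")
        healthy = resp.status_code == 200 and resp.json().get("ok", False)
    except (requests.RequestException, ValueError):
        healthy = False
    status = "healthy, still serving reads" if healthy else "unreachable or unhealthy"
    print(f"{host}: {status}")
```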
My recommendation and the best practice, even though it is not reflected in the DAP Deployment Overview architecture, would be to place a 3rd Standby (asynchronous) with the followers in the 2nd DC. Keep in mind, though, that when the 1st DC goes down, the Standby in DC2 will not be automatically promoted to Master, because a quorum cannot be reached under Raft consensus.
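The quorum arithmetic behind that statement is easy to sanity-check. A minimal sketch, assuming all four nodes (Master + 2 Standbys in DC1, plus the async Standby in DC2) are enrolled as voting members and the usual Raft majority rule of floor(n/2) + 1 applies:

```python
def majority(cluster_size: int) -> int:
    """Minimum number of voting members Raft needs to elect a leader."""
    return cluster_size // 2 + 1

# Master + 2 Standbys in DC1, plus the recommended async Standby in DC2.
cluster_size = 4
surviving_after_dc1_loss = 1  # only the DC2 standby is left

print(majority(cluster_size))                              # 3
print(surviving_after_dc1_loss >= majority(cluster_size))  # False -> no automatic promotion
```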
Thanks, @joe.garcia …Appreciated
My customer is small to medium-sized and wants to secure only their Puppet environment at the moment. They already have Core PAS. The customer feels this is a fairly complex solution, since there are 3 DAP servers, 4 Followers, and 2 Synchronizers, plus load balancers for DAP and the Followers (7 servers in total plus LBs). Is this the minimum recommended deployment model for DAP? Are there any alternatives used elsewhere?
Also, my customer wants to secure the DAP keys with an HSM (nCipher). I believe they would need to procure 7 HSM client licenses. Is there any workaround to reduce the number of HSM client licenses?
Thanks in Advance!!
@joe.garcia don't we need to have at least two standbys running for the quorum rule to kick in and a master to be promoted?
@nimal Please take a look at this thread where we talk in detail about encryption. There were some options mentioned by @nathan.whipple which could be useful for you. Thank you
Hello @joe.garcia ,
I’ve tried to test your recommendation and this is what I’ve found:
If I use an autofailover setup with a Master and a Sync Standby in DC1 and an Async (not potential) Standby in DC2, I get the following results:
- if the master dies the autofailover works as expected
- if the Async (not potential) standby in DC2 dies => no impact
- if the Sync standby dies, I can no longer execute writes, as they time out (working as intended: for sync autofailover to work correctly you need a sync standby + a potential standby)
- if DC1 goes down, no autofailover happens (also working as intended, per the rules of the Raft consensus algorithm). I tried to promote the Async standby to Master, but in the end that does me no good, as the cluster service keeps labeling the newly promoted master as part of an unhealthy cluster; per Raft consensus you need at least 2 nodes up and running to reach quorum. Maybe there is some trick I'm unaware of to remove a node from the autofailover cluster when there is no quorum…
If I misunderstood what you said and what you were really implying was the following:
DC1: Master + Sync Standby + Potential Standby
DC2: Async (not potential) Standby
(all 4 nodes in an autofailover cluster)
Using 4 nodes still gives you a failure tolerance of just 1 node, but you have added one more node, so you end up in a worse situation than with the 3-node setup; I don't know if it's a good idea… (see the quick check below)
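A quick check of the tolerance numbers backs this up. This sketch assumes every enrolled node is a voting member and uses the same floor(n/2) + 1 majority rule:

```python
def failure_tolerance(cluster_size: int) -> int:
    # Nodes you can lose while a Raft majority (floor(n/2) + 1) still remains.
    return cluster_size - (cluster_size // 2 + 1)

for n in (3, 4, 5):
    print(f"{n} nodes -> tolerates {failure_tolerance(n)} node failure(s)")
# 3 nodes -> tolerates 1 node failure(s)
# 4 nodes -> tolerates 1 node failure(s)  <- the 4th node buys no extra tolerance
# 5 nodes -> tolerates 2 node failure(s)
```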
The only working setup I could think of is the following:
DC1: Master + Standby Sync + Standby Potential (in an autofailover cluster)
DC2: Async (not potential) Standby NOT joined to the autofailover cluster
In this case:
- if the master goes down => autofailover elects another master
- if the Sync standby goes down => the potential takes its place and you still can execute writes
- if the Potential standby goes down => no impact
- if the Async (not potential) standby in DC2 goes down => no impact
- if DC1 goes down, you don't get any autofailover, but at least you can manually promote the Async (not potential) Standby to Master (a monitoring sketch for this case follows below).
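To make that last case actionable, here is a minimal monitoring sketch for the manual-promotion scenario. The hostnames, CA path, and thresholds are hypothetical, and the promotion step itself (running `evoke role promote` on the DC2 standby, as I understand the manual failover docs; verify against your DAP version) is deliberately left as a comment rather than automated.

```python
import time
import requests

MASTER = "dap-master.dc1.example.com"        # hypothetical hostname
DC2_STANDBY = "dap-standby.dc2.example.com"  # hypothetical hostname
OUTAGE_THRESHOLD = 3  # consecutive failed checks before flagging DR

def healthy(host: str) -> bool:
    """Best-effort reachability check against the appliance /health endpoint."""
    try:
        resp = requests.get(f"https://{host}/health", timeout=5,
                            verify="/path/to/appliance-ca.pem")
        return resp.status_code == 200
    except requests.RequestException:
        return False

failures = 0
while True:
    failures = 0 if healthy(MASTER) else failures + 1
    if failures >= OUTAGE_THRESHOLD and healthy(DC2_STANDBY):
        # DC1 looks gone but the DC2 standby is reachable: this is the point
        # where an operator would log on to the DC2 standby, run the manual
        # promotion (`evoke role promote` per the failover docs -- verify for
        # your version), and re-point the master load balancer / DNS entry.
        print("DC1 master unreachable -- consider manual promotion of the DC2 standby")
        break
    time.sleep(30)
```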
Let me know if I’m missing something.
Thanks in Advance
If we reflect back on the Raft consensus employed here for Conjur's auto-failover: if every node were joined to the auto-failover cluster, with 3 nodes (the leader and 2 standbys) in DC1 and a single standby in DC2, and DC1 went down… there wouldn't be a quorum to promote the DC2 standby to leader.
I believe I misspoke in my original reply where I said that the standby would be promoted to leader – that was incorrect. I’ve edited that to reflect this.