Hi kumbirai, thanks for replying.
The deployment is in Kubernetes. There’s honestly not a lot of traffic (I’d have to look up numbers, but we’re talking tens of thousands of requests per day, not hundreds of thousands or millions or anything like that). Conjur is running in a single container in a pod and connects to a GCP-managed Postgres database. The database is all happy and there are no issues there from looking at its monitoring, slow query log, etc. So I turned my attention to the Ruby API (though I have not spent much time using or measuring the CLI).
I turned on debug-level logging and it became pretty noisy, but it also didn’t tell me anything from what I could gather. The actual error message my application got when connecting to the API was (replacing IP address and account/host info with x’s):
request to http://conjur/authn/xxx/host%2Fxxx%2Fxxx/authenticate failed, reason: connect ECONNREFUSED xxx.xxx.xxx.xxx:80
Pretty basic: just a connection-refused message. Some requests were still making it through, of course.
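As a stopgap on the client side, I’ve been considering something like the sketch below: a generic retry wrapper (not part of any Conjur SDK; the function name, attempt counts, and delays are all made up for illustration) that retries only when the failure looks like a refused connection, with a short exponential backoff:

```javascript
// Hypothetical retry helper for calls that intermittently fail with
// ECONNREFUSED (e.g. while the Conjur pod is busy). Retries up to
// `attempts` times with exponential backoff; rethrows other errors
// immediately. All names and defaults here are illustrative.
async function withRetry(fn, { attempts = 3, baseDelayMs = 200 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const refused = /ECONNREFUSED/.test(String(err && err.message));
      // Only retry refused connections; give up after the last attempt.
      if (!refused || i === attempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

This wouldn’t fix the underlying saturation, but it would smooth over the occasional refused connection instead of surfacing it straight to users.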
What I’ve noticed is that it takes a long time to update or create policies, and it seems to be taking longer and longer. The policies are nested under the root policy, and I believe best practices were followed (in fact, a former CyberArk employee set up the structure). It was taking maybe 3 seconds or so to load or update a simple policy, certainly no more than 5. Now it seems to take double that, and I’ve seen it take as long as 14 seconds.
So my thought was that while one of those API requests was being handled, it left the API too busy to handle other requests. Scaling the Kubernetes deployment up to multiple pods for load balancing seemed to introduce concurrency issues, with multiple requests trying to update the same resource at the same time (presumably users in the application refreshing and re-submitting because things were slow). Policy updates, grants, etc. are all slow; reads seem OK, though.
What I was able to do was increase WEB_CONCURRENCY and RAILS_MAX_THREADS, doubling their default values. Since doing that, I have not seen any connections refused in the past few days. Hooray! So that may have done the trick, at least for now, and answered my original question.
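For anyone following along, this is the kind of change I mean in the deployment’s pod spec (the numbers below are placeholders, not Conjur’s documented defaults; set them to double whatever your container currently uses):

```yaml
# Container env on the Conjur deployment — illustrative values only.
env:
  - name: WEB_CONCURRENCY      # Puma worker processes
    value: "4"
  - name: RAILS_MAX_THREADS    # threads per worker
    value: "10"
```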
Unless there is a safe way to run multiple pods, I assume these environment variables are one way (or the main way?) to scale the Ruby API.
I’m still concerned about the write performance here, though. Maybe I’m not making good use of policy branching?