Scaling the Ruby API?

Is there any info around getting the Ruby API to scale and perform faster? It’s always been slow loading/updating policies, but now I’m beginning to see connections being refused.

Auth is working and not all requests fail, of course, but some do, which I suspect is due to a bottleneck somewhere.

I increased WEB_CONCURRENCY and RAILS_MAX_THREADS (doubling them) and it may have helped (or traffic may have calmed down). We don’t have a ton of traffic either, so I’m surprised to be looking at scaling options here… but as best I can tell there’s a performance issue somewhere.
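For context on what those two variables actually control: the Conjur OSS server is a Rails app, typically served by Puma, and the stock Rails Puma config derives its process and thread counts from exactly these environment variables. A minimal sketch of that relationship (the fallback values of 2 workers and 5 threads are the common Rails defaults, not necessarily what your image ships with):

```ruby
# Sketch of how Puma typically derives its capacity from these env vars.
# Mirrors the stock Rails config/puma.rb; your image's config may differ.
workers = Integer(ENV.fetch("WEB_CONCURRENCY", 2))   # forked worker processes
threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5)) # threads per worker

# Rough upper bound on requests the server can handle concurrently.
# Requests beyond this queue up, and once the socket backlog fills,
# new connections can be refused outright.
capacity = workers * threads
puts "concurrent request capacity: #{capacity}"
```

One caveat worth checking when raising these: each thread usually needs its own database connection, so the Postgres connection pool (and GCP's connection limit) should be sized to at least workers × threads.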

Hi @tom_m, thank you for posting. Could you please provide some context around your deployment to help us dig into this further?

For example,

  • How the deployment is set up
  • The sort of traffic the API is handling
  • Logs around the connection refusals

Hi kumbirai, thanks for replying.

The deployment is in Kubernetes. There’s honestly not a lot of traffic (I’d have to look up numbers, but we’re talking about tens of thousands of requests per day, not hundreds of thousands or millions or anything like that). Conjur is running in a single container in a pod and connects to a GCP-managed Postgres database. The database is happy: looking at its monitoring, slow query log, etc., there are no issues there. So I turned my attention to the Ruby API (though I have not spent much time using or measuring the CLI).

I turned on debug-level logging and it became pretty noisy, but it also didn’t tell me anything from what I could gather. The actual error message my application received when connecting to the API was (IP address and account/host info replaced with x’s):

request to http://conjur/authn/xxx/host%2Fxxx%2Fxxx/authenticate failed, reason: connect ECONNREFUSED xxx.xxx.xxx.xxx:80

A pretty basic connection-refused message. However, some requests were making it through, of course.
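Since only some requests fail, one client-side mitigation (independent of fixing the server) is to treat ECONNREFUSED as transient and retry with backoff. This is a hypothetical sketch, not code from the application above; `do_request` style calls are stubbed with a block:

```ruby
# Hypothetical client-side mitigation: retry a request a few times with
# exponential backoff when the connection is refused. The block stands in
# for whatever HTTP call the application makes to the Conjur API.
def with_retries(attempts: 3, base_delay: 0.2)
  tries = 0
  begin
    yield
  rescue Errno::ECONNREFUSED
    tries += 1
    raise if tries >= attempts            # give up after N attempts
    sleep(base_delay * (2**(tries - 1)))  # 0.2s, 0.4s, ...
    retry
  end
end

# Usage sketch: simulate a request that fails twice, then succeeds.
calls = 0
result = with_retries do
  calls += 1
  raise Errno::ECONNREFUSED if calls < 3
  "authenticated"
end
```

This smooths over brief refusal windows but obviously doesn't address the underlying capacity problem.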

What I’ve noticed is that it takes a long time to update or create policies, and it seemingly is taking longer and longer. The policies are nested under the root policy, and I believe best practices were followed (in fact, a former CyberArk employee set up the structure). It was taking maybe 3 seconds or so to load or update a simple policy, certainly no more than 5. Now it seems to take double that, and I’ve seen it take as long as 14 seconds.

So my thought was that if one of those API requests was being made, it left the API too busy to handle other requests. Scaling up the Kubernetes deployment to multiple pods to load balance seemed to present some concurrency issues, with multiple requests trying to update the same resource at the same time (presumably users in the application refreshing and making the same request again because things were slow). Policy updates, grants, etc. are all slow; reads seem OK, though.

What I was able to do was increase WEB_CONCURRENCY and RAILS_MAX_THREADS, doubling their default values. Since doing that, I have not seen any connections being refused in the past few days. Hooray! So that may have done the trick, for now at least, and answered my original question.

Unless there is a way to bring on multiple pods, I assume these environment variables are one way (or the main way?) to scale the Ruby API.

I’m still concerned by the write performance here though. Maybe I’m not making good use of branching with policies?

Hi Tom, thank you for providing that context.

I’m still unsure about the ECONNREFUSED errors. I would like to carry out some load tests to see what I can find, and I’ll report back once I’m done.

There’s honestly not a lot of traffic (I’d have to look up numbers, but we’re talking about tens of thousands of requests per day, not hundreds of thousands or millions or anything like that)

I’m wondering if there are bursts at any point in the day. It’s likely that the server would struggle to handle some bursts. Would you be able to find out what the average and maximum requests per second are during any given day?
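To illustrate why the daily total alone doesn’t rule out bursts, here is a back-of-envelope calculation. The 50,000 requests/day and the 10%-in-5-minutes spike are illustrative assumptions, not measured figures:

```ruby
# Back-of-envelope: a modest daily total can still hide sharp bursts.
daily_requests = 50_000                  # illustrative, not measured
average_rps = daily_requests / 86_400.0  # 86,400 seconds in a day
puts format("average: %.2f req/s", average_rps)  # well under 1 req/s

# But if 10% of the day's traffic lands in a single 5-minute window
# (e.g. a morning login spike), the burst rate is far higher:
burst_rps = (daily_requests * 0.10) / 300.0
puts format("burst:   %.2f req/s", burst_rps)
```

A burst like that against a single pod with a handful of worker threads, each potentially blocked behind a multi-second policy write, is consistent with connections being refused while the daily average looks trivial.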

It was taking maybe 3 seconds or so to load or update a simple policy… certainly no more than 5.

I wouldn’t expect it to take that long. Would you be able to provide a sample policy for me to run some tests?

So my thought was that if one of those API requests was being made, it left the API too busy to handle other requests.

Many writes are transactional, so it’s likely that you’ll get 5XXs back as a result of PG preventing conflicting writes.

It might be worth exploring having all writes go to a single container and reads go to the others. However, if you have a lot of policy changes, then this approach might not be so useful.
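In practice that split would live in your ingress or load balancer rather than in application code, and it maps cleanly onto HTTP methods, since Conjur policy loads go through POST/PUT/PATCH while reads are GETs. A Ruby sketch of just the routing decision, with hypothetical backend names:

```ruby
# Sketch of a read/write split at the routing layer: mutating HTTP methods
# go to a single "writer" backend, everything else to the read replicas.
# Backend names are hypothetical; in practice this rule would be expressed
# in your ingress/load balancer config, not in Ruby.
WRITE_METHODS  = %w[POST PUT PATCH DELETE].freeze
WRITER_BACKEND = "conjur-writer:80".freeze
READER_BACKEND = "conjur-readers:80".freeze

def backend_for(http_method)
  WRITE_METHODS.include?(http_method.upcase) ? WRITER_BACKEND : READER_BACKEND
end
```

So a policy load (`PATCH /policies/...`) would land on the single writer, while secret and resource reads fan out across replicas, which matches the observation above that reads are fine and writes are the bottleneck.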

Would you be able to provide some context around the policy updates? How frequent are they, and what are the general structure and size of the policy tree?

Conjur OSS has a WIP branch that seeks to introduce telemetry, including some metrics that could provide insight into performance. Is that something you might be interested in exploring?