The capillaries-deploy project has been put on hold; Terraform is now the preferred tool for testing Capillaries deployments. Another round of scalability testing was performed. As before, we use setups where:
Here is what has changed since the last round:
As before, the portfolio test is used for performance benchmarking. The updated results are presented below.
| Deployment flavor | Cassandra instance | Daemon instance | Bastion/RabbitMQ/Prometheus instance | 4 Cassandra nodes (total cores / hourly cost) | 8 Cassandra nodes (total cores / hourly cost) | 16 Cassandra nodes (total cores / hourly cost) | 32 Cassandra nodes (total cores / hourly cost) |
|---|---|---|---|---|---|---|---|
| aws.arm64.c7g.8 | c7gd.2xlarge | c7g.large | c7g.large | 42 / $1.8141 | 82 / $3.5557 | 162 / $7.0389 | 322 / $14.0053 |
| aws.arm64.c7g.16 | c7gd.4xlarge | c7g.xlarge | c7g.large | 82 / $3.5557 | 162 / $7.0389 | 322 / $14.0053 | |
| aws.arm64.c7g.32 | c7gd.8xlarge | c7g.2xlarge | c7g.large | 162 / $7.0385 | 322 / $14.0045 | | |
| aws.arm64.c7g.64 | c7gd.16xlarge | c7g.4xlarge | c7g.large | 322 / $14.0045 | | | |
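For example, for the aws.arm64.c7g.8 flavor with 4 Cassandra nodes, the 42 total cores break down as 4 x 8 Cassandra cores + 4 x 2 Daemon cores + 2 cores for the bastion instance (the Terraform outputs below confirm there is one Daemon instance per Cassandra node).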
output_cassandra_cpus = "8 cpus ($0.362900/hr)" output_cassandra_hosts = 4 output_cassandra_total_cpus = 32 output_daemon_cpus = "2 cpus ($0.072500/hr)" output_daemon_instances = 4 output_daemon_thread_pool_size = 6 output_daemon_total_cpus = 8 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$1.814100/hr"
output_cassandra_cpus = "8 cpus ($0.362900/hr)" output_cassandra_hosts = 8 output_cassandra_total_cpus = 64 output_daemon_cpus = "2 cpus ($0.072500/hr)" output_daemon_instances = 8 output_daemon_thread_pool_size = 6 output_daemon_total_cpus = 16 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$3.555700/hr"
output_cassandra_cpus = "8 cpus ($0.362900/hr)" output_cassandra_hosts = 16 output_cassandra_total_cpus = 128 output_daemon_cpus = "2 cpus ($0.072500/hr)" output_daemon_instances = 16 output_daemon_thread_pool_size = 6 output_daemon_total_cpus = 32 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$7.038900/hr"
output_cassandra_cpus = "8 cpus ($0.362900/hr)" output_cassandra_hosts = 32 output_cassandra_total_cpus = 256 output_daemon_cpus = "2 cpus ($0.072500/hr)" output_daemon_instances = 32 output_daemon_thread_pool_size = 6 output_daemon_total_cpus = 64 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$14.005300/hr"
output_cassandra_cpus = "16 cpus ($0.725800/hr)" output_cassandra_hosts = 4 output_cassandra_total_cpus = 64 output_daemon_cpus = "4 cpus ($0.145000/hr)" output_daemon_instances = 4 output_daemon_thread_pool_size = 12 output_daemon_total_cpus = 16 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$3.555700/hr"
output_cassandra_cpus = "16 cpus ($0.725800/hr)" output_cassandra_hosts = 8 output_cassandra_total_cpus = 128 output_daemon_cpus = "4 cpus ($0.145000/hr)" output_daemon_instances = 8 output_daemon_thread_pool_size = 12 output_daemon_total_cpus = 32 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$7.038900/hr"
output_cassandra_cpus = "16 cpus ($0.725800/hr)" output_cassandra_hosts = 16 output_cassandra_total_cpus = 256 output_daemon_cpus = "4 cpus ($0.145000/hr)" output_daemon_instances = 16 output_daemon_thread_pool_size = 12 output_daemon_total_cpus = 64 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$14.005300/hr"
output_cassandra_cpus = "32 cpus ($1.451500/hr)" output_cassandra_hosts = 4 output_cassandra_total_cpus = 128 output_daemon_cpus = "8 cpus ($0.290000/hr)" output_daemon_instances = 4 output_daemon_thread_pool_size = 24 output_daemon_total_cpus = 32 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$7.038500/hr"
output_cassandra_cpus = "32 cpus ($1.451500/hr)" output_cassandra_hosts = 8 output_cassandra_total_cpus = 256 output_daemon_cpus = "8 cpus ($0.290000/hr)" output_daemon_instances = 8 output_daemon_thread_pool_size = 24 output_daemon_total_cpus = 64 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$14.004500/hr"
output_cassandra_cpus = "64 cpus ($2.903000/hr)" output_cassandra_hosts = 4 output_cassandra_total_cpus = 256 output_daemon_cpus = "16 cpus ($0.580000/hr)" output_daemon_instances = 4 output_daemon_thread_pool_size = 48 output_daemon_total_cpus = 64 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$14.004500/hr"
While overall performance numbers have improved, scalability patterns remain largely unchanged.
Selecting the thread_pool_size and writer_workers daemon parameters can feel somewhat mysterious. Common sense suggests that thread_pool_size should correlate with the number of CPUs/cores, while writer_workers reflects "how many writes per second my Cassandra cluster can handle". Let's examine a typical CPU usage pattern observed in a low-end VM setup.
There are clearly three zones in this CPU usage pattern, referenced throughout the discussion below:
- the "load" zone
- the "joins" zone
- the "Python calculations" zone
The objective is to maintain relatively high CPU utilization for both Daemon and Cassandra instances across all three zones, while also avoiding situations where too many daemon threads sit idle, waiting for Cassandra read/write operations to complete. Here are some tuning guidelines.
In the example shown in the CPU profile above, we used "thread_pool_size = number_of_cores x 3" and "writer_workers = 12":
output_cassandra_cpus = "8 cpus ($0.362900/hr)" output_cassandra_hosts = 4 output_cassandra_total_cpus = 32 output_daemon_cpus = "2 cpus ($0.072500/hr)" output_daemon_instances = 4 output_daemon_thread_pool_size = 6 output_daemon_total_cpus = 8 output_daemon_writer_workers = 12 output_daemon_writers_per_cassandra_cpu = 9 output_total_ec2_hourly_cost = "$1.814100/hr"
Why these values? Mainly because they allow us to avoid OOM errors during the "load" phase while maintaining relatively high Cassandra CPU usage during both the "load" and "joins" phases.
"thread_pool_size = number_of_cores x 3" may be overkill for the "Python calculations" zone - we could likely achieve similar performance with, say, "thread_pool_size = number_of_cores x 1.5". Howerver, Capillaries is not smart enough to throttle thread_pool_size dynamically based on the type of the transform type.
Is it worth increasing writer_workers to push Cassandra harder? Observations show that increasing writer_workers beyond a certain point yields marginal to negligible improvements.
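To make the interplay of these two knobs concrete, here is a minimal Go sketch of the concurrency model they suggest; this is an illustration under assumptions, not the actual Capillaries daemon code. thread_pool_size bounds the CPU-bound processing goroutines, while writer_workers caps in-flight Cassandra writes, so processing threads block once the write path is saturated:

```go
package main

import (
	"fmt"
	"sync"
)

// transform stands in for a CPU-bound processing step (hypothetical helper).
func transform(batch int) string { return fmt.Sprintf("rows-%d", batch) }

// writeToCassandra stands in for a Cassandra write (hypothetical helper).
func writeToCassandra(rows string) {}

func main() {
	const threadPoolSize = 6 // ~ number_of_cores x 3 on a 2-core daemon instance
	const writerWorkers = 12 // cap on in-flight Cassandra writes per daemon

	batches := make(chan int)
	writeSlots := make(chan struct{}, writerWorkers) // semaphore bounding writes

	var wg sync.WaitGroup
	for i := 0; i < threadPoolSize; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range batches {
				rows := transform(b)     // CPU-bound phase keeps daemon cores busy
				writeSlots <- struct{}{} // blocks when Cassandra is saturated
				writeToCassandra(rows)
				<-writeSlots
			}
		}()
	}
	for b := 0; b < 100; b++ {
		batches <- b
	}
	close(batches)
	wg.Wait()
}
```

Under this model, raising writer_workers only helps while Cassandra has spare write capacity; once the cluster is the bottleneck, extra slots just queue more work, which matches the diminishing returns observed above.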
There is a noticeable dip in Cassandra CPU usage at the start of the "load" phase. My theory is that the start of the dip marks the moment when the Capillaries daemon's garbage collector kicks in, consuming CPU cycles that would otherwise support Cassandra writes. This Prometheus expression, which computes the percentage of memory in use as 100 x (1 - (free + cached + buffers) / total) averaged over one-minute windows, shows memory usage:

`100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[1m]) + avg_over_time(node_memory_Cached_bytes[1m]) + avg_over_time(node_memory_Buffers_bytes[1m])) / avg_over_time(node_memory_MemTotal_bytes[1m])))`
For this test, GOMEMLIMIT was set to 75% of total RAM. However, because this is a soft limit (and there are processes other than the Capillaries daemon running on each instance), daemon memory consumption stabilizes around 85%. While this may be too aggressive for production environments, it makes sense for performance testing.
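As a minimal sketch of how such a limit can be applied (an illustration under assumptions, not the actual Capillaries startup code), a Go process can derive the 75% figure from /proc/meminfo at startup:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// memTotalBytes parses MemTotal from /proc/meminfo (Linux only).
func memTotalBytes() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found in /proc/meminfo")
}

func main() {
	total, err := memTotalBytes()
	if err != nil {
		panic(err)
	}
	// Soft limit at 75% of total RAM; the GC works harder as the heap
	// approaches it, but the runtime may still exceed it.
	debug.SetMemoryLimit(total * 3 / 4)
}
```

Exporting the GOMEMLIMIT environment variable with the same byte value before starting the daemon has the same effect, without any code changes.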
More expensive setups (with more instances and/or more powerful instances) complete the test faster, but tend to show a higher cost per run. Why? Why is the black dot on both charts so far from the theoretical optimum? Here is the CPU usage profile for a 4-node Cassandra cluster with 64 cores per node:
The simple answer: we were not able to sufficiently engage the daemon instances or Cassandra nodes - CPU usage remained relatively low. But, again, why? There may be a number of explanations for that.
There could be other factors. Further (more expensive) testing with larger datasets might provide answers.