GurimoA

Rails Is Slow

Rails is slow, is what you hear from people when they complain about how under performing a Rails application is, but why is Rails slow? I've heard various kind of responses from people over time, "because Ruby is slow", "Because Rails is bloated" but is that true? That is what I will attempt to demystify in this post by performing some simple tests.

The Tests

I have decided to perform the tests by having just Rails out of the box as it is with 2 different Rack application server, which I think are the most commonly used among Rails deployment, Unicorn and Puma.

What I am testing here is the requests per second handled by the application server, specifically the the application logic. What I am not testing here is serving of static assets which any experienced Rails developer should know is best fulfilled by Nginx.

For the tests I have prepared 2 endpoints:

To test the endpoints I will be sending traffic to them with Plow.

In order for the tests to be as transparent as possible I've made a GitHub repository that you can use to run the test on your own machine with ease because it is dockerized.

Rails With Unicorn

The first test will be Unicorn with 1 worker and 1 CPU, which can be started after building the image with:

docker run -p 8080:3000 --cpus 1 -e RAILS_ENV=production magi-ks-rails-benchmark bundle exec unicorn -c config/unicorn.conf

testing with the / endpoint with concurrency of 1:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 1 'http://localhost:8080/'

We get this result:

  Elapsed    37.564s
  Count        20955
    2xx        20955
  RPS        557.834
  Reads    0.292MB/s
  Writes   0.031MB/s

Statistics   Min     Mean    StdDev    Max
  Latency   1.32ms  1.786ms  407µs   17.923ms
  RPS       526.04  557.79   11.96    578.05

Latency Percentile:
  P50        P75      P90      P95      P99    P99.9    P99.99
  1.725ms  1.865ms  2.064ms  2.295ms  2.934ms  8.06ms  14.606ms

Latency Histogram:
  1.714ms   16829  80.31%
  1.943ms    3244  15.48%
  2.278ms     623   2.97%
  2.956ms     235   1.12%
  4.851ms      13   0.06%
  9.192ms       6   0.03%
  10.99ms       4   0.02%
  14.606ms      1   0.00%

557 requests per second, looks pretty good keep in mind this is without any complicated application logic, taking into account application logic it can easily fall to 20% of this throughput.

Now lets try the /simulated_io endpoint with delay of 1 second and concurrency of 1:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 1 'http://localhost:8080/simulated_io?delay=1'

And the result we got is:

  Elapsed     42.44s
  Count           42
    2xx           42
  RPS          0.990
  Reads    0.001MB/s
  Writes   0.000MB/s

Statistics    Min       Mean     StdDev     Max
  Latency   1.00556s  1.007463s  868µs   1.009704s
  RPS         0.49      0.98      0.07       1

Latency Percentile:
  P50         P75        P90        P95        P99       P99.9     P99.99
  1.0076s  1.008028s  1.008325s  1.008532s  1.009704s  1.009704s  1.009704s

Latency Histogram:
  1.005696s   3   7.14%
  1.006784s  10  23.81%
  1.007571s  17  40.48%
  1.007958s   5  11.90%
  1.008378s   5  11.90%
  1.009067s   2   4.76%

The Performance Thief

Nearly 1 request per second! what happened here? Is Ruby slow? Is Rails slow? it is neither. It has more to do with how IO is handled by the process, IO here can be anything from Network IO to Disk IO. When the code executes to the point where it needs to perform IO, it needs to wait for the result of the IO before being able to continue with the code execution.

What really happened here is that we simulated IO with the sleep system call, which tells the OS scheduler to not run the process until after 1 second has passed, with an actual IO operation it'll be similar because the process would be waiting for the IO result and not able to continue execution until the IO result is returned.

Ok, so the IO is blocking the code execution for that particular request the process can handle other request at the mean time yes?

Let's try it out with /simulated_io endpoint with delay of 1 second and concurrency of 5:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 5 'http://localhost:8080/simulated_io?delay=1'

We get this result:

  Elapsed   1m2.331s
  Count           61
    2xx           61
  RPS          0.979
  Reads    0.001MB/s
  Writes   0.000MB/s

Statistics     Min       Mean      StdDev      Max
  Latency   1.006888s  4.868341s  692.09ms  5.036783s
  RPS         0.49       0.99       0.06        1

Latency Percentile:
  P50           P75        P90        P95        P99       P99.9     P99.99
  5.032965s  5.034704s  5.035542s  5.036034s  5.036783s  5.036783s  5.036783s

Latency Histogram:
  1.006888s   1   1.64%
  2.014008s   1   1.64%
  3.019441s   1   1.64%
  4.025659s   1   1.64%
  5.031101s   9  14.75%
  5.032392s  20  32.79%
  5.034354s  19  31.15%
  5.035819s   9  14.75%

Still about 1 request per second but wait, the average response time has gone up to 5 seconds! what happened here? The problem here is that the Unicorn application server is only able to handle 1 request at a time, if the request needs to wait for 1 second for the IO to complete then the next request inline will have to wait for the first request to finish and the 3rd request would have to wait for both the first and second request to finish before being processed and so on.

"This is why Rails can't scale!" you might shout but that is not entirely true. If we take "scaling" as being able to handle X amount of request per second then all you need to do is to increase the worker count.

Starting the Unicorn application server with 1 CPU and 5 workers:

docker run -p 8080:3000 --cpus 1 -e RAILS_ENV=production -e WORKER_COUNT=5 magi-ks-rails-benchmark bundle exec unicorn -c config/unicorn.conf

Then we hit /simulated_io endpoint with the same delay of 1 second and concurrency of 5:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 5 'http://localhost:8080/simulated_io?delay=1'

Result:

Summary:
  Elapsed    30.523s
  Count          150
    2xx          150
  RPS          4.914
  Reads    0.003MB/s
  Writes   0.000MB/s

Statistics     Min       Mean     StdDev      Max
  Latency   1.003883s  1.008427s  4.677ms  1.033041s
  RPS         2.49       4.91      0.46        5

Latency Percentile:
  P50           P75        P90        P95        P99       P99.9     P99.99
  1.007395s  1.008314s  1.009335s  1.019545s  1.032231s  1.033041s  1.033041s

Latency Histogram:
  1.005962s  19  12.67%
  1.007085s  63  42.00%
  1.008411s  56  37.33%
  1.012531s   7   4.67%
  1.024546s   1   0.67%
  1.025855s   1   0.67%
  1.030089s   1   0.67%
  1.032636s   2   1.33%

Cool we get 5 requests per second even with the simulated IO, all we gotta do is to have the amount of worker that is equivalent to the highest concurrent request we'll ever get, problem solved.

While that is correct but what is the cost? Unicorn creates more worker by process forking, each process would require their own memory space. An average Rails application would be around 200MB per process so if you need to be able to handle 100 concurrent request then you will need at least 100 * 200MB or memory available which is 20GB of RAM just to handle 100 concurrent request within 1 second.

Well that sounds expensive to run but it scales I guess, no choice but to fork the cash over to the cloud provider. Well not so fast, Unicorn isn't the only Rack application server available, we have also Puma.

Rails With Puma

Puma is the default application server for Rails since Rails 5 and there is a good reason for it. Puma is able to have worker threads, a thread requires a lot less memory to maintain compared to a another process. So puma is be able to achieve what Unicorn is able to achieve with much less memory requirement.

Let's do some test, we can start up Puma with 1 CPU and 100 threads:

docker run -p 8080:3000 --cpus 1 -e RAILS_MAX_THREADS=100 -e RAILS_ENV=production magi-ks-rails-benchmark bundle exec rails s -b 0.0.0.0

And hit the simulated_io endpoint with delay of 1 second and concurrency of 100:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 100 'http://localhost:8080/simulated_io?delay=1'

Then we inspect the container's stats with docker stats:

CONTAINER ID   NAME               CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O     PIDS
7bed122ecc9e   xenodochial_pike   40.05%    140.4MiB / 15.4GiB   0.89%     1.09MB / 4.53MB   0B / 8.66MB   107

Result from Plow:

  Elapsed  1m22.712s
  Count         8090
    2xx         8090
  RPS         97.809
  Reads    0.046MB/s
  Writes   0.007MB/s

Statistics     Min       Mean     StdDev      Max
  Latency   1.001338s  1.015458s  28.11ms  1.314588s
  RPS         49.98      98.5      6.04      100.1

Latency Percentile:
  P50           P75        P90        P95        P99       P99.9     P99.99
  1.006184s  1.013806s  1.035318s  1.054969s  1.164423s  1.279384s  1.314588s

Latency Histogram:
  1.00544s   4242  52.44%
  1.010906s  2314  28.60%
  1.027732s   916  11.32%
  1.058625s   459   5.67%
  1.11265s    116   1.43%
  1.260034s    37   0.46%
  1.285833s     5   0.06%
  1.314588s     1   0.01%

So we manged to achieve 100 requests per second even with the simulated IO workload by having 100 worker threads, and all the request was served in about 1 second. For comparison sake let's try to achieve the same thing with Unicorn.

Unicorn with 1 CPU and 100 workers:

docker run -p 8080:3000 --cpus 1 -e RAILS_ENV=production -e WORKER_COUNT=100 magi-ks-rails-benchmark bundle exec unicorn -c config/unicorn.conf

Plow hitting simulated_io endpoint with 100 concurrency:

docker run --rm --net=host ghcr.io/six-ddc/plow -c 100 'http://localhost:8080/simulated_io?delay=1'

Docker stats:

CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT    MEM %     NET I/O           BLOCK I/O        PIDS
bb46d11eaa1d   beautiful_neumann   68.40%    4.069GiB / 15.4GiB   26.41%    5.32MB / 10.4MB   209kB / 12.4MB   202

Plow result:

  Elapsed  2m44.102s
  Count        15497
    2xx        15497
  RPS         94.435
  Reads    0.049MB/s
  Writes   0.007MB/s

Statistics     Min       Mean     StdDev       Max
  Latency   1.002559s  1.05568s  379.318ms  7.698262s
  RPS         14.99     94.92      15.69     100.12

Latency Percentile:
  P50           P75        P90        P95       P99      P99.9     P99.99
  1.008493s  1.009552s  1.010977s  1.080642s  2.99003s  6.60033s  7.698262s

Latency Histogram:
  1.018072s  15219  98.21%
  1.520166s    138   0.89%
  3.486064s     50   0.32%
  4.470955s     43   0.28%
  5.667286s     23   0.15%
  6.176949s     12   0.08%
  7.030209s     11   0.07%
  7.698262s      1   0.01%

Look at that memory usage difference between Unicorn and Puma 4GB VS 140MB and Unicorn had some hiccups with some request taking up to 7 seconds to serve. Keep in mind these are the results of this synthetic benchmark to illustrate an example, real world results might be slightly different but in line with what we see here.

Well I guess all we need to do is to switch to Puma and add lots of worker thread, yes but there's a catch, you need to make sure your application is thread safe if you wish to use Puma worker threads. Rails and its dependencies are thread safe since Rails 3 but you need to make sure that the Gems that you depend on are thread safe on your own.

What does thread safe mean? Because different threads can run at different times we need to make sure that there are no shared state between the threads. A mundane example would be a global variable that your code depends on, if a request stores a temporary variable on the global variable, then the OS scheduler decides to swap the thread out to work on another thread that also writes to the same global variable, when the original thread gets swapped back to continue running the value stored on the global variable might not be what we wanted any more. This can be a huge pain to debug as we may not know exactly how it happens.

So what have we learned here?

"That is a lot of things and caveats to know about to get performant Rails" you may think but this is only the most common case, there are other ways to squeeze performance out of Ruby and Rack applications.

This concludes the brief look through on why Rails is slow, the GitHub repository to run the benchmark shown in the post is located here feel free to play around or use it to spread the knowledge.