From 5ff790b8d2b04152d0bf851cbf759f3c33cf53e7 Mon Sep 17 00:00:00 2001
From: Loïc Hoguin
Date: Wed, 25 Mar 2020 14:09:51 +0100
Subject: Gun 2.0.0-pre.2 and Cowlib 2.9.0

---
 .../content/articles/cowboy2-performance.asciidoc | 80 ++++++++++++++++++++++
 1 file changed, 80 insertions(+)
 create mode 100644 _build/content/articles/cowboy2-performance.asciidoc

diff --git a/_build/content/articles/cowboy2-performance.asciidoc b/_build/content/articles/cowboy2-performance.asciidoc
new file mode 100644
index 00000000..269be791
--- /dev/null
+++ b/_build/content/articles/cowboy2-performance.asciidoc
@@ -0,0 +1,80 @@
++++
+date = "2020-12-07T07:00:00+01:00"
+title = "Cowboy 2 performance"
+
++++
+
+https://github.com/sponsors/essen[You can now reward my work via GitHub Sponsors].
+
+Recently an article was published by Stressgrid entitled
+https://stressgrid.com/blog/cowboy_performance/[Survey of Cowboy Webserver Performance]
+that compares Cowboy performance across its different versions
+as well as Erlang/OTP versions. The results are not very surprising
+to me personally (although the drop is bigger than I expected),
+but they might surprise others.
+
+This prompted an experiment that I will now describe in two parts.
+The first part is about modifying Cowboy to use `active,N` instead
+of `active,once` to reduce the amount of time spent in the TCP
+driver. The second part is about writing a stream handler in order
+to squeeze the most performance out of Cowboy 2.
+
+In order to support both HTTP/1.1 and HTTP/2 with a common interface
+(as well as HTTP/3 in the future), Cowboy 2 switched from the model
+of "one process per connection" to "one process per connection +
+one process per request". This is required because from HTTP/2
+onward requests are processed concurrently rather than sequentially,
+not to mention that the protocols include a number of control
+messages which must be handled at the same time.
+
+But this necessarily has some impact on the performance of HTTP/1.1
+connections, and this is what the Stressgrid benchmarks show. Note
+that while I will demonstrate in this article that it is indeed the
+use of multiple processes that causes this reduction in performance,
+I do not really know why that is.
+
+After reading the blog post I started experimenting. I took Cowboy's
+`hello_world` example and added https://github.com/rabbitmq/looking_glass[Looking Glass]
+to the release. I then ran a quick benchmark against the example with
+Looking Glass enabled:
+
+``` erlang
+$ make run
+...
+(hello_world_example@host)1> lg:trace([
+    {app, ranch}, {app, cowlib}, {app, cowboy}, {app, stdlib}
+], lg_file_tracer, "traces.lz4", #{mode => profile, running => true}).
+ok
+... Run the benchmark here for a few seconds.
+(hello_world_example@host)2> lg:stop().
+ok
+(hello_world_example@host)3> lg_callgrind:profile_many("traces.lz4.*", "callgrind.out", #{running => true}).
+ok
+(hello_world_example@host)4> q().
+...
+$ qcachegrind _rel/hello_world_example/callgrind.out
+```
+
+The benchmark itself can be done with `wrk`, for example:
+
+``` bash
+$ wrk -c100 -d10s http://localhost:8080
+```
+
+The benchmark results don't matter; what we want is to see what
+`qcachegrind` tells us about what happened in the system while
+the benchmark was running.
+
+// @todo Need to run the above again in order to extract a picture to put here.
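+
+Before digging into the profile, here is a minimal sketch of the
+`active,once` pattern in question. This is hypothetical code, not
+Cowboy's actual implementation (the module name and `handle/1` are
+made up): it only illustrates that an `active,once` socket must be
+re-armed with a `setopts/2` call before every single packet.
+
+``` erlang
+%% Minimal sketch; NOT Cowboy's actual code.
+-module(once_loop).
+-export([loop/1]).
+
+loop(Socket) ->
+    %% One setopts/2 call per packet received.
+    ok = inet:setopts(Socket, [{active, once}]),
+    receive
+        {tcp, Socket, Data} ->
+            handle(Data),
+            loop(Socket);
+        {tcp_closed, Socket} ->
+            ok
+    end.
+
+handle(_Data) ->
+    ok. %% Request processing would go here.
+```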
+
+What we can see in the above picture is that around 8% of the
+active time (the time when processes are not waiting for messages)
+is spent in `ranch_tcp:setopts/2`. This is when Cowboy sets
+`active,once`. Turns out this is really expensive, at least
+in synthetic benchmarks.
+
+A few years ago Steve Vinoski added `active,N` to Erlang/OTP
+to reduce the amount of time spent in the TCP driver. Instead
+of having to call `setopts/2` for every packet we want to get
+from the socket, we can tell the driver how many packets we
+want and reduce the number of `setopts/2` calls, as the sketch
+below illustrates.
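+
+The following is a minimal sketch of the same loop using `active,N`.
+Again, this is hypothetical code rather than what Cowboy actually
+does: the driver delivers up to N packets as messages, then sends a
+`{tcp_passive, Socket}` notification, and only at that point does
+the loop need another `setopts/2` call.
+
+``` erlang
+%% Minimal sketch; NOT Cowboy's actual code.
+-module(n_loop).
+-export([loop/1]).
+
+%% Arbitrary value chosen for illustration.
+-define(ACTIVE_N, 100).
+
+loop(Socket) ->
+    %% One setopts/2 call per N packets received.
+    ok = inet:setopts(Socket, [{active, ?ACTIVE_N}]),
+    active_loop(Socket).
+
+active_loop(Socket) ->
+    receive
+        {tcp, Socket, Data} ->
+            handle(Data),
+            active_loop(Socket);
+        {tcp_passive, Socket} ->
+            %% All N packets were delivered and the socket is
+            %% back in passive mode: re-arm it.
+            loop(Socket);
+        {tcp_closed, Socket} ->
+            ok
+    end.
+
+handle(_Data) ->
+    ok. %% Request processing would go here.
+```
+
+The `setopts/2` overhead is thus amortized over N packets instead
+of being paid for every single one.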