+++
date = "2020-12-07T07:00:00+01:00"
title = "Cowboy 2 performance"

+++

https://github.com/sponsors/essen[You can now reward my work via GitHub Sponsors].

Recently an article was published by Stressgrid entitled
https://stressgrid.com/blog/cowboy_performance/[Survey of Cowboy Webserver Performance]
that compares Cowboy performance across the different versions
as well as Erlang/OTP versions. The results are not very surprising
to me personally (although the drop is bigger than I expected),
but they might be to others.

This prompted an experiment that I will now describe in two parts.
The first part is about modifying Cowboy to use `active,N` instead
of `active,once` to reduce the amount of time spent in the TCP
driver. The second part is about writing a stream handler in order
to squeeze the most performance out of Cowboy 2.

In order to support both HTTP/1.1 and HTTP/2 with a common interface
(as well as HTTP/3 in the future), Cowboy 2 switched from the model
of "one process per connection" to "one process per connection +
one process per request". This is required because from HTTP/2
onward requests are processed concurrently rather than sequentially,
not to mention the protocols include a number of control messages
that must be handled at the same time.
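
To make this more concrete, here is a minimal sketch along the lines
of Cowboy's `hello_world` handler (the module name here is made up).
In the Cowboy 2 model the `init/2` callback runs in a request process
spawned by the connection process, rather than directly in the
connection process as in Cowboy 1:

``` erlang
-module(hello_handler).
-export([init/2]).

%% In Cowboy 2 this callback executes in a dedicated request process,
%% one per request, spawned by the connection process.
init(Req0, State) ->
    Req = cowboy_req:reply(200,
        #{<<"content-type">> => <<"text/plain">>},
        <<"Hello world!">>,
        Req0),
    {ok, Req, State}.
```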

But this necessarily has some impact on the performance of HTTP/1.1
connections, and this is what the Stressgrid benchmarks show. Note
that while I will demonstrate in this article that the use of
multiple processes is indeed what causes this reduction in
performance, I do not really know why it has this effect.

After reading the blog post I started experimenting. I took Cowboy's
`hello_world` example and added https://github.com/rabbitmq/looking_glass[Looking Glass]
to the release. I then ran a quick benchmark against the example with
Looking Glass enabled:

``` erlang
$ make run
...
(hello_world_example@host)1> lg:trace([
    {app, ranch}, {app, cowlib}, {app, cowboy}, {app, stdlib}
], lg_file_tracer, "traces.lz4", #{mode => profile, running => true}).
ok
... Run the benchmark here for a few seconds.
(hello_world_example@host)2> lg:stop().
ok
(hello_world_example@host)3> lg_callgrind:profile_many("traces.lz4.*", "callgrind.out", #{running => true}).
ok
(hello_world_example@host)4> q().
...
$ qcachegrind _rel/hello_world_example/callgrind.out
```

The benchmark can be done with `wrk` for example:

``` bash
$ wrk -c100 -d10s http://localhost:8080
```

The benchmark results don't matter; what we want is to see what
`qcachegrind` tells us about what happened in the system while
the benchmark was running.

// @todo Need to run the above again in order to extract a picture to put here.

What we can see in the above picture is that around 8% of the
active time (the time when processes are not waiting for messages)
is spent in `ranch_tcp:setopts/2`. This is when Cowboy sets
`active,once`. It turns out this is really expensive, at least
with synthetic benchmarks, if not more generally.

A few years ago Steve Vinoski added `active,N` to Erlang/OTP
to reduce the amount of time spent in the TCP driver. Instead
of having to call `setopts/2` for every packet we want to get
from the socket, we can tell the driver how many packets we
want and reduce the number of `setopts/2` calls.
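
As a rough illustration (this is not Cowboy's actual code, and
`handle_data/1` is just a placeholder), the difference between the
two receive loops looks like this. With `active,once` we must call
`setopts/2` before every packet; with `active,N` we only call it
again once the `{tcp_passive, Socket}` message tells us the quota
has been used up:

``` erlang
%% With active,once we call setopts before every single packet.
loop_once(Socket) ->
    ok = inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, Socket, Data} ->
            handle_data(Data),
            loop_once(Socket);
        {tcp_closed, Socket} ->
            ok
    end.

%% With active,N the driver delivers up to N packets on its own and
%% then sends {tcp_passive, Socket}; only then do we call setopts again.
loop_n(Socket) ->
    ok = inet:setopts(Socket, [{active, 100}]),
    loop_n_receive(Socket).

loop_n_receive(Socket) ->
    receive
        {tcp, Socket, Data} ->
            handle_data(Data),
            loop_n_receive(Socket);
        {tcp_passive, Socket} ->
            loop_n(Socket);
        {tcp_closed, Socket} ->
            ok
    end.
```

Cowboy goes through Ranch's transport abstraction, so it calls
`ranch_tcp:setopts/2` rather than `inet:setopts/2` directly, but the
principle is the same.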