What does HackerNews think of wrk2?

A constant throughput, correct latency recording variant of wrk

Language: C

If you want a simple load testing tool for HTTP, use wrk2[1].

    wrk -t2 -c100 -d30s -R2000 http://127.0.0.1:8080/index.html
> This runs a benchmark for 30 seconds, using 2 threads, keeping 100 HTTP connections open, and a constant throughput of 2000 requests per second (total, across all connections combined).

Some distros include `ab`[2] which is also good, but wrk2 improves on it (and on wrk version 1) in multiple ways, so that's what I use myself.

[1] https://github.com/giltene/wrk2

[2] https://httpd.apache.org/docs/2.2/programs/ab.html

I agree, sounds like a good approach.

Can the instance do 2 GB/s to disk at the same time it is doing 3.1 GB/s across the network? Is that bidirectional capacity, or in a single direction? How many threads does it take to achieve those numbers?

That is kind of a nice property, that the network has 50% more bandwidth than the disk. 2x would be even nicer, but that works out to 1.5 and 3 GB/s, which would mean a slight reduction in disk throughput.

Are you able to run a single RP Kafka node and blast data into it over loopback? That would take the network out of the picture and show how much of the available disk bandwidth a single node can achieve over different payload configurations, before moving on to a distributed disk+network test. If it can only hit 1 GB/s on a single node, you know there is room to improve in the write path to disk.

The other thing that people might be looking for when using RP over AK is less jitter due to GC activity. For latency-sensitive applications this can be way more important than raw throughput. I'd use, or borrow some techniques from, wrk2, which makes sure to account for coordinated omission.

https://github.com/giltene/wrk2

https://github.com/fede1024/kafka-benchmark

https://github.com/fede1024/rust-rdkafka

Note that Gil Tene has his own wrk fork. The readme has some interesting notes about wrk: https://github.com/giltene/wrk2

There is also an open issue on GitHub questioning wrk's approach to resolving coordinated omission: https://github.com/wg/wrk/issues/485

This is a deep, ultimately unsatisfying, rabbit hole to fall into.

Needs (2015).

I loved the talks from Gil Tene.

I always reach for his fork of wrk whenever I need to test throughput:

https://github.com/giltene/wrk2

i use https://github.com/giltene/wrk2 pretty regularly.

it has decent lua hooks to customize behavior but i use it in the dumbest way possible to hammer a server at a fixed rate with the same payload over and over.
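
for illustration, a request-customizing script along these lines can be passed with -s (a rough sketch based on wrk's lua scripting interface; the script name, payload, and endpoint are made up):

    -- post.lua: send the same JSON payload on every request and print a
    -- couple of percentiles at the end of the run
    wrk.method = "POST"
    wrk.body   = '{"id": 1, "payload": "hello"}'
    wrk.headers["Content-Type"] = "application/json"

    -- called once after all requests complete; latency values are in microseconds
    function done(summary, latency, requests)
       io.write(string.format("p50: %.2f ms\n", latency:percentile(50.0) / 1000))
       io.write(string.format("p99: %.2f ms\n", latency:percentile(99.0) / 1000))
       io.write(string.format("%d requests in %.1f s\n",
                              summary.requests, summary.duration / 1e6))
    end

something like wrk -t2 -c50 -d60s -R1000 -s post.lua http://127.0.0.1:8080/ would then hammer the server at a fixed 1000 requests per second with that payload.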

i run it by hand after a big change to the server to make sure nothing obviously regressed. i used to run it nightly in a jenkins job but 99% of the time no one looked at the results. it was nice for spotting when assumptions about how much load a single node could handle no longer held.

I so deeply wish for a set of benchmarks that accepted only idiomatic, standard code that a slightly above average developer would plausibly write.

Pretty much every set of benchmarks out there is ruined by monstrous, unrealistic code that squeezes out 5% more performance than actually well-written code.

As a side note, one flaw of TechEmpower benchmarks is that to my knowledge they use the original "wrk" tool, which only supports HTTP 1.0 and is sensitive to "coordinated omission" [0]. This means it ends up being biased in favor of web frameworks that implement specific optimizations or have specific behaviors that aren't actually useful in a production context.

[0] https://github.com/giltene/wrk2

I'm not disagreeing with your results, but you should be using benchmarks based on https://github.com/giltene/wrk2 to avoid coordinated omission errors in measuring latency.

> I know there's much better ways of testing load/performance.

such as

https://github.com/giltene/wrk2

For very high traffic loads, I have yet to find anything that compares to wrk (https://github.com/giltene/wrk2). IME, it scales better and more efficiently than anything else I've tried, short of complex distributed tools that are only relevant to a small number of people.

The explanations here gave a lot of detail on the effect but, IMHO, not as much on the cause of Coordinated Omission (CO). Most of what I'll be saying here comes from a CMU paper titled "Open vs Closed: A Cautionary Tale"[1] and from Gil Tene's talk.

First, some terminology which I think is important for the discussion. When I say 'job', this could be something like a user, an HTTP request, an RPC call, a network packet, or some other task the system is asked to do and can accomplish in some finite amount of time.

Closed-loop system, aka closed system - a system where new job arrivals are only triggered by job completions. Some examples are an interactive terminal, or batch systems like a CI build system.

Open-loop system, aka open system - a system where new job arrivals are independent of job completions. Some examples are requests for the front page of Hacker News, or packets arriving at a network switch.

Partly-open system - a system where new jobs arrive from some outside process as in an open system, and every time a job completes there is a probability p that it makes a follow-up request, or a probability (1 - p) that it leaves the system. Some examples are web applications, where users request a page and make follow-up requests, but each user is independent, and new users arrive and leave on their own.

Second, workload generators (e.g. JMeter, ab, Gatling, etc.) can also be classified similarly. Workload generators that issue a request and then block to wait for a response before making the next request are based on a closed system (e.g. JMeter[2], ab). Generators that continue to issue requests independently of the response rate, regardless of the system's throughput, are based on an open system (e.g. Gatling, wrk2[3]).

Now, CO happens whenever a workload generator based on a closed system is used against an open or partly-open system, and the throughput of the system under load falls below the injection rate of the workload generator.

For the sake of simplicity, assume we have an open system, say a simple web page, where users arrive according to some probability distribution, request the page, and then 'leave'. Assume the arrival distribution is trivial: a request arrives every second with probability 1.0.

In this example, if we use a workload generator based on a closed system to simulate this workload for 100 seconds, and the system under load never slows down, continuing to serve every response in under 1 second, say always in 500 ms, then there is no CO here. In the end we will have 100 samples with a response time of 500 ms, and all the statistics (min, max, avg, etc.) will be 500 ms.

Now, say we are using the same workload generator at an injection rate of 1 request/s, but this time the system under load behaves as before for the first 50 seconds, with responses taking 500 ms, and then stalls for the last 50 seconds.

Since the system under load is an open system, we should expect 50 samples with a response time of 500 ms, and 50 samples where response times linearly decrease from 50s to 1s (the requests that arrive during the stall all complete once it ends). The statistics then would be:

min=500ms, max=50s, avg=13s, median=0.75s, 90%ile=45.05s

But because we used a closed-system workload generator, our samples are skewed. Instead, we get 50 samples of 500ms and only 1 sample of 50 seconds! This happens because the injection rate is slowed down by the response rate of the system. As you can see, this is not even the workload we intended, because our workload generator essentially backed off when the system stalled. The stats now look like this:

min=500ms, max=50s, avg=1.47s, median=500ms, 90%ile=500ms.
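
A quick way to sanity-check these numbers is to simulate both generator styles against the same stalling server (a toy sketch in plain Lua, not a wrk script; finish_time simply encodes the scenario described above):

    -- model of the server: given the moment a request is sent, return the
    -- moment its response arrives (500 ms responses for the first 50 s,
    -- then a stall, with everything pending completing at t = 100 s)
    local function finish_time(sent)
      if sent < 50 then return sent + 0.5 end
      return 100
    end

    local function avg(t)
      local sum = 0
      for _, v in ipairs(t) do sum = sum + v end
      return sum / #t
    end

    -- open-system generator: one request per second, no matter what
    local open = {}
    for sent = 0, 99 do
      open[#open + 1] = finish_time(sent) - sent
    end

    -- closed-system generator at a nominal 1 req/s: the next send slips
    -- whenever the previous response takes longer than the 1 s interval
    local closed, now = {}, 0
    while now < 100 do
      local done = finish_time(now)
      closed[#closed + 1] = done - now
      now = math.max(done, now + 1)
    end

    print(#open, avg(open))       -- 100 samples, average 13 s
    print(#closed, avg(closed))   -- 51 samples, average ~1.47 s

The closed generator simply stops sampling while it is blocked on the stalled request, which is exactly the omission being described.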

[1] [pdf] http://repository.cmu.edu/cgi/viewcontent.cgi?article=1872&c...

[2] http://jmeter.512774.n5.nabble.com/Coordinated-Omission-CO-p...

[3] https://github.com/giltene/wrk2

Imagine a simple load testing tool that sends a request, measures time-to-response, waits a second, and sends another request. Such a tool might report a "99th percentile latency" by simply taking the second-worst time-to-response from 100 samples.

However, this statistic is misleading. Imagine a system that responds in 1 second to 98 of the requests, in 2 seconds to the second-worst request (the reported "99th percentile latency"), and in 100 seconds to the worst request. If this system were to process a continuous stream of requests - e.g. 100 requests from visitors to a web site - half of all visitors would arrive while the system was processing the worst-case request (which takes 100s of a total 200s of run time). So the customer with the 99th-percentile worst experience will see about 100s of latency, not 2s!

E.g. the author's https://github.com/giltene/wrk2 tries to avoid this "coordinated omission" problem by sending requests at a constant rate, instead of constant-time-after-response.
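
A rough sketch of that idea (plain Lua, not wrk2's actual implementation; the single-connection model and the 100-second outlier are made up to echo the example above) is to schedule sends on a fixed timeline and charge latency from the scheduled time, even when a slow response has delayed the actual send:

    local interval = 1.0   -- one request scheduled per second
    local samples  = {}
    local free_at  = 0     -- when our single connection is next free to send

    -- stand-in for the system under test: 1 s normally, one 100 s outlier
    local function service_time(i)
      if i == 42 then return 100 end   -- which request stalls is arbitrary
      return 1
    end

    for i = 0, 99 do
      local scheduled = i * interval
      local started   = math.max(scheduled, free_at)  -- the send may be delayed
      local finished  = started + service_time(i)
      free_at = finished
      samples[#samples + 1] = finished - scheduled    -- charged from the schedule
    end

    table.sort(samples)
    print(samples[99])   -- second-worst of 100 samples: ~100 s, not a couple of seconds

The key line is the one that records the sample: it is finished - scheduled rather than finished - started, so the time spent queued behind the stalled request is not silently dropped.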

(Azul Systems competes against HotSpot JVMs with garbage collectors tuned for throughput, which suffer from occasional long pauses; Azul sells a rather impressive JVM which largely avoids such long pauses, and needs to educate its customers on how to benchmark it properly.)

If you're using ab or httpperf, you should also take a look at wrk (https://github.com/wg/wrk). It can generate significant amount of load and has support for Lua scripting for request generation, response processing and reporting.

There is another fork of it (https://github.com/giltene/wrk2) which can be used if you want to test at a consistent RPS.

There's a fork from Gil Tene (Azul) that fixes wrk for the coordinated omission problem, which he explains in the readme of that repo[1].

[1] https://github.com/giltene/wrk2