Not only does 3x seem quite significant, but thread context switches are also a large overhead, and that goes untested here. If thread context switch overhead were as low as usermode context switching, there would be no use for coroutines, since you could just use threads instead; I doubt it's trivial.

(Of course, in Go, the scheduler also weaves in the GC, IIRC, so an apples-to-apples comparison may be difficult. Microbenchmarks are just not that useful.)

P.S.: this article seems to work under the assumption that 10,000 goroutines is a reasonable upper limit, or at least it implies as much. In practice, you can definitely run apps with 100,000 or even 1,000,000 goroutines.

The performance of the Go runtime with a million blocked goroutines is pretty OK, but its performance with even 1000 runnable goroutines is not great at all. You really need to think about which you are going to have.

True, but if you regularly have 1,000 runnable goroutines, couldn't you reconfigure your app to have, say, 64 runnable goroutines and get better throughput? Large numbers of goroutines do seem to be a good fit for problems that are mostly waiting on the network.

The architecture of Go forces you to have 1 goroutine servicing every socket, so the number of runnable goroutines will then be at the mercy of your packet inter-arrival process.

Go does not force you to do any of that: https://github.com/panjf2000/gnet (and, as you can see, it's as fast as the fastest C++ libs)

Go provides goroutines as a building block; it's up to you to use them or something else (reactor pattern, epoll, etc.)