At Cloudflare scale, I think I would want a proxy that sits at least partially in the network hardware.

I.e., so that the bytes of a large image or video coming from an origin server and being sent to a client never have to pass through system RAM or the CPU.

I'd probably implement this as an extension of the sendfile() API, so that a certain number of bytes can be copied from one socket to another, with that request going all the way down to the firmware of the network card, which would do the actual work.
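For reference, the closest thing I know of on Linux today is splice(): sendfile() itself won't take a socket as its input fd, but splice() can move bytes from one socket to another through a kernel pipe without a round trip through a userspace buffer. The CPU and RAM are still involved, so this is only the software analogue of the idea, not the NIC offload. A minimal sketch, assuming Linux and Python 3.10+ for os.splice; the names relay and demo are mine, and it falls back to a plain recv/send loop if splice isn't usable:

```python
import os
import socket


def relay(src: socket.socket, dst: socket.socket, nbytes: int,
          chunk: int = 65536) -> int:
    """Move up to nbytes from src to dst; return the number of bytes moved."""
    use_splice = hasattr(os, "splice")  # Linux-only, Python 3.10+
    pipe_r, pipe_w = os.pipe()          # kernel buffer between the two sockets
    moved = 0
    try:
        while moved < nbytes:
            want = min(chunk, nbytes - moved)
            if use_splice:
                try:
                    # Stage 1: socket -> pipe, entirely inside the kernel.
                    n = os.splice(src.fileno(), pipe_w, want)
                except OSError:
                    # Nothing was consumed; retry this chunk in userspace.
                    use_splice = False
                    continue
                if n == 0:              # peer closed the connection
                    break
                # Stage 2: pipe -> socket, draining everything just queued.
                pending = n
                while pending:
                    pending -= os.splice(pipe_r, dst.fileno(), pending)
            else:
                data = src.recv(want)   # ordinary userspace copy
                if not data:
                    break
                dst.sendall(data)
                n = len(data)
            moved += n
        return moved
    finally:
        os.close(pipe_r)
        os.close(pipe_w)


def demo(payload: bytes):
    """Pretend origin -> proxy -> client using two local socket pairs."""
    origin, proxy_in = socket.socketpair()
    proxy_out, client = socket.socketpair()
    try:
        origin.sendall(payload)
        moved = relay(proxy_in, proxy_out, len(payload))
        received = b""
        while len(received) < moved:
            part = client.recv(65536)
            if not part:
                break
            received += part
        return moved, received
    finally:
        for s in (origin, proxy_in, proxy_out, client):
            s.close()
```

Calling demo(bytes(range(256)) * 128) pushes 32 KiB through the relay. What I'm proposing is essentially the same call shape, except the request would be handed to the NIC instead of being serviced by the kernel.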

It probably needs to work with HTTP/3 (QUIC over UDP) and encryption too: decrypt the data from this socket, re-encrypt it with this other key, packetize it onto this HTTP/3 stream, and send it to that socket. The firmware would also need some way to handle aborts and timeouts and to kick a partially complete request back to software.
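To make the "kick it back to software" part concrete, here is a toy sketch of the handoff. Everything in it is hypothetical: simulated_nic_copy stands in for firmware that doesn't exist, abort_after fakes a timeout, and the encryption and QUIC packetization are left out entirely. The only point is that the hardware has to report how far it got so the software path can resume from that offset:

```python
from dataclasses import dataclass


@dataclass
class OffloadResult:
    bytes_done: int   # how many bytes the (simulated) NIC forwarded
    completed: bool   # False means the request was aborted or timed out


def simulated_nic_copy(payload: bytes, sink: bytearray,
                       abort_after: int) -> OffloadResult:
    """Hypothetical firmware stand-in: forwards bytes until a fake abort."""
    done = min(len(payload), abort_after)
    sink.extend(payload[:done])
    return OffloadResult(bytes_done=done, completed=(done == len(payload)))


def software_copy(payload: bytes, sink: bytearray, start: int) -> int:
    """Ordinary CPU path: forwards the remainder, starting at offset start."""
    sink.extend(payload[start:])
    return len(payload) - start


def proxy_copy(payload: bytes, sink: bytearray, abort_after: int) -> str:
    """Try the offload first; resume in software if it did not finish."""
    result = simulated_nic_copy(payload, sink, abort_after)
    if result.completed:
        return "hardware"
    software_copy(payload, sink, result.bytes_done)
    return "hardware+software"
```

With a 100-byte payload and abort_after=40, the simulated NIC forwards the first 40 bytes and the software path forwards the remaining 60. In a real design the resume point would also have to carry the cipher and stream state, which is where most of the complexity would live.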

Is it complex? Yes. But will the compute savings be worth it? At Cloudflare scale, I think the answer is yes.

We constantly think about the possibilities of using exotic hardware for acceleration. So far we've got a very, very long way with "commodity" hardware and the Linux kernel. One day, we'll probably do something exotic.

https://blog.cloudflare.com/cloudflares-gen-x-servers-for-an...

https://blog.cloudflare.com/tubular-fixing-the-socket-api-wi...

Have you tried io_uring? My impression is that these Rust libs end up making a different, less efficient set of syscalls.

There is an io_uring-backed runtime for Tokio (https://github.com/tokio-rs/tokio-uring), so once it's mature and battle-tested, it should be easier to transition to it, if Cloudflare aren't using it already.