When I first started using AWS a few years ago, having known generally what it was for far longer, I was flabbergasted at how slow it was to get an instance booted. I expected it to take much less time, thinking about things from first principles, even if you're literally talking about cold booting a physical machine via IPMI. But it seemed like everyone accepted that as the way it was, and now I do too. So I'm glad people are still interested in making things fast.
Right now I'm doing Postgres stuff (RDS) and dealing with taking 10+ minutes to boot a fresh instance. I'm tempted to try out fly.io and their Postgres clusters but I'm afraid I'd be spoiled and hate my life after (my job has me stuck in AWS for the interminable future).
I would be interested to know where all that time is being spent on the AWS side. To be a fly on the wall seeing their full, unfiltered logging and metrics.
Disclosure: I used to work on GCE.
EC2 has historically not focused much on instance boot time. We did for GCE and drove it down pretty heavily. The post here from fly has a good set of sequence diagrams for "what are the various phases of creating an instance from scratch" that are generally applicable.
I'll note though that different users have different targets. Some people care about "time from request to first instruction ticks over" while others only care about "time from request to ssh'able from the public internet". There's an interesting middle ground of "time from request to being able to talk to other services like GCS or S3".
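To make the "ssh'able" target concrete, here's a minimal sketch of how you might measure it from the outside: start a timer right before you issue the create request, then poll until a TCP connection to the SSH port succeeds. The function name, defaults, and hostname are all made up for illustration, and a successful TCP connect only shows that something is listening, not that sshd will actually let you log in.

```python
import socket
import time


def time_until_reachable(host, port, timeout=300.0, interval=0.5):
    """Poll host:port until a TCP connect succeeds; return elapsed seconds.

    Hypothetical helper: call it immediately after sending the
    create-instance request so the elapsed time covers the whole boot.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # A successful connect means something is accepting on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            time.sleep(interval)
```

Usage would be something like `time_until_reachable("my-instance.example.com", 22)`. The other two targets need different probes (a guest-side timestamp for "first instruction", a request to the storage service from inside the guest for the middle ground), which is part of why the numbers people quote are hard to compare.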
It's not clear to me what the networking / discovery story is for a Fly Machine that is stopped and then starts. That is, how long does fly-proxy take to update (globally? within a metro?) to add and remove the new Fly Machine? I vaguely recall that only external endpoints support IPv4, so I assume Fly is reserving and registering the internal IPv6 endpoints in the more expensive "create" step and then "start" is just about propagating liveness.
> I'll note though that different users have different targets. Some people care about "time from request to first instruction ticks over" while others only care about "time from request to ssh'able from the public internet".
This is the same target: a machine (that usually only has a single app on it) shouldn't take more time to boot than a general-purpose consumer PC/laptop.
The reason it takes so darn long to start in so many cases is just how horrendously overcomplicated the whole cloud setup is, internally and externally (sometimes for good reasons, sometimes because we don't know better, sometimes because it really is just overly complicated and overengineered).
> shouldn't take more time to boot than a general-purpose consumer PC/laptop.
That's an incredibly easy target. VMs can and should boot much faster than that - just look at the Firecracker hypervisor.
Even with KVM, if you replace systemd with something small and simple [0] (which you totally should, for single-app VMs), boot times of a couple of seconds are within reach.
I also remember clicking around Ling (Erlang on Xen, sadly no longer active [1]) where the whole VM could boot up, service the request, and shut down in less time than it takes a cloud to start spinning up an instance :)