Serious question: has anyone properly solved the problem of DNS as a single point of failure?

Depending on where you draw the line for "single point of failure", you could use multiple providers for your DNS.

GOV.UK, for example, uses both AWS and GCP for DNS.

So, NS entries pointing to both? But then say your domain is in Route 53 and AWS goes down. You can't reconfigure the NS entries to avoid the AWS DNS servers. Is the idea that child DNS servers detect the outage and cache the values from the name server(s) that remain up?

But then the cached values from AWS take a while to clear, and TTL never seems to be applied properly. It always feels like the worst case in such a scenario is that you can point everyone at the right thing within 24 hours.

corobo

Have them all hot and live rather than using any sort of failover system. Keep everything in sync with OctoDNS or similar.

https://github.com/octodns/octodns
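
To make "in sync" concrete, here's a rough sketch of the kind of check a sync tool has to do: compare the record sets held by two providers and report anything that differs. This is not OctoDNS's API (OctoDNS works from config files and provider plugins); `diff_records` and the sample records are made up for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical shape: (record name, record type) -> set of record values.
Zone = Dict[Tuple[str, str], Set[str]]


def diff_records(a: Zone, b: Zone) -> Dict[Tuple[str, str], Tuple[Optional[Set[str]], Optional[Set[str]]]]:
    """Return every (name, type) whose values differ between two providers.

    A value of None means the record is missing from that provider.
    """
    drift = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if va != vb:
            drift[key] = (va, vb)
    return drift


# Illustrative data only (documentation IP range, example.com names).
aws_zone = {
    ("www", "A"): {"203.0.113.10"},
    ("", "MX"): {"10 mail.example.com."},
}
gcp_zone = {
    ("www", "A"): {"203.0.113.10"},
}

drift = diff_records(aws_zone, gcp_zone)
```

If `drift` is non-empty, the two providers would hand out different answers depending on which one the resolver happened to ask, which is exactly what you want to avoid when both are live.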

DNS is fastest first* rather than main/failover. If AWS DNS were down, your GCP DNS would have replied (if all is well) sooner than {timeout}, so your visitor would still get a response.

* Sort of. I think if the client doesn't get a reply within 1s from the server it picked randomly, it moves on to the next server, repeating until one answers or all of them fail.
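
A toy simulation of that retry behaviour, as I understand it, in case it helps. Real resolvers differ in the details (ordering, timeouts, retries), so treat this as a sketch: `resolve`, `fake_query`, the nameserver names and the 1s default are all assumptions for illustration, and no real network traffic is involved.

```python
import random
from typing import Callable, Optional, Sequence


def resolve(
    nameservers: Sequence[str],
    query: Callable[[str, float], Optional[str]],
    timeout_s: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try nameservers in random order and return the first answer.

    `query` is a hypothetical callable that returns None when a server
    does not reply within `timeout_s`. Returns None if every server fails.
    """
    order = list(nameservers)
    (rng or random).shuffle(order)
    for ns in order:
        answer = query(ns, timeout_s)
        if answer is not None:
            return answer
    return None


def fake_query(ns: str, timeout_s: float) -> Optional[str]:
    """Simulate an AWS outage: AWS-hosted nameservers never reply."""
    if "awsdns" in ns:
        return None
    return "203.0.113.10"


# Illustrative nameserver names, one set per provider.
both = ["ns-1.awsdns-00.com", "ns-2.awsdns-00.net", "ns-cloud-a1.googledomains.com"]
answer_with_gcp = resolve(both, fake_query)
answer_aws_only = resolve(both[:2], fake_query)
```

With both providers in the NS set the lookup still succeeds whichever server gets picked first; the cost of the outage is at most a timeout per dead server tried before reaching a live one. With only the AWS servers listed, every attempt times out and the lookup fails.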