Even things like the use of ESC, and the overloading of ESC for the actual ESC key, as a signal for keycodes, and for the ALT/META key, are crazy.
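To make the overloading concrete, here is a toy Python classifier for input bytes. It is only a sketch: `classify` is a made-up helper, and it assumes xterm-style behavior where the Up arrow arrives as ESC [ A and Alt-x arrives as ESC x. Real programs usually fall back on a timeout to tell a lone ESC key from the start of a longer sequence.

```python
# Illustrative only: a toy classifier for bytes read from a terminal in one go.
ESC = b"\x1b"

def classify(chunk: bytes) -> str:
    """Guess what a chunk of input bytes means; the same ESC byte serves three roles."""
    if chunk == ESC:
        # Either the ESC key itself, or the first byte of a sequence still in flight.
        return "ESC key (or an incomplete sequence)"
    if chunk.startswith(ESC + b"[") and len(chunk) >= 3:
        # ESC [ ... is how arrow and function keys are commonly encoded.
        return "CSI sequence (e.g. an arrow or function key)"
    if chunk.startswith(ESC) and len(chunk) == 2:
        # Alt/Meta is often sent as ESC plus the key, which looks the same as
        # someone pressing ESC and then the key quickly.
        return "Alt/Meta + " + chr(chunk[1])
    return "plain input"

for sample in (b"\x1b", b"\x1b[A", b"\x1bx", b"a"):
    print(sample, "->", classify(sample))
```

Note that nothing in the bytes themselves says which reading is right; the classifier can only guess from length and timing.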
I've thought for a while that I should implement an extremely reduced set of escape sequences which use a non-ESC character (maybe SOH or something) for control sequences and key encoding. A minimal set -- color / cursor position / save-screen / query-size -- whatever it takes to make vim/emacs + tmux/screen work, and call it a day.
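As a rough illustration of what such a reduced encoding could look like, here is a Python sketch. Only the SOH introducer and the four operations come from the idea above; the one-letter command codes, the `;` argument separator, the LF terminator, and the `encode`/`decode` helpers are all invented for the example.

```python
SOH = b"\x01"  # introducer suggested above; every other detail here is made up

# Hypothetical command set: one letter per operation.
COMMANDS = {"color": b"c", "cursor": b"p", "save_screen": b"s", "query_size": b"q"}
NAMES = {letter: name for name, letter in COMMANDS.items()}

def encode(name: str, *args: int) -> bytes:
    """Build SOH <letter> <args joined by ';'> LF."""
    body = ";".join(str(a) for a in args).encode("ascii")
    return SOH + COMMANDS[name] + body + b"\n"

def decode(data: bytes):
    """Parse one encoded command back into (name, args)."""
    if not (data.startswith(SOH) and data.endswith(b"\n")):
        raise ValueError("not a control sequence")
    letter, body = data[1:2], data[2:-1].decode("ascii")
    args = tuple(int(a) for a in body.split(";")) if body else ()
    return NAMES[letter], args

print(encode("cursor", 10, 5))           # move-cursor command with two arguments
print(decode(encode("query_size")))      # round trip of an argument-less command
```

One wrinkle with SOH specifically: it is the byte Ctrl-A sends, so a real design would have to account for that on the input side.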
Those use the TERM/TERMCAP environment variables, plus some educated guessing from stdin/stdout/stderr behaviour, to figure out what terminal you're using, search for that entry in the terminfo/termcap database, and finally either invoke the native terminal behavior or simulate it with other terminal behavior.
In theory, one could "just implement" an alternative terminal, put the simplified entries in the terminfo/termcap database, and everything would be fine.
In practice, several of those ANSI escape code behaviors are hard-coded and everyone pretends everything is "at least as capable as xterm" (an oxymoron, given that xterm is one of the most capable and feature-rich terminals around), so you're also gonna have to implement a converter from "xterm ANSI" to your alternative terminal system, sort of like how winpty converts Win32 terminal conventions to ANSI ones.
Then we're back to the original problem, just inside this converter.
One problem is termcap/terminfo itself. The definition has expanded greatly over time, and documentation is poor to non-existent. In many cases termcap/terminfo has been implicitly extended to support ncurses specifically. And like you say, everyone just sort of assumes ANSI/xterm -- nobody has `PS1="$(tput setaf 3)\w $(tput setaf 0) $"`; you just hard-code the codes.
The last time I chased the rabbit down this hole I got mired in TTY-land and was not able to dig myself out. Too many insane layers, and before you know it you're writing code to emulate serial connections in a vain attempt to make things work without burning the whole world down.
Both (n)vim and tmux have internal TTY emulation as well, which makes things even crazier; I think vim uses libvterm and tmux has its own hand-coded one. It's a mess.
Consider a really simple case. You are using screen (or tmux, or emacs, or whatever) and you have your terminal divided in half so that you can use a program in one half and tail a log file in the other. The program sends a ^L to clear the screen. Screen (or tmux or whatever) must read the ^L and then _not_ send it on. If it sent it on, the whole screen would be cleared, not just the half with the program in it. Instead it has to send whatever control sequences would erase the text from just the right portion of the screen while leaving everything else alone. Virtually everything sent by the program(s) has to be intercepted and reinterpreted in order for it to work right.
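A stripped-down Python sketch of that one rewrite, assuming the pane is a plain rectangle and the outer terminal understands the standard ANSI cursor-position sequence (ESC [ row ; col H). The names `clear_pane` and `filter_output` are made up for the example, and a real multiplexer does far more than this (it keeps a full virtual screen per pane and redraws from it).

```python
FORM_FEED = b"\x0c"  # the byte Ctrl-L stands for

def clear_pane(top: int, left: int, height: int, width: int) -> bytes:
    """Return bytes that blank only the given rectangle (1-based coordinates)."""
    out = []
    for row in range(top, top + height):
        out.append(b"\x1b[%d;%dH" % (row, left))  # move to the start of this pane row
        out.append(b" " * width)                   # overwrite just the pane's columns
    out.append(b"\x1b[%d;%dH" % (top, left))       # leave the cursor at the pane origin
    return b"".join(out)

def filter_output(data: bytes, pane) -> bytes:
    """Swallow each form feed from the program and substitute a pane-local clear."""
    return data.replace(FORM_FEED, clear_pane(*pane))

# Hypothetical layout: the program owns a 2-row, 3-column pane starting at column 41.
pane = (1, 41, 2, 3)
print(filter_output(b"a\x0cb", pane))
```

Even in this toy version the ^L never reaches the real terminal; everything after it would still need its cursor movements translated into pane coordinates.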