I'd argue that modal behavior isn't a bad thing so long as those modes are very clearly distinguished.

It doesn't hurt if they're encountered frequently enough that the user / operator has a clear sense of their distinction.

vi/vim is a modal editor. There are two main ways of interacting with it: "insert" mode and "normal" (command) mode. In one you're editing your document; in the other you're operating on it. While this is frustrating for new users, with experience the modes become transparent, since you switch in and out of them all the time. The visual distinction between modes isn't always clear, though.
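To make the "same key, different meaning" idea concrete, here's a toy sketch in Python. It is not how vim is implemented; the `ModalBuffer` class and its `feed` method are names made up for illustration, and the `x` command is simplified to delete the last character because the toy has no cursor.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    INSERT = "insert"


class ModalBuffer:
    """Toy modal buffer (hypothetical, for illustration only)."""

    ESC = "\x1b"

    def __init__(self):
        self.mode = Mode.NORMAL  # like vi, start in normal mode
        self.text = []

    def feed(self, key):
        """Interpret one keystroke according to the current mode."""
        if self.mode is Mode.INSERT:
            if key == self.ESC:
                self.mode = Mode.NORMAL  # Esc leaves insert mode
            else:
                self.text.append(key)    # every other key is literal text
        else:
            if key == "i":
                self.mode = Mode.INSERT  # 'i' enters insert mode
            elif key == "x" and self.text:
                self.text.pop()          # simplified 'x': delete last char
            # all other keys are ignored in this toy normal mode

    def contents(self):
        return "".join(self.text)


buf = ModalBuffer()
for key in "ix" + ModalBuffer.ESC + "x":
    buf.feed(key)
# The first "x" was typed as text, the second "x" deleted it again.
```

The same `x` keystroke inserts a character in one mode and deletes one in the other, which is exactly why the current mode has to be obvious to the user.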

In a non-programming context, most sailboats have two primary operating modes: under sail, and under power. Here, the current mode is evident from a large number of cues in the environment, and how the boat should be skippered follows from them.

Other tools have many modes of operation -- the various apps on a smartphone/tablet effectively change the mode of the device. Is it a phone? A music player? A messaging tool? A web browser? The "mode" is specified by the application. Usually visual cues on the display indicate which specific mode is operative.

Everyone I've seen has trouble with this in the beginning. I'm not an advanced vim user, but would it really be impossible to make it modeless without losing too much power?

You could probably make an equally powerful modeless editor, but I don't see how it could be anywhere near vim-like. Maybe if you had a foot pedal to indicate which mode to interpret input sequences in.

Someone built such a foot pedal: https://github.com/alevchuk/vim-clutch