What I had in mind is some kind of gen_fsm or gen_statem process.

For stateless applications it is pretty obvious how supervision trees improve reliability. Essentially the only ‘state’ there is the request itself. In the worst case the client just retries.

But once some state is involved, it is not as simple.

Essentially, the answer to my question would probably be something like: “you should store the state in an external system, and design your system so that the stored state is always consistent. On failure, the supervisor respawns the process, which rebuilds what it needs from the saved state.”
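To make that concrete, here is a rough sketch of the idea in Python rather than Erlang. It is not how OTP supervisors are implemented; `CounterWorker`, `supervise`, `demo`, and the JSON file layout are all names I made up for illustration, and a local file stands in for the "external system". The worker persists each transition atomically before updating its in-memory state, so a restarted worker always finds a consistent snapshot.

```python
import json
import os
import tempfile


class CounterWorker:
    """Hypothetical stateful worker: each transition is persisted before it takes effect."""

    def __init__(self, path):
        self.path = path
        self.count = self._load()

    def _load(self):
        # Rebuild in-memory state from the last consistent snapshot, if one exists.
        try:
            with open(self.path) as f:
                return json.load(f)["count"]
        except FileNotFoundError:
            return 0

    def _save(self, count):
        # Write to a temp file, then rename, so the stored state is never half-written.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"count": count}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def handle(self, msg):
        if msg == "crash":
            raise RuntimeError("simulated bug")
        new_count = self.count + 1
        self._save(new_count)   # persist first
        self.count = new_count  # then update memory
        return self.count


def supervise(path, messages):
    """Toy supervisor: on a crash, discard the worker and start a fresh one."""
    worker = CounterWorker(path)
    restarts = 0
    for msg in messages:
        try:
            worker.handle(msg)
        except RuntimeError:
            restarts += 1
            worker = CounterWorker(path)  # the new worker reloads the saved state
    return worker.count, restarts


def demo():
    with tempfile.TemporaryDirectory() as d:
        return supervise(os.path.join(d, "state.json"), ["inc", "inc", "crash", "inc"])
```

Here `demo()` returns `(3, 1)`: the two increments before the crash survive the restart, and the message that triggered the crash is simply dropped, which matches the "client would just retry" assumption from the stateless case.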

You may find these interesting...

- "The Onion Layer Theory" https://learnyousomeerlang.com/building-applications-with-ot...

- "On Erlang, State and Crashes" http://jlouisramblings.blogspot.com/2010/11/on-erlang-state-...

- "Why Restarting Works" https://ferd.ca/the-zen-of-erlang.html (search for "Heisenbug")

> you should store the state in the external system

Disk works too, but if you're multi-node this means you now have a distributed database embedded in your system, which may or may not be your goal :)

RabbitMQ does this: they developed ra (https://github.com/rabbitmq/ra), a library for "persistent, fault-tolerant and replicated state machines" based on Raft.