If the current method turns out to actually lead to content with a persistent history and characters that can be regenerated consistently, then it will likely happen within 5 years.
But right now it is possible it is just a nice math trick that is already near its limits. GPT-3 and Stable Diffusion are great for developing a single moment or conversation, and that may even stretch to one scene, but beyond that it is extremely difficult to keep anything coherent.
There is a really exciting model that may mostly solve that, but it's too early to tell right now: https://github.com/rinongal/textual_inversion
For fun I tried to make an entire animated music video, but it took over a week of processing and basically lost coherence by the 30-second mark, so I only did one third.
Netflix UI of 2033: "I want an episode of Seinfeld with X, Y, Z..." Starting a brand-new (never aired), AI-generated episode now.
It should be noted that the official repo now also supports Stable Diffusion: https://github.com/rinongal/textual_inversion.
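For anyone who wants to play with this, here is a minimal sketch, not taken from the linked repo, of how a learned textual-inversion embedding can be reused with Stable Diffusion through Hugging Face diffusers. The base checkpoint, embedding file name, and placeholder token below are assumptions for illustration.

```python
# Minimal sketch: reuse a textual-inversion concept with Stable Diffusion via
# Hugging Face diffusers. Model ID, embedding file, and token are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Load the embedding learned by textual inversion; "<my-character>" is whatever
# placeholder token was chosen at training time (hypothetical name here).
pipe.load_textual_inversion("learned_embeds.bin", token="<my-character>")

# Reusing the same token across prompts is what keeps the character consistent
# from frame to frame, i.e. the "regenerate characters" piece discussed above.
image = pipe("<my-character> playing guitar on a rooftop at sunset").images[0]
image.save("frame_0001.png")
```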