I wonder how difficult it would be to make something similar that generated 3D models. Most of the examples look like they'd make good video game levels.

Well, I think there's enough interesting research out there to put the pieces in place. Not in a single model, but we have:

0. This neural thing, of course, which creates landscape-like 2D projections of a plausible scene.

1. Wave-function collapse models, which synthesize domain data quite nicely when parametrized with artistic care. The linked repo is a "simpler" example of the concept (see the sketch after this list for the core idea). https://github.com/mxgmn/WaveFunctionCollapse

2. A fairly good understanding of how to synthesize terrain. Terragen is a good example of this (it's not public research, but the images drive the point home nicely). https://planetside.co.uk/
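
To make the wave-function collapse idea concrete, here's a minimal Python sketch of the core loop, with a made-up tile set and adjacency rules. mxgmn's actual implementation instead samples patterns and their adjacencies from an example image, but the collapse-and-propagate mechanic is the same.

    import random

    # A minimal sketch of the wave-function collapse idea: each cell holds
    # a set of possible tiles, and we repeatedly collapse the cell with the
    # fewest options left, then propagate adjacency constraints outward.

    # Hypothetical tile set and adjacency rules, just for illustration.
    ALLOWED = {
        "water": {"water", "sand"},
        "sand":  {"water", "sand", "grass"},
        "grass": {"sand", "grass", "rock"},
        "rock":  {"grass", "rock"},
    }

    def neighbors(x, y, w, h):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= x + dx < w and 0 <= y + dy < h:
                yield x + dx, y + dy

    def collapse(w, h, seed=None):
        rng = random.Random(seed)
        grid = {(x, y): set(ALLOWED) for x in range(w) for y in range(h)}
        while any(len(opts) > 1 for opts in grid.values()):
            # Pick the undecided cell with the fewest remaining options
            # (the "lowest entropy" cell) and collapse it to one tile.
            cell = min((c for c in grid if len(grid[c]) > 1),
                       key=lambda c: len(grid[c]))
            grid[cell] = {rng.choice(sorted(grid[cell]))}
            # Propagate: prune neighbor options that no longer fit.
            stack = [cell]
            while stack:
                cur = stack.pop()
                allowed = set().union(*(ALLOWED[t] for t in grid[cur]))
                for n in neighbors(*cur, w, h):
                    pruned = grid[n] & allowed
                    if not pruned:
                        # Contradiction; a real implementation would
                        # backtrack or restart here.
                        raise RuntimeError("contradiction")
                    if pruned != grid[n]:
                        grid[n] = pruned
                        stack.append(n)
        return grid

    if __name__ == "__main__":
        g = collapse(8, 6, seed=42)
        for y in range(6):
            print(" ".join(next(iter(g[(x, y)]))[0] for x in range(8)))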

So we could take the source image from this model as a 2D projection of an intended landscape, feed it as a seed into a wave-function collapse model, and have that model use known terrain parametrization schemes to synthesize something usable (basically building a Terragen-equivalent model).
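
A rough sketch of how that chaining could look, with all the glue being hypothetical: quantize the generated projection into terrain classes, hand those to a constraint-synthesis stage (WFC or similar, stubbed out below), and map classes to elevation bands to get a heightmap. The class names, thresholds, and height ranges are assumptions for illustration, not anything Terragen or the linked repo actually expose.

    import random

    CLASS_HEIGHT = {  # assumed elevation bands per class, in meters
        "water": (0.0, 1.0),
        "sand":  (1.0, 5.0),
        "grass": (5.0, 60.0),
        "rock":  (60.0, 200.0),
    }

    def quantize(luma_grid):
        """Map grayscale values (0..255) from the source image to classes."""
        def cls(v):
            if v < 64:  return "water"
            if v < 96:  return "sand"
            if v < 192: return "grass"
            return "rock"
        return [[cls(v) for v in row] for row in luma_grid]

    def synthesize(classes):
        """Placeholder for the WFC stage: a real version would treat the
        quantized cells as pre-collapsed seeds and fill in consistent
        detail around them under adjacency constraints."""
        return classes

    def to_heightmap(classes, seed=None):
        rng = random.Random(seed)
        return [[rng.uniform(*CLASS_HEIGHT[c]) for c in row]
                for row in classes]

    if __name__ == "__main__":
        # Stand-in for the neural net's output: a tiny fake luminance grid.
        projection = [[30, 70, 120, 200],
                      [40, 90, 150, 210],
                      [50, 100, 180, 230]]
        heightmap = to_heightmap(synthesize(quantize(projection)), seed=1)
        for row in heightmap:
            print(" ".join(f"{h:7.1f}" for h in row))

All the interesting research is hiding in that synthesize() stub: making the filled-in detail both locally consistent and globally plausible is exactly where this stops being library glue.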

I think that's more or less plausible. But it's still a "research"-level problem, I think, not something one can cook up by chaining together the data flow from a few open-source libraries.