Is there a paper on how they're resynthesizing the phase components? IME neural networks are really bad at handling FFT phase, so separation models tend to use frequency masking or a learned filter bank.
There may be something useful for you here:
https://github.com/facebookresearch/demucs
https://github.com/sigsep/open-unmix-pytorch
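
In case it helps, here's a rough sketch of what I mean by frequency masking: you scale each frequency bin of the mixture by a value in [0, 1] and reuse the mixture's phase as-is, so the network never has to predict phase at all. This is a toy single-frame example with a hand-made "oracle" mask, not how either repo above actually does it; `dft`, `idft`, `apply_mask` and the test signals are all made up for illustration. In a real model the mask would come from the network and you'd use a proper STFT.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT, fine for a toy frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT; returns the real part, assuming a conjugate-symmetric spectrum."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def apply_mask(frame, mask):
    """Scale each bin by mask[k]; a real-valued mask changes magnitude only,
    so the mixture phase is reused untouched."""
    spec = dft(frame)
    return idft([m * c for m, c in zip(mask, spec)])

N = 64
# Two hypothetical "sources": a low tone (bin 3) and a quieter high tone (bin 20).
low = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]
high = [0.5 * math.sin(2 * math.pi * 20 * t / N) for t in range(N)]
mix = [a + b for a, b in zip(low, high)]

# Hand-made oracle mask (assumption: stands in for a network output).
# Keep bins below 10 and their mirrored negative-frequency bins.
mask = [1.0 if min(k, N - k) < 10 else 0.0 for k in range(N)]

est = apply_mask(mix, mask)
max_err = max(abs(e - l) for e, l in zip(est, low))
print(f"max abs error vs. low source: {max_err:.2e}")
```

It works cleanly here only because the two tones don't overlap in frequency. With real audio the sources overlap, and the reused mixture phase is exactly where the artifacts come from, hence my question about phase.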