I find this surprising, from the simplistic (and probably naive) view that images are 2D signals while music is 1D.

"Style transfer" also rarely works for object level transfer - it is more pattern based (high frequency content is often the "style" that is enhanced and transferred). Really nice transfers in practice sometimes require the object level content in the images to be similar, c.f. [0][1]. And all of this is coupled with really heavy human curation (people don't normally show their bad outputs)!

In music, the "style" is the content in some sense. For example, jazz has a very different "style" from classical at many levels (key and tempo choice, mode choice, melodic intervals, motifs, how often a motif repeats and how it varies, harmonization and chord choice, global structure such as AABA form), and it isn't easy to separate which pieces make it "jazz" and which don't (which factors of variation matter).

The equivalent in images would be replacing objects as well as texture, to form a new image that is reminiscent of the original but also novel at multiple scales. Think of the Simpsons "Last Supper" as the goal of a style transfer [2].

It is also hard because, as consumers, we are used to hearing high-quality versions of these types of "style transfer" for some styles all the time, and we even have a name for it ... "muzak".

[0] https://raw.githubusercontent.com/awentzonline/image-analogi...

[1] https://github.com/chuanli11/CNNMRF

[2] http://s267.photobucket.com/user/wiro_bucket/media/last%20su...