I remember that around 2004, before convnets became popular, there was a paper on image texture style transfer using approximate nearest neighbors based on some neighborhood of each point. This technique seems similar but for text.

Maybe 'Image Quilting for Texture Synthesis and Transfer', Efros and Freeman [0]?
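The neighborhood-matching idea translates to text almost directly. A toy sketch of the approach (Efros–Leung-style, synthesizing one symbol at a time by matching the preceding window against the sample — hypothetical simplified code, not the paper's actual implementation, which works on 2D pixel neighborhoods):

```python
import random

def synthesize(sample, length, window=3):
    # Grow the output one symbol at a time: find every position in the
    # sample whose `window` preceding symbols match the current context,
    # then copy the symbol that followed one of those matches.
    out = list(sample[:window])  # seed with the start of the sample
    while len(out) < length:
        context = out[-window:]
        candidates = [
            sample[i + window]
            for i in range(len(sample) - window)
            if list(sample[i:i + window]) == context
        ]
        if not candidates:  # no exact match: fall back to a random symbol
            candidates = list(sample)
        out.append(random.choice(candidates))
    return "".join(out)

random.seed(0)
print(synthesize("abcabdabcabd", 20))
```

Same trick as the image version, just with exact 1D context matching standing in for approximate nearest neighbors over 2D patches.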

There's some neural / patch-blend work from 2016 that I always thought was interesting (CNN-MRF) [1], and I think there's been a renaissance in those approaches recently (combined with other generators / prompts etc.). You can also argue ViT is "patch based" in a major sense... I'm still a big believer in patches + combination + warping (non-parametric synthesis) generally; there's some cool older work from Apple on that in speech land [2].

I'd go as far as arguing that BPE / wordpiece / sentencepiece / tokenizers in general are key to modern approaches (as word vocab selection was in the earlier days of NMT), because they find 'good enough' patches (tokens) for a higher-level model to stitch together while still leaving room for creativity / generalization... but in publications we often focus on the model details rather than the importance of the tokenizer (and the tokenizer's training distribution).
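The 'good enough patches' part is just greedy pair merging. A minimal BPE training sketch (toy code over a character sequence; real tokenizers train per-word with frequency counts and save the merge table, but the core loop is this):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe(text, num_merges):
    # Start from characters; repeatedly merge the most frequent pair.
    tokens = list(text)
    for _ in range(num_merges):
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

print(bpe("banana bandana", 3))
```

Each merge bakes a frequent 'patch' into the vocabulary, which is exactly the pre-stitching step the model then builds on.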

[0] http://people.eecs.berkeley.edu/~efros/research/quilting.htm...

[1] https://github.com/chuanli11/CNNMRF

[2] https://machinelearning.apple.com/research/siri-voices