I remember a paper from the early 2000s, before convnets became popular, on texture synthesis / style transfer that used approximate nearest neighbors over a small neighborhood of each pixel [0]. This technique seems similar, but for text.
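For anyone who hasn't seen it, the neighborhood-matching core is tiny. Here's a rough numpy sketch of the idea (Efros-Leung-style per-pixel synthesis, not the actual quilting algorithm; the function name, the causal mask, and the lack of boundary handling are my simplifications):

    import numpy as np

    def synthesize(texture, out_size=64, win=5, seed=0):
        # Grow an output image pixel by pixel: at each location, copy the
        # center pixel of whichever source neighborhood best matches the
        # already-filled (causal) part of the output neighborhood.
        rng = np.random.default_rng(seed)
        h, w = texture.shape          # texture: 2-D float array
        r = win // 2
        out = np.zeros((out_size, out_size), dtype=float)
        # seed the top-left corner with a random source patch
        y0 = rng.integers(0, h - win)
        x0 = rng.integers(0, w - win)
        out[:win, :win] = texture[y0:y0 + win, x0:x0 + win]
        # flatten every full win x win source neighborhood once
        cands = np.array([texture[i:i + win, j:j + win].ravel()
                          for i in range(h - win + 1)
                          for j in range(w - win + 1)], dtype=float)
        center = (win * win) // 2
        mask = np.zeros(win * win)
        mask[:center] = 1.0           # causal half: filled before the center
        for y in range(out_size):
            for x in range(out_size):
                if y < win and x < win:
                    continue          # part of the seed patch
                pad = np.pad(out, r)
                nb = pad[y:y + win, x:x + win].ravel()
                d = (cands - nb) ** 2 @ mask   # masked SSD to every candidate
                out[y, x] = cands[np.argmin(d), center]
        return out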
There are some neural / patch hybrids from 2016 that I always thought were interesting (CNN-MRF) [1], and I think those approaches are having a renaissance lately (combined with other generators, prompting, etc.). You can also argue ViT is "patch based" in a significant sense... I'm still a big believer in patches + combination + warping (non-parametric synthesis) generally; there's some cool older work from Apple on that in speech land [2].
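To make "patches + combination" concrete, here's a toy unit-selection sketch in the concatenative-synthesis spirit (nearest stored unit by crude features, blended with a linear crossfade; it skips the warping step, and all the names here are mine, not from the Apple work):

    import numpy as np

    def concat_units(units, targets, xfade=64):
        # Greedy unit selection: for each target snippet, pick the stored
        # unit whose crude features (mean, std) are closest, then stitch
        # it onto the output with a linear crossfade (overlap-add).
        feats = np.array([[u.mean(), u.std()] for u in units])
        out = np.zeros(0)
        for t in targets:
            q = np.array([t.mean(), t.std()])
            u = units[np.argmin(((feats - q) ** 2).sum(axis=1))]
            if out.size == 0:
                out = u.copy()
            else:
                fade = np.linspace(0.0, 1.0, xfade)
                out[-xfade:] = out[-xfade:] * (1 - fade) + u[:xfade] * fade
                out = np.concatenate([out, u[xfade:]])
        return out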
I'd go as far as arguing that BPE / wordpiece / sentencepiece / tokenizers in general are key to modern approaches (as word-vocabulary selection was in the earlier days of NMT), because they find "good enough" patches (tokens) for a higher-level model to stitch together while leaving some room for creativity / generalization... yet in publications we often focus on model details rather than on the importance of the tokenizer (and the distribution it was fit on).
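The "patch finding" step really is just greedy pair merging. A toy sketch of the classic BPE merge loop (Sennrich et al. style; not any particular library's implementation, and the word-end marker is the usual convention, not anything special):

    from collections import Counter

    def bpe_merges(corpus, n_merges=10):
        # Toy byte-pair encoding: repeatedly merge the most frequent
        # adjacent symbol pair, growing a vocabulary of "patches".
        words = Counter(tuple(w) + ('</w>',) for w in corpus.split())
        merges = []
        for _ in range(n_merges):
            pairs = Counter()
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            # re-segment every word with the new merge applied
            new_words = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return merges

    print(bpe_merges("low lower lowest low low", n_merges=5))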
[0] http://people.eecs.berkeley.edu/~efros/research/quilting.htm...