I remember that around 2004, before convnets became popular, there was a paper on image texture style transfer using approximate nearest neighbors based on some neighborhood of each point. This technique seems similar but for text.

Maybe 'Image Quilting for Texture Synthesis and Transfer', Efros and Freeman [0]?
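The neighborhood-matching idea translates to text almost directly. A toy sketch of the approach (Efros–Leung-style, synthesizing one symbol at a time by matching the preceding window against the sample — hypothetical simplified code, not the paper's actual implementation, which works on 2D pixel neighborhoods):

```python
import random

def synthesize(sample, length, window=3):
    # Grow the output one symbol at a time: find every position in the
    # sample whose `window` preceding symbols match the current context,
    # then copy the symbol that followed one of those matches.
    out = list(sample[:window])  # seed with the start of the sample
    while len(out) < length:
        context = out[-window:]
        candidates = [
            sample[i + window]
            for i in range(len(sample) - window)
            if list(sample[i:i + window]) == context
        ]
        if not candidates:  # no exact match: fall back to a random symbol
            candidates = list(sample)
        out.append(random.choice(candidates))
    return "".join(out)

random.seed(0)
print(synthesize("abcabdabcabd", 20))
```

Same trick as the image version, just with exact 1D context matching standing in for approximate nearest neighbors over 2D patches.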

There's some neural / patch-blend work from 2016 that I always thought was interesting (CNN-MRF) [1], and I think there's been a renaissance in those approaches recently (combined with other generators / prompts etc.). You can also argue ViT is "patch based" in a major sense... I'm still a big believer in patches + combination + warping (non-parametric synthesis) generally; there's some cool older work from Apple on that in speech land [2].

I'd go as far as arguing that BPE / wordpiece / sentencepiece / tokenizers in general are key to modern approaches (as word vocab selection was in the earlier days of NMT), because they find 'good enough' patches (tokens) for a higher-level model to stitch together while still leaving room for creativity / generalization... but in publications we often focus on the model details rather than the importance of the tokenizer (and the tokenizer's training distribution).
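The 'good enough patches' part is just greedy pair merging. A minimal BPE training sketch (toy code over a character sequence; real tokenizers train per-word with frequency counts and save the merge table, but the core loop is this):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe(text, num_merges):
    # Start from characters; repeatedly merge the most frequent pair.
    tokens = list(text)
    for _ in range(num_merges):
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

print(bpe("banana bandana", 3))
```

Each merge bakes a frequent 'patch' into the vocabulary, which is exactly the pre-stitching step the model then builds on.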

[0] http://people.eecs.berkeley.edu/~efros/research/quilting.htm...

[1] https://github.com/chuanli11/CNNMRF

[2] https://machinelearning.apple.com/research/siri-voices