If it can be inferred from words alone, try Latent Dirichlet Allocation (e.g. with http://radimrehurek.com/gensim/) to generate tags; a minimal sketch follows the links below. Some sources:

* http://blog.echen.me/2011/08/22/introduction-to-latent-diric...

* http://alexperrier.github.io/jekyll/update/2015/09/04/topic-...

* http://engineering.flipboard.com/2017/02/storyclustering
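For a concrete starting point, here's a minimal sketch of that pipeline with gensim 4.x; the toy documents and num_topics=2 are placeholders, not recommendations:

    # Fit LDA on a toy corpus and print the top words per topic,
    # which can serve as candidate tags.
    from gensim import corpora, models

    docs = [
        "the cat sat on the mat".split(),
        "dogs and cats make good pets".split(),
        "the stock market fell sharply today".split(),
    ]

    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)
    for topic_id, words in lda.show_topics(num_words=3, formatted=False):
        print(topic_id, [word for word, _ in words])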

Alternatively, if you already know the tags and just want to see which of them are similar to each other, methods like word2vec should help (see the sketch after the link below):

* http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
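One hedged way to do that with gensim: treat each document's tag list as a "sentence", train word2vec on those lists, and query for nearest tags. The tag lists here are invented, and note that gensim 4.x renamed the old size parameter to vector_size:

    # Learn tag embeddings from tag co-occurrence, then find similar tags.
    from gensim.models import Word2Vec

    tag_lists = [
        ["python", "machine-learning", "nlp"],
        ["python", "web", "flask"],
        ["nlp", "machine-learning", "deep-learning"],
    ]

    model = Word2Vec(tag_lists, vector_size=50, window=5, min_count=1, sg=1)
    print(model.wv.most_similar("nlp", topn=3))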

I think the OP might have been looking for code that already exists, not just techniques for how to build it. But I have some opinions on these techniques.

LDA is not very controllable. It gives you a fixed number of topic clusters, and running it again on the same data produces different clusters.

It can't give you documents that are similar to a particular document you ask about, except to say that whatever documents happen to fall into the same cluster on that run are similar.

There are indeed up-to-date things you can do with word vectors -- though I'm sad that the tutorials always point to word2vec or GloVe, as if it's 2014. That's ancient in machine learning terms, and we now know of flaws in their outputs, such as [1].
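As a rough illustration of the kind of flaw [1] documents, you can probe analogy queries against the widely mirrored GoogleNews word2vec vectors with gensim. The file path is an assumption here, and the finding in [1] is that queries like this tend to surface stereotyped completions:

    # Probe an analogy of the form "man : doctor :: woman : ?" on
    # pretrained word2vec vectors (path is assumed; adjust to your copy).
    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)
    print(vecs.most_similar(positive=["doctor", "woman"],
                            negative=["man"], topn=5))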

If you want downloadable, pre-computed word vectors, ConceptNet Numberbatch [2] (part of the ConceptNet project that I develop) is the best in class right now, and you can download it in the same format as these older systems. And if you don't believe me tooting my own horn, at least use something else that's been updated in the last year, such as NASARI, or maybe fastText's precomputed English vectors.
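Since Numberbatch ships in that same word2vec text format, loading it with gensim is straightforward; the file name below is an assumption, so check the releases page for the current one:

    # Load precomputed Numberbatch vectors and query for neighbors.
    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("numberbatch-en.txt.gz",
                                             binary=False)
    print(vecs.most_similar("jazz", topn=5))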

Or you can just compare "bags of words", which is still good enough for many applications and is readily available out of the box. I believe this is what you get from a "More Like This" plugin for a search engine such as Lucene.
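As a sketch of the bag-of-words approach (not Lucene's actual MoreLikeThis internals), TF-IDF plus cosine similarity with scikit-learn covers the common case:

    # Rank documents by similarity to docs[0] using TF-IDF vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a dog chased the cat",
        "stocks fell sharply today",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf).ravel()

    # Skip index 0 (the document itself) and list the rest by similarity.
    for i in sims.argsort()[::-1][1:]:
        print(i, round(float(sims[i]), 3))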

[1] https://arxiv.org/abs/1607.06520

[2] https://github.com/commonsense/conceptnet-numberbatch