What does HackerNews think of conceptnet-numberbatch?

Language: Python

I'm using those vectors; the latest version is from 2019:

https://github.com/commonsense/conceptnet-numberbatch

I guess the data used to build those vectors doesn't contain many occurrences of those two words in relation to each other.

Anyway, that's a downside of the word-vector idea. There will always be some word pairs that we humans consider more or less related than the vectors do.

I've tried to find the best model. It's different from what Semantle uses (Google's word2vec) and from what Contexto uses (GloVe). But there are still probably many word pairs that could match better.

Hello,

Thanks for posting about Enlinko. I'm its author; I published it 24 days ago:

https://news.ycombinator.com/item?id=35630451

The domain for this game was created 9 days ago, so I think someone was heavily inspired by my idea.

I understand that anyone can make a game with the same idea, but I'm a bit sad that Enlinko didn't get the same traction on HN as this game.

As for relatedness, my game uses semantic vectors from this model: https://github.com/commonsense/conceptnet-numberbatch

The game uses a set of semantic vectors from ConceptNet to calculate the relatedness between two words.

You can read more about them and use them yourself here - https://github.com/commonsense/conceptnet-numberbatch
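As a rough sketch of what that relatedness computation looks like, here's cosine similarity over toy 4-dimensional vectors (the real Numberbatch embeddings are 300-dimensional and keyed like `/c/en/cat`; the words and values below are made up for illustration):

```python
import math

# Toy embeddings standing in for real Numberbatch rows.
vectors = {
    "cat": [0.5, 0.1, 0.8, 0.2],
    "dog": [0.4, 0.2, 0.7, 0.3],
    "car": [0.9, 0.8, 0.1, 0.0],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relatedness(w1, w2):
    return cosine(vectors[w1], vectors[w2])
```

With real vectors, `relatedness("cat", "dog")` would come out much higher than `relatedness("cat", "car")`, which is the score these guessing games surface to the player.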

For those who aren't aware, @rspeer has been taking this problem seriously for years.

His ConceptNet Numberbatch embeddings[1] are one of the few pre-built releases that attempt to fix this.

[1] https://github.com/commonsense/conceptnet-numberbatch

I'm going to say this is probably state of the art among methods that learn from scratch from a corpus of text. I say "probably" because they skipped most of the usual word-similarity evaluations, only using RW.

The RW results are still lower than any results for ConceptNet Numberbatch [1] in the case where you are able to use a knowledge graph. This is still an advance -- from my point of view, these vectors are an improved input for Numberbatch -- but it continues to surprise me how the possibility of learning from a knowledge graph is not even mentioned when the big players write about these evaluations.

It also pains me that the only analogies evaluated are Mikolov et al. (2013), which is a huge huge case of Not Invented Here just because Mikolov is the first author. It is not a good evaluation [2]. It is the same 10 or so boring analogies over and over. I would much rather see BATS [3] or SemEval-2012, or even Turney's set of real SAT analogies despite their non-free status. But this would require Mikolov to admit that he did not come up with the perfect semantic evaluation off the top of his head, on his first try, without referring to any of the work already done by Turney and others.

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] https://www.aclweb.org/anthology/N/N16/N16-2002.pdf

[3] https://aclweb.org/aclwiki/Bigger_analogy_test_set_(State_of...

Look, let's talk about replacing word2vec because it's old and not that good on its own. Everyone making word vectors, including this article, compares them to word2vec as a baseline because beating word2vec on evaluations is so easy. It's from 2013 and machine learning moves fast these days.

You can replace pre-trained word2vec in 12 languages (with aligned vectors!) with ConceptNet Numberbatch [1]. You can be sure it's better because of the SemEval 2017 results where it came out on top in 4 out of 5 languages and 15 out of 15 aligned language pairs [2]. (You will not find word2vec in this evaluation because it would have done poorly.)

If you want to bring your own corpus, at least update your training method to something like fastText [3], though I recommend looking at how to improve it using ConceptNet anyway, because distributional semantics alone will not get you to the state of the art.

Also: what pre-built word2vec are you using that actually contains valid word associations in many languages? Something trained on just Wikipedia? Al-Rfou's Polyglot? Have you ever actually tested it?

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] http://nlp.arizona.edu/SemEval-2017/pdf/SemEval002.pdf

[3] https://fasttext.cc/

I think the OP might have been looking for code that already exists, not just techniques for how to build it. But I have some opinions on these techniques.

LDA is not very controllable. It gives you a set number of clusters. Run it again and you get different clusters.

It can't give you documents that are similar to a particular document that you asked for, except to say that all the documents that happen to be in a cluster at the time are similar.

There are indeed up-to-date things you can do with word vectors -- though I'm sad that the tutorials always point to word2vec or GloVe, as if it's 2014. That's ancient in machine learning terms, and we now know of flaws in their outputs, such as [1].

If you want downloadable, pre-computed word vectors, ConceptNet Numberbatch [2] (part of the ConceptNet project that I develop) is the best in class right now, and you can download it in the same format as these older systems. And if you don't believe me tooting my own horn, at least use something else that's been updated in the last year, such as NASARI, or maybe fastText's precomputed English vectors.

Or you can just compare "bags of words", which is still good enough for many applications and readily available out of the box. I believe this is what you get with a "More Like This" plugin for a search engine such as Lucene, for example.
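A minimal version of that bag-of-words comparison might look like the sketch below (function names are my own; a real search engine like Lucene layers analysis and TF-IDF weighting on top of this idea):

```python
import math
from collections import Counter

def bag_of_words(text):
    # Naive tokenization: lowercase and split on whitespace.
    return Counter(text.lower().split())

def similarity(a, b):
    # Cosine similarity between two word-count bags.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat lay on the mat")
d3 = bag_of_words("stock prices fell sharply")
```

Here `similarity(d1, d2)` is high because the documents share most words, while `similarity(d1, d3)` is zero; that's the whole trick behind a basic "More Like This" feature.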

[1] https://arxiv.org/abs/1607.06520

[2] https://github.com/commonsense/conceptnet-numberbatch

Thank you for your work on ConceptNet. It's the best public knowledge graph in existence.

Just today I was using the multilingual Conceptnet-numberbatch word vectors[1], which would not be possible without your work.

To your point though: you can use Amazon S3 as a seed for BitTorrent downloads, which might help some and reduce what you pay. See [2].

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.htm...

Cool. I'm going to be trying this out and comparing to ConceptNet Numberbatch [1] (the current state-of-the-art multilingual vectors). I'd like to see how they compare on existing evaluations of word similarity.

EDIT: never mind the comparison, I just got the part where the monolingual performance should be the same as fastText's. Which I don't think is a great thing. The precomputed fastText vectors aren't very good in most languages because most Wikipedias are small. Even the Japanese fastText vectors perform only slightly better than chance on Sakaizawa's evaluation of Japanese word similarity [2].
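For context, word-similarity evaluations like Sakaizawa's typically report Spearman's rank correlation between the model's similarity scores and human ratings. A self-contained sketch of that statistic (function names are mine; in practice you'd just use `scipy.stats.spearmanr`):

```python
def rank(values):
    # Assign ranks 1..n, averaging ranks over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for this tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Pearson correlation computed on the ranks.
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

"Slightly better than chance" means this correlation comes out barely above zero when `xs` are the model's cosine similarities and `ys` are the human judgments.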

I would expect that, having data from both Wikipedia and Google Translate, you should be able to make a system that's monolingually much better than fastText on Wikipedia alone. That's what I was hoping to compare to. Don't limit yourself to the performance of fastText's data.

I also observe that the data files could be made much, much smaller. Every value is written out as a double-precision number in decimal. Most of these digits convey no information, and they're effectively random, limiting the effectiveness of compression. These could all be rounded to about 4 digits after the decimal point with no loss of performance.
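Concretely, the trimming being suggested could look like this (the helper name is hypothetical; the point is that 4 decimal places preserve all the precision that matters):

```python
def compact(value, places=4):
    # Round to a few decimals and strip trailing zeros:
    # 0.123456789012345 -> "0.1235", 0.25 -> "0.25"
    return f"{round(value, places):.{places}f}".rstrip("0").rstrip(".")

full = "0.123456789012345 -0.987654321098765 0.25"
trimmed = " ".join(compact(float(v)) for v in full.split())
```

Besides being shorter outright, the trimmed values stop being effectively random in their low digits, so a general-purpose compressor like gzip gets much more out of them.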

[1] https://github.com/commonsense/conceptnet-numberbatch

[2] https://github.com/tmu-nlp/JapaneseWordSimilarityDataset

The graph algorithm they're describing is basically Manaal Faruqui's "retrofitting", although they don't cite him. I will make the charitable assumption that they came up with it independently (and managed to miss the excitement about it at NAACL 2015).

Here's why not to be sad: Retrofitting and its variants are quite effective, and surprisingly, they're not really that computationally intensive. I use an extension of retrofitting to build ConceptNet Numberbatch [1], which is built from open data and is the best-performing semantic model on many word-relatedness benchmarks.
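The core retrofitting update is simple enough to sketch in a few lines. This is a simplified version of Faruqui et al.'s iterative scheme, not the exact extension Numberbatch uses, and the names and toy data are mine: each word's vector is repeatedly pulled toward the average of its knowledge-graph neighbors while staying anchored to its original distributional vector.

```python
def retrofit(vectors, graph, iterations=10, alpha=1.0, beta=1.0):
    # alpha weights fidelity to the original vector,
    # beta weights agreement with each graph neighbor.
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in graph.items():
            nbrs = [n for n in neighbors if n in new]
            if not nbrs:
                continue
            for d in range(len(new[word])):
                neighbor_sum = sum(new[n][d] for n in nbrs)
                new[word][d] = (alpha * vectors[word][d] + beta * neighbor_sum) \
                               / (alpha + beta * len(nbrs))
    return new

vecs = {"cat": [1.0, 0.0], "feline": [0.9, 0.1], "car": [0.0, 1.0]}
edges = {"cat": ["feline"]}  # toy knowledge-graph edge
out = retrofit(vecs, edges)
```

Each pass is a cheap weighted average over the graph's edges, which is why the whole thing is far less computationally intensive than retraining embeddings from a corpus.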

The entirety of ConceptNet, including Numberbatch, builds in 5 hours on a desktop computer.

Big companies have resources, but they also have inertia. Some problems are solved by throwing all the resources of Google at the problem, but that doesn't mean it's the only way to solve the problem.

[1] https://github.com/commonsense/conceptnet-numberbatch