What does HackerNews think of fastText?

Library for fast text representation and classification.


Was working in NLP from about 2014-2020. At the start, NB was indeed generally the best performing baseline model you could use, and we would train it for every task. However, we came to realise that Facebook's FastText model[1] was almost always the better choice as soon as your training set was more than a few hundred samples. Some advantages:

- accuracy generally a few points better than NB

- better generalisability, because the word embeddings act as a bottleneck on expressiveness, whereas NB or logistic regression essentially model all words / bigrams as independent

- trained with cross-entropy, meaning that model scores can be used more effectively as a 'confidence'. For spam, say, if you want a rule like "if prediction score > X, then filter", Naive Bayes is not ideal: the 'naive' independence assumption leaves its scores very poorly calibrated (it tends to give extremely high or low confidence scores, and this gets worse with document length).

- is completely linear (or at least log-linear like NB), so explainability is super simple.
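The calibration point about Naive Bayes can be made concrete: because NB multiplies per-word likelihoods as if every word were independent, the log-odds grow linearly with document length, pushing scores to the extremes. A toy sketch (all numbers invented for illustration):

```python
import math

# Toy illustration: under Naive Bayes, per-word log-likelihood ratios are
# summed as if every word were independent, so the total log-odds grows
# linearly with document length and the posterior saturates to 0 or 1.
def nb_posterior(log_odds_per_word, doc_len, prior_log_odds=0.0):
    total_log_odds = prior_log_odds + log_odds_per_word * doc_len
    return 1.0 / (1.0 + math.exp(-total_log_odds))  # sigmoid of log-odds

# Even a weak per-word signal (log-odds 0.2) looks near-certain at length 100.
print(round(nb_posterior(0.2, 5), 3))    # 5-word doc: ~0.731
print(round(nb_posterior(0.2, 100), 3))  # 100-word doc: ~1.0
```

This is why a fixed threshold on NB scores behaves very differently for short and long documents, while a cross-entropy-trained linear model degrades more gracefully.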

disclaimer: I haven't really thought about NLP for about 3 years so there may be something better than this now

[1] https://github.com/facebookresearch/fastText
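For reference, fastText's supervised mode expects one example per line with a `__label__` prefix, and training and prediction run from the documented CLI (file names here are illustrative):

```shell
# train.txt: one example per line, label prefixed with __label__, e.g.
#   __label__spam  win a free prize now
#   __label__ham   lunch at noon tomorrow?

# Train a supervised classifier (word bigrams often help accuracy).
fasttext supervised -input train.txt -output model -epoch 25 -wordNgrams 2

# Print the top label and its probability for new text read from stdin.
echo "free prize inside" | fasttext predict-prob model.bin - 1
```

The probabilities from `predict-prob` come from a softmax over a cross-entropy-trained model, which is what makes thresholding on them more reasonable than thresholding NB scores.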

Word2Vec and bag-of-words/tf-idf are somewhat obsolete in 2018 for modeling. For classification tasks, fasttext (https://github.com/facebookresearch/fastText) performs better and faster.

Fasttext is also available in the popular NLP Python library gensim, with a good demo notebook: https://radimrehurek.com/gensim/models/fasttext.html

And of course, if you have a GPU, recurrent neural networks (or other deep learning architectures) are the endgame for the remaining 10% of problems (a good example is SpaCy's DL implementation: https://spacy.io/). Or use those libraries to incorporate fasttext for text encoding, which has worked well in my use cases.

https://github.com/facebookresearch/fastText this is Facebook's super-efficient word2vec-like implementation. I thought people might find it interesting.
Also, word2vec models are super fast and work great. The text has no convincing argument for why not to use them, unless you don't want to learn basic neural nets. Even then, just use Facebook's fastText: https://github.com/facebookresearch/fastText
Link to original HN submission: https://news.ycombinator.com/item?id=14337275

It's worth noting for future reference that in terms of supervised learning of labels given a text document input, fasttext (https://github.com/facebookresearch/fastText) is leagues ahead of conventional approaches in both accuracy and training speed, and there is a Python interface (https://github.com/salestock/fastText.py) for use with Django/Flask (unfortunately, recent fasttext changes have broken the interface for now).

Great podcast.

The world owes a big THANK YOU to Tomáš Mikolov, one of the creators of Word2Vec[0] and fastText[1], and also to Radim Řehůřek, the interviewer, who is the creator of gensim[2].

The number of software developers and researchers in industry and academia who rely on the work of these two individuals is large and growing every day.

[0] https://code.google.com/p/word2vec/

[1] https://github.com/facebookresearch/fastText

[2] https://radimrehurek.com/gensim/

It's worth noting that fasttext, which was made in part by the original word2vec authors, can handle as many cores as you throw at it.

https://github.com/facebookresearch/fastText

The baseline I'd like to see this compared to is the not-very-deep-learning "bag of tricks" that's conveniently implemented in fastText [1].

[1] https://github.com/facebookresearch/fastText

This was posted in 2015, but now that fastText (https://github.com/facebookresearch/fastText), just released by Facebook, can scale to Instagram-sized datasets and create word vectors better than word2vec's by taking subword information into account (https://arxiv.org/pdf/1607.04606v1.pdf), this type of analysis will only improve in the future.
Link to source code: https://github.com/facebookresearch/fastText
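The subword trick from that paper is simple to sketch: each word is wrapped in boundary markers and decomposed into character n-grams, and the word's vector is the sum of its n-gram embeddings, so even unseen words get representations. A minimal sketch of the n-gram extraction (function name is mine):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract boundary-marked character n-grams, as described in the
    fastText subword paper: the word is wrapped in '<' and '>', all
    n-grams of length n_min..n_max are taken, and the wrapped word
    itself is also kept as a feature."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    grams.append(wrapped)
    return grams

# Trigrams of 'where' (the example used in the paper):
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

Because `'her'` in `<where>` is distinct from the standalone word `<her>`, the boundary markers let the model separate prefixes and suffixes from whole words.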

The big result here is the 15,000x speedup compared to a neural network, a gap that widens as the size of the dataset increases. But this doesn't mean neural networks are worthless. From the paper:

Although deep neural networks have in theory much higher representational power than shallow models, it is not clear if simple text classification problems such as sentiment analysis are the right ones to evaluate them.