What does HackerNews think of sentencepiece?
Unsupervised text tokenizer for Neural Network-based text generation.
I'm highly skeptical that you have enough data to pretrain if you don't have enough data to fine-tune.
Fine-tuning + vector search + prompting with as much relevant context as you can fit, on an LLM like PaLM 2 or GPT-4, is what I would do (roughly sketched below); otherwise you can use Falcon 40B, of course.
maybe I should charge for this ahah
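For concreteness, here is a rough Python sketch of the vector-search-plus-prompting part. Everything in it is a placeholder: embed() stands in for whatever embedding model you pick, the documents and query are invented, and retrieval is plain cosine similarity over unit vectors.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice, call a real embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

documents = ["Refund policy: 30 days...", "Shipping takes 3-5 business days..."]
doc_vecs = np.stack([embed(d) for d in documents])

query = "How long do refunds take?"
scores = doc_vecs @ embed(query)                 # cosine similarity (unit vectors)
top = [documents[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(prompt)                                    # send this to the LLM of your choice
```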
For comparison, the Stable Diffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, unused experiments, etc., which can make it hard to follow what is going on.
Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very, very slow because I focused on getting it to run, not on making it fast. The tokenizer uses some Google thing, https://github.com/google/sentencepiece, but luckily for inference it seems you only need to parse the tokenizer model file, not understand how it was created; I was able to strip the protobuf definitions out of that repository, add them to the Rust project, and read the tokens.
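For comparison, here is a minimal Python sketch using the official sentencepiece bindings (not the Rust port described above) showing the inference-only workflow: load the serialized model file and encode/decode text. The path "tokenizer.model" is a placeholder for e.g. a LLaMA tokenizer model.

```python
import sentencepiece as spm

# Load a pretrained tokenizer model; none of the training machinery is needed.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # placeholder path

text = "Hello, world!"
ids = sp.encode(text, out_type=int)      # integer token ids fed to the model
pieces = sp.encode(text, out_type=str)   # human-readable subword pieces

print(pieces)          # e.g. ['▁Hello', ',', '▁world', '!'] (depends on the model)
print(sp.decode(ids))  # round-trips back to "Hello, world!"
```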
I am optimistic that someone will make a high-quality CPU or CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
It's worth clarifying what is being accomplished here. iOS is handling speech recognition, and Shortcuts is handling task execution with an unspecified and presumably long user script. What GPT does here is convert text instructions into JSON formatted slot filling[1] responses.
It's somewhat amazing that GPT is emitting valid JSON, but I guess it's seen enough JSON in the training set to understand the grammar, and we shouldn't be too surprised it can learn a formal grammar if it can learn multiple human languages. Slot filling is a well-studied topic, and with the very limited vocabulary of slots, it doesn't have as many ways to go wrong as commercial voice assistants. I would be way more amazed if this were able to generate Shortcuts code directly, but I don't think that's allowed.
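To illustrate the slot-filling setup (the actual prompt and slot schema used in the Shortcut aren't shown, so the schema below is invented), the model only has to emit JSON with a handful of known keys, which the script can validate before acting on it:

```python
import json

# Hypothetical slot schema, for illustration only.
REQUIRED_SLOTS = {"intent", "device", "time"}

model_output = '{"intent": "set_alarm", "device": "bedroom", "time": "07:30"}'

slots = json.loads(model_output)        # raises json.JSONDecodeError on invalid JSON
missing = REQUIRED_SLOTS - slots.keys()
if missing:
    raise ValueError(f"model omitted slots: {missing}")

print(slots["intent"], slots["time"])   # set_alarm 07:30
```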
> some of its output (eg future time stamps) will not be present in its training data. Even with several billion parameters that seems impossible
Maybe this is a combination of attention, which lets each token look back at context to shape its own prediction, and special token segmentation[2] for dates (a sketch of the latter follows the links below)?
[1]: http://nlpprogress.com/english/intent_detection_slot_filling...
[2]: https://github.com/google/sentencepiece
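As a sketch of the "special token segmentation" idea from [2]: sentencepiece lets you reserve user-defined symbols that are always emitted as single tokens. Whether GPT's tokenizer does anything like this for dates is pure speculation; this only shows the mechanism, and 'corpus.txt' is a placeholder training file.

```python
import sentencepiece as spm

# Train a small model that treats <date> and <time> as indivisible symbols.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder corpus
    model_prefix="demo",
    vocab_size=4000,
    user_defined_symbols=["<date>", "<time>"],  # never split into sub-pieces
)

sp = spm.SentencePieceProcessor(model_file="demo.model")
print(sp.encode("Remind me at <time> on <date>", out_type=str))
# <date> and <time> each come out as exactly one piece.
```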
You can replace a vocabulary of 500K words with 50K subword ngrams, and it also works on unseen words and on languages with heavy compounding such as German. It's interesting that it can both join frequent words together and split infrequent words into pieces, depending on the distribution of characters. Another advantage is that the ngram embedding table is much smaller, making it easier to deploy on resource-constrained systems such as mobile phones; a minimal sketch follows the references below.
Neural Machine Translation of Rare Words with Subword Units
https://arxiv.org/abs/1508.07909
A Python library for BPE ngrams: sentencepiece
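A minimal sketch of that idea using the sentencepiece Python library in BPE mode (the corpus file and vocabulary size are placeholders; a real setup would train on a large corpus):

```python
import sentencepiece as spm

# Learn ~50K subword pieces instead of carrying a 500K-word vocabulary.
spm.SentencePieceTrainer.train(
    input="german_corpus.txt",   # placeholder corpus
    model_prefix="bpe_de",
    model_type="bpe",            # byte-pair encoding, as in the paper above
    vocab_size=50000,
)

sp = spm.SentencePieceProcessor(model_file="bpe_de.model")

# A compound word never seen verbatim in training is still segmented into
# known pieces instead of becoming an out-of-vocabulary token.
print(sp.encode("Donaudampfschifffahrtsgesellschaft", out_type=str))
```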