What does HackerNews think of sentencepiece?

Unsupervised text tokenizer for Neural Network-based text generation.

Language: C++

Pre-training and fine-tuning use the exact same method of next-token prediction; the difference is in the quantity of data you have (and whether the model is already pre-trained). You need to train the model on around 1 trillion tokens (https://platform.openai.com/tokenizer https://github.com/google/sentencepiece) for it to develop reasoning capabilities, and it seems very unlikely that your data amounts to that much.

I'm highly skeptical that you have enough data to pretrain if you don't have enough data to fine tune.

Fine-tuning + vector search + prompting with as much context as you can fit, on an LLM like PaLM 2 or GPT-4, is what I would do. Otherwise you can use Falcon 40B, of course.

maybe I should charge for this ahah

If you are interested in implementing LLaMA yourself or learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background knowledge to understand what you are reading, but I was pleasantly surprised.

For comparison, the StableDiffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, unused experiments, etc., which can make it hard to follow what is going on.

Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very, very slow because I focused on getting it to run, not on making it fast. The tokenizer uses a Google library, https://github.com/google/sentencepiece, but luckily for inference it seems you only need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip the protobuf files out of that repository, add them to my Rust project, and read the tokens.
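To illustrate the point that inference only needs the piece inventory from the model file, here is a toy sketch (in Python rather than Rust, and using greedy longest-match instead of the scored segmentation sentencepiece actually performs; the vocabulary below is made up):

```python
def greedy_tokenize(text, vocab):
    """Toy greedy longest-match tokenizer over a fixed piece vocabulary.

    Real sentencepiece scores segmentations (e.g. Viterbi over unigram
    probabilities); this sketch only shows that once you have the piece
    list out of the model file, you can tokenize without knowing how
    the pieces were learned.
    """
    # sentencepiece marks word boundaries with U+2581 ("▁"); mimic that.
    text = "\u2581" + text.replace(" ", "\u2581")
    pieces = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no piece matched this character
            i += 1
    return pieces

vocab = {"\u2581hello", "\u2581wor", "ld"}
print(greedy_tokenize("hello world", vocab))
```

This prints `['▁hello', '▁wor', 'ld']` for the toy vocabulary above.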

I am optimistic that someone will make a high-quality CPU or some CPU+GPU+SSD combination that will make it somewhat practical to run even the large LLM models without needing an A100 or two.

> Can somebody help me to understand how a LLM has achieved this capability.

It's worth clarifying what is being accomplished here. iOS is handling speech recognition, and Shortcuts is handling task execution with an unspecified and presumably long user script. What GPT does here is convert text instructions into JSON formatted slot filling[1] responses.

It's somewhat amazing that GPT is emitting valid JSON, but I guess it has seen enough JSON in its training set to understand the grammar, and we shouldn't be too surprised it can learn formal grammars if it can learn multiple human languages. Slot filling is a well-studied topic, and with the very limited vocabulary of slots, it doesn't have as many ways to go wrong as commercial voice assistants do. I would be far more amazed if this could generate Shortcuts code directly, but I don't think that's allowed.

> some of its output (eg future time stamps) will not be present in its training data. Even with several billion parameters that seems impossible

Maybe this is a feature of attention, which lets each token look back to modify its own prediction, and special token segmentation[2] for dates?

[1]: http://nlpprogress.com/english/intent_detection_slot_filling... [2]: https://github.com/google/sentencepiece
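A minimal sketch of the slot-filling setup described above: the model's JSON reply is parsed and filtered against a small closed set of slots, which is why there is so little room for it to go wrong. The slot names and reply here are hypothetical, not from the original post:

```python
import json

# Hypothetical slot schema for a home-automation shortcut;
# these names are made up for illustration.
ALLOWED_SLOTS = {"action", "device", "time"}

def parse_slots(model_output):
    """Parse an LLM's JSON reply and keep only the recognized slots.

    Anything outside the tiny closed vocabulary of slots is dropped,
    so a stray key the model invents cannot reach the executing script.
    """
    data = json.loads(model_output)
    return {k: v for k, v in data.items() if k in ALLOWED_SLOTS}

reply = '{"action": "turn_on", "device": "lights", "mood": "cozy"}'
print(parse_slots(reply))  # the unknown "mood" slot is dropped
```

In a real pipeline you would also catch `json.JSONDecodeError` and re-prompt, since nothing forces the model to emit valid JSON every time.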

Haven't read the paper, but they are probably using something like sentencepiece with sub-word splitting and then charging by the number of resulting tokens.

https://github.com/google/sentencepiece

You can also speed up the loading of embeddings by using BPE (byte pair encoding) to segment words into a smaller dictionary of character ngrams, and learning ngram embeddings instead of word embeddings.

You can replace a list of 500K words with 50K ngrams, and it also works on unseen words and on agglutinative languages such as German. It's interesting that it can both join frequent words together and split infrequent words into pieces, depending on the distribution of characters. Another advantage is that the ngram embedding table is much smaller, making it easy to deploy on resource-constrained systems such as mobile phones.
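The merge-learning step behind BPE can be sketched in a few lines: start from characters, repeatedly count adjacent symbol pairs weighted by word frequency, and merge the most frequent pair. This is a simplified sketch of the algorithm from the Sennrich et al. paper, with a made-up toy word-count dictionary:

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge rules from a {word: count} dict (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Toy corpus: "we" is the most frequent pair, so it gets merged first.
print(learn_bpe({"lower": 5, "lowest": 3, "newer": 6}, 4))
```

Frequent words end up as single symbols after enough merges, while rare words stay split into smaller pieces, which is exactly the joining/splitting behavior described above.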

Neural Machine Translation of Rare Words with Subword Units

https://arxiv.org/abs/1508.07909

A Python library for BPE ngrams: sentencepiece

https://github.com/google/sentencepiece