What does it cost to train a 6.7B transformer from scratch? Leaving aside data preparation, since that would be highly variable. Is this realistically possible for mere mortals? How long until it becomes a national pastime?
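
For a rough sense of scale, here's a back-of-envelope estimate using the common C ≈ 6·N·D FLOPs rule of thumb; the GPU throughput, utilization, and price figures below are assumptions, not quotes:

    # Back-of-envelope pre-training cost via the common C ~= 6*N*D rule of thumb.
    # All hardware and price numbers are assumptions, not quotes.
    N = 6.7e9            # parameters
    D = 1e12             # training tokens (the 1T figure discussed downthread)

    flops = 6 * N * D    # ~4.0e22 FLOPs total

    peak = 312e12        # assumed A100 bf16 peak, FLOP/s
    mfu = 0.4            # assumed model FLOPs utilization (40% is optimistic)
    price = 2.0          # assumed $/GPU-hour for rented cloud A100s

    gpu_hours = flops / (peak * mfu) / 3600
    print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price:,.0f}")
    # -> roughly 90,000 GPU-hours, ~$180,000 under these assumptions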

fine-tuning is cheap, pre-training is expensive & hard

Sure, but fine-tuning has limits.

no, you should never pre-train your own LLM unless you have $100k+ to spare. You should only fine-tune. There's no reason you can't just fine-tune with whatever data you have.
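
If you do go the fine-tuning route, a minimal sketch with Hugging Face transformers looks something like this; the base model name and data file are placeholders, not recommendations:

    # Minimal causal-LM fine-tuning sketch with Hugging Face transformers.
    # Model name and dataset path are placeholders.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    model_name = "EleutherAI/gpt-neo-1.3B"   # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    ds = load_dataset("text", data_files={"train": "company_docs.txt"})  # placeholder path
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds["train"],
        # mlm=False -> plain next-token prediction, the same objective as pre-training
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()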

I have a huge company-internal dataset with domain-specific knowledge. Are you saying I can just fine-tune an existing model with that data and be fine?

That was exactly our initial idea, but from everything I learned while trying it, this is a dead-end approach. From my understanding, the consensus seems to be that fine-tuning works well to alter or restrict behaviour, but very badly to teach additional knowledge. You can fine-tune a generic base model into a generic chatbot, but not into a domain expert.

That also seems to be the reason people still use vector databases for large domain-knowledge datasets. I'm aware the vector database approach has its own pros and cons, but if fine-tuning on the whole content were possible, we would certainly use it in addition.
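
A minimal sketch of that vector-search approach, using sentence-transformers and FAISS (the embedding model name and documents are placeholders):

    # Minimal vector-search sketch: embed chunks, index them, retrieve by similarity.
    import faiss
    from sentence_transformers import SentenceTransformer

    docs = ["internal doc chunk 1 ...", "internal doc chunk 2 ..."]  # your chunked corpus
    model = SentenceTransformer("all-MiniLM-L6-v2")                  # placeholder model

    emb = model.encode(docs, normalize_embeddings=True)  # unit vectors -> inner product = cosine
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    query = model.encode(["how do we configure the frobnicator?"],  # hypothetical query
                         normalize_embeddings=True)
    scores, ids = index.search(query, 2)   # top-2 most similar chunks
    print([docs[i] for i in ids[0]])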

I'm not an expert, so I'd appreciate any comments, hints, pointers and corrections if I'm mistaken in my understanding.

And my original question still stands: $100k is not a lot for a company; surely it must cost more than that?

pre-training and fine-tuning use the exact same method of next-token prediction. the difference is the quantity of data (& whether the model starts from a pre-trained checkpoint). you need to train the model on roughly 1 trillion tokens (https://platform.openai.com/tokenizer, https://github.com/google/sentencepiece) anyway for it to develop reasoning capabilities, and it seems very unlikely your data is anywhere near that much.
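
a quick way to check how far your corpus is from that, using tiktoken here (sentencepiece works the same way); the file list is a placeholder:

    # Count corpus tokens and compare against the ~1T needed for pre-training.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    total = 0
    for path in ["docs1.txt", "docs2.txt"]:   # placeholder file list
        with open(path, encoding="utf-8") as f:
            total += len(enc.encode(f.read()))

    print(f"{total:,} tokens = {total / 1e12:.6%} of 1 trillion")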

I'm highly skeptical that you have enough data to pre-train if you don't even have enough data to fine-tune.

fine-tuning + vector search + prompting with as much stuff as you can fit, on an LLM like PaLM 2 or GPT-4, is what I would do. otherwise you can use Falcon 40B ofc.
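
roughly, the "vector search + prompting" half fits together like this; build_prompt is a hypothetical helper and the template is just one way to do it:

    # Stitch retrieved chunks into the prompt ("as much stuff as you can fit").
    # `chunks` would come back ranked from the vector search above.
    def build_prompt(question: str, chunks: list[str], budget_chars: int = 8000) -> str:
        context, used = [], 0
        for c in chunks:
            if used + len(c) > budget_chars:  # crude stand-in for a real token budget
                break
            context.append(c)
            used += len(c)
        return ("Answer using only the context below.\n\n"
                + "\n---\n".join(context)
                + f"\n\nQuestion: {question}\nAnswer:")

    # prompt = build_prompt("how do we configure the frobnicator?", retrieved_chunks)
    # ...then send `prompt` to whatever LLM you're using (PaLM 2, GPT-4, Falcon 40B, ...)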

maybe I should charge for this ahah