At the time of writing, the cost estimate for a 70B multimodal model trained on 7T tokens with 1000 H100 GPUs is $18,461,354, with 184 days of training time.
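For reference, that figure is consistent with just multiplying GPU-hours by an hourly rental rate; backing the rate out of the quoted total gives roughly $4.18 per H100-hour, which is in the ballpark of on-demand cloud pricing (the per-hour rate here is inferred from the numbers above, not from the original estimate's methodology):

    # Sanity check: back out the implied hourly rate from the quoted estimate.
    gpus = 1000
    days = 184
    gpu_hours = gpus * days * 24       # 4,416,000 GPU-hours
    total_cost = 18_461_354            # quoted estimate, USD
    rate = total_cost / gpu_hours      # ~$4.18 per H100-hour
    print(f"{gpu_hours:,} GPU-hours at ${rate:.2f}/hour")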
Is anyone willing to share an estimate of how the cost will come down each year as hardware keeps improving and new methodologies are found?
Personally, I would not be surprised if it were possible to train on the same dataset for half the cost 12 months from now.
Are there big reasons the training can't be done SETI@home style? You could even pay people for the use of their graphics cards and run the training multiple times on different machines to make sure the results weren't being gamed.
There's Petals[0], but the problem seems to be that the entire model needs to fit in VRAM and can't easily be split up across consumer devices over the internet.
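Even if the model is sharded layer-wise across hosts (which is how Petals serves inference), training over the internet runs into activation traffic: every token's activations and gradients have to cross every split point. A rough back-of-envelope, with the hidden size, precision, and number of splits all assumed for illustration:

    # Rough traffic estimate for internet-scale pipeline parallelism.
    # Assumptions: hidden size 8192 (Llama-2-70B-like), fp16, and one
    # forward activation + one backward gradient per token per split.
    hidden = 8192
    bytes_per_token = hidden * 2 * 2   # fp16 activation + gradient, ~32 KB
    tokens = 7e12                      # dataset size from the estimate above
    splits = 9                         # e.g. model sharded across 10 hosts
    total_bytes = tokens * bytes_per_token * splits
    print(f"~{total_bytes / 1e18:.1f} exabytes crossing the network")

Around two exabytes of transfer over consumer uplinks, before any redundancy for verifying results, which suggests the bottleneck is bandwidth and latency rather than VRAM alone.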