Has anyone used or is currently using 7B models in a production or commercial product? How was the performance? What kind of tasks were you using it for? Was it practical to use the small 7B model for your specific use case, or did you switch to OpenAI models or 30-70B open source models?
I'm using a mix of 7B and 13B models that have been fine-tuned with LoRA for specific tasks, and they work fantastically depending on the task at hand _after fine-tuning_. Without fine-tuning they're generally kind of garbage in my experience, though I haven't tested the base models directly on tasks beyond the statistics at the start of a training run.

As for performance, I'm generally seeing 40-50 tokens/sec per model on a Tesla-family Nvidia GPU, but I keep multiple models loaded and active at a time, so that estimate is probably a bit low for overall throughput. (I also just realized, thanks to this question, that our monitoring doesn't have any cumulative GPU token-rate metrics hahah.)

An anecdote others may find interesting... I'm rate-limiting the output of our streaming API to 8 tokens/sec to artificially smooth out front-end requests. Interactive users will wait for, and even prefer, seeing the response stream in, and non-interactive users tend to base their performance expectations on what the streaming API does. It's kind of sneaky, but I'm also artificially slowing down those API requests.
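For anyone curious what that kind of smoothing looks like, here's a minimal sketch of a token-rate limiter wrapped around a streaming generator. This is illustrative only (the function name and 8 tok/s default are mine, not their production code):

```python
import time

def rate_limited(token_iter, tokens_per_sec=8.0):
    """Yield tokens from token_iter no faster than tokens_per_sec.

    Illustrative sketch of smoothing a streaming API response:
    each token is released on a fixed schedule so the front-end
    sees a steady stream instead of bursty chunks.
    """
    interval = 1.0 / tokens_per_sec
    next_emit = time.monotonic()
    for tok in token_iter:
        now = time.monotonic()
        if now < next_emit:
            # Sleep just long enough to hold the target rate.
            time.sleep(next_emit - now)
        next_emit = max(now, next_emit) + interval
        yield tok
```

In a real server you'd apply this per connection (e.g. inside an async response generator), but the pacing logic is the same.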

We're looking into fine-tuning and using 7B and 13B models, and while we understand most of the mechanics, we're somewhat overwhelmed by the number of options available and unsure where to start. Do you recommend any open-source frameworks for fine-tuning and running models? Additionally, are you open to and available for consulting in this area?
I appreciate the offer, but I'm a bit underwater with the amount I have on my plate right now. We're using a custom in-house solution for all of our training and hosting, and it can definitely be daunting to get that far.

I'm not sure how experienced you are in the field, but there are roughly two levels of fine-tuning: full fine-tuning and parameter-efficient fine-tuning. Full fine-tuning updates all the weights of the model and usually requires 2-3x the memory needed for inference. This allows you to change and update the knowledge contained inside the model.

If the model already has sufficient understanding of the content of the task and you want to change how it responds (a specific output format, a "personality" or "flavor" of output, or having it already know the kind of task it's performing without including those details in the prompt), I would go with parameter-efficient fine-tuning, such as LoRA.
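The reason parameter-efficient fine-tuning is so much cheaper is the low-rank trick behind LoRA: instead of updating the full weight matrix W (d x k values), you train two small matrices B (d x r) and A (r x k) with r much smaller than d and k, and the effective weight becomes W + (alpha / r) * B @ A. A tiny self-contained sketch of that math (toy matrices, naive multiply, purely illustrative):

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha=16, r=2):
    """Effective weight after a LoRA update: W + (alpha / r) * B @ A.

    W stays frozen (d x k); only B (d x r) and A (r x k) are trained,
    so trainable parameters scale as r * (d + k) instead of d * k.
    """
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

For a 4096x4096 layer at rank 8 that's roughly 65k trainable values instead of ~16.8M, which is why LoRA fits on hardware that full fine-tuning never would.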

If you're looking to do a one-off training run for a model, you might be able to get away with something like this: https://github.com/oobabooga/text-generation-webui It's a very easy project to use, but it really doesn't give you the kind of metrics, analysis, or professional-grade hosting you'll want.

vLLM can help with the hosting and is really solid once you have the models fine-tuned. We tried it at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.

Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your language of choice. A good resource for seeing how people do this is actually the open-source models published on Hugging Face. Look for LoRA models, or fine-tunes similar to what you'd like; a lot of people publish their training code and datasets on GitHub, which can be very useful references.
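To make "do it directly in PyTorch" concrete, here's a minimal from-scratch sketch of the LoRA idea on a single linear layer: freeze the base weights, add a trainable low-rank adapter, and optimize only the adapter. This is a toy illustration, not the Hugging Face `peft` library (which is what you'd likely reach for in practice), and the class name and hyperparameters are mine:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is small random init, B starts at zero so the adapter is a no-op
        # at step 0 and training moves smoothly away from the base model.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy training loop: fit only the adapter to some random targets.
torch.manual_seed(0)
base = nn.Linear(16, 16)
layer = LoRALinear(base)
opt = torch.optim.Adam([p for p in layer.parameters() if p.requires_grad], lr=1e-2)
X = torch.randn(64, 16)
Y = torch.randn(64, 16)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(layer(X), Y)
    loss.backward()
    opt.step()
```

A real fine-tune wraps every attention/MLP projection in the transformer this way and streams batches from your dataset, but the freeze-then-adapt pattern is the same.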

Right now I'd recommend Llama 2 as a base model for most general language-model tasks, as long as you don't cross their commercial-use threshold (which is very, very generous).

Hope this helps!