As for performance, I'm generally seeing 40-50 tokens/sec per model on a Tesla-family Nvidia GPU, but I keep multiple models loaded and active at a time, so that estimate is probably a bit low for overall throughput. (I also just realized, thanks to this question, that our monitoring doesn't have any cumulative GPU token-rate metrics, hah.)
Interesting anecdote others may find useful: I rate-limit the output of our streaming API to 8 tokens/sec to artificially smooth out front-end requests. Interactive users will wait for, and even prefer, seeing the response stream in, and non-interactive users tend to base their performance expectations on what the streaming API does. It's a bit sneaky, but I'm also artificially slowing down those API requests.
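A minimal sketch of that kind of throttle in plain Python (the function name and interface are my own illustration, not our actual code): wrap the model's token stream in a generator that sleeps just enough to hold a target rate.

```python
import time
from typing import Iterable, Iterator


def throttle_tokens(tokens: Iterable[str], rate: float = 8.0) -> Iterator[str]:
    """Yield tokens from `tokens` no faster than `rate` tokens per second."""
    interval = 1.0 / rate
    next_time = time.monotonic()
    for tok in tokens:
        now = time.monotonic()
        if now < next_time:
            # We're ahead of schedule; sleep off the difference.
            time.sleep(next_time - now)
        next_time = max(now, next_time) + interval
        yield tok
```

The first token goes out immediately, so time-to-first-token stays snappy while the overall stream is paced.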
I'm not sure how experienced you are in the field, but there are roughly two levels of fine-tuning. The first is full fine-tuning: you update all the weights of the model, which usually requires 2-3x the memory needed for inference. This lets you change and update the knowledge contained inside the model.
The second is parameter-efficient fine-tuning (PEFT). If the model already has a sufficient understanding of the content of the task and you just want to change how it responds (a specific output format, a "personality" or "flavor" of output, or having it already know what kind of task it's performing without including those details in the prompt), I would go with parameter-efficient fine-tuning.
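To give a sense of why PEFT is so much cheaper, here's a back-of-the-envelope comparison of trainable parameter counts for one weight matrix, full update vs. a LoRA adapter. The dimensions are illustrative guesses at a 7B-class projection layer, not exact figures for any specific model.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple:
    """Trainable parameters for a full update vs. a LoRA adapter
    on a single d_in x d_out weight matrix."""
    full = d_in * d_out               # every weight is trainable
    lora = rank * (d_in + d_out)      # two low-rank factors: (d_in x r) and (r x d_out)
    return full, lora


# Illustrative dimensions only:
full, lora = lora_param_counts(4096, 11008, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

At rank 16 the adapter is well under 1% of the matrix's parameters, which is why you can fine-tune this way on hardware that could never hold full-fine-tuning optimizer state.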
If you're looking to do a one-off training run, you might be able to get away with something like this: https://github.com/oobabooga/text-generation-webui It's a very easy project to use, but it really doesn't give you the kind of metrics, analysis, or professional-grade hosting you'll want.
vLLM can help with the hosting once you have the models fine-tuned, and it's really solid. We tried it at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.
Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your language of choice. A good resource for seeing how people do this is actually the open-source models published on Hugging Face: look for LoRA models or fine-tunes similar to what you'd like to do. A lot of people publish their training code and datasets on GitHub, which can be very useful references.
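To make the LoRA idea concrete before you dive into PyTorch, here's a toy, dependency-free sketch (my own illustration, not code from any published fine-tune). The "pretrained" weight is frozen and gradient descent only ever touches the two adapter factors, which is the whole trick.

```python
import random

random.seed(0)

# Toy 1-D "model": y = (w_frozen + a * b) * x
# w_frozen plays the role of the pretrained weight (never updated);
# a and b play the role of the two low-rank LoRA factors, the only
# trainable parameters. Mirroring LoRA's init, one factor starts
# random and the other at zero, so training begins at the base model.
w_frozen = 1.0
a = random.uniform(-0.5, 0.5)
b = 0.0
lr = 0.05

xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ys = [3.0 * x for x in xs]   # target task the frozen weight alone can't fit

for _ in range(2000):
    grad_a = grad_b = 0.0
    for x, y in zip(xs, ys):
        err = (w_frozen + a * b) * x - y
        # Hand-derived mean-squared-error gradients w.r.t. a and b only;
        # w_frozen gets no gradient at all.
        grad_a += 2 * err * b * x / len(xs)
        grad_b += 2 * err * a * x / len(xs)
    a -= lr * grad_a
    b -= lr * grad_b

# The adapter product a*b converges toward 2.0, closing the gap
# between the frozen weight (1.0) and the target slope (3.0).
print(round(a * b, 3))
```

In real PyTorch code the same effect comes from setting `requires_grad = False` on the base model's parameters and passing only the adapter parameters to the optimizer.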
Right now I'd recommend Llama 2 as a base model for most general language-model tasks, as long as you don't cross their commercial-use threshold (which is very, very generous).
Hope this helps!