In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating from general benchmarks.

Out of the box, the library supports OpenAI, Anthropic, Google, Llama and CodeLlama, any model on Replicate, any model on Ollama, and more. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
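
If it helps to see the shape of this kind of "bring your own data" comparison independent of any particular tool, here's a rough Python sketch (not promptfoo's own config format, which is YAML-based) that sends the same hand-written test cases to a couple of providers and prints the outputs side by side. The model names, ports, and local endpoints are assumptions for illustration:

```python
# Minimal sketch: run your own prompts against several OpenAI-compatible
# endpoints and collect the outputs side by side.
# Model names, ports, and API keys below are assumptions -- adjust to your setup.
from openai import OpenAI

# Each entry: (label, client, model). Ollama exposes an OpenAI-compatible
# /v1 endpoint, so the same client library covers both providers.
candidates = [
    ("gpt-4o-mini (OpenAI)", OpenAI(), "gpt-4o-mini"),
    ("llama3 (Ollama)", OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"), "llama3"),
]

# Your own examples, not a generic benchmark set.
test_cases = [
    "Summarize our refund policy in one sentence.",
    "Translate 'out of stock' into German.",
]

for prompt in test_cases:
    print(f"\n=== {prompt}")
    for label, client, model in candidates:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"[{label}] {resp.choices[0].message.content}")
```

The point of the side-by-side loop is that the judgment call ("which output is acceptable for my use case?") stays with you and your data, rather than with a leaderboard.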

ChainForge has similar functionality for comparing prompts and models: https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI
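
In practice, "GPT-compatible" means the local server accepts the same /v1/chat/completions request shape as the OpenAI API, so existing clients work by just swapping the base URL. A hedged sketch against an assumed local LocalAI instance (the port and model name are placeholders):

```python
# A GPT-compatible HTTP API is just a POST to /v1/chat/completions with the
# usual OpenAI request body. Port 8080 and the model name are assumptions --
# match them to whatever your LocalAI instance actually serves.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "ggml-gpt4all-j",  # whichever model LocalAI has loaded
        "messages": [{"role": "user", "content": "Say hello in one word."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```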

Is it necessary to have an HTTP API for each model in a comparative study?