What test cases do folks here recommend for measuring this new model's ability to reason? Specifically, can it reason about code with performance similar to (or better than!) GPT-4? Has anyone managed to get it running locally?

OpenAI has been collecting a ton of evals here: https://github.com/openai/evals. Many of them include notes on how well GPT-4 does vs GPT-3.5.
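Each eval in that repo is registered as a YAML definition plus a JSONL file of samples. If I remember the format right, one line of a samples.jsonl looks roughly like this (field names from the repo; the arithmetic example itself is made up):

    {"input": [{"role": "system", "content": "Answer with just the result."},
               {"role": "user", "content": "What is 2 + 2?"}],
     "ideal": "4"}

"input" is a standard chat-completions message list and "ideal" is the expected answer (a string, or a list of acceptable strings), which is what makes the samples easy to replay against any chat-style API.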

You could clone that repo, adapt the oaieval script to run against different APIs, and then run the same evals against both models and compare the results.
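For the local-model side, a quick way is to point the same samples at anything exposing an OpenAI-compatible /v1/chat/completions endpoint (several local-inference servers do). Here's a rough, untested sketch of a minimal exact-match harness; the URL, model name, and samples path are placeholders, not the real oaieval plumbing:

    import json
    import requests

    API_URL = "http://localhost:8080/v1/chat/completions"  # placeholder: your local server
    MODEL = "local-model"                                  # placeholder model name

    def complete(messages):
        # Send one chat request to an OpenAI-compatible endpoint.
        resp = requests.post(API_URL, json={
            "model": MODEL,
            "messages": messages,
            "temperature": 0,  # deterministic-ish output for scoring
        })
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()

    def run_eval(samples_path):
        correct = total = 0
        with open(samples_path) as f:
            for line in f:
                sample = json.loads(line)
                answer = complete(sample["input"])
                ideal = sample["ideal"]
                # "ideal" may be a single string or a list of acceptable answers
                if isinstance(ideal, str):
                    ideal = [ideal]
                correct += any(answer == i for i in ideal)
                total += 1
        print(f"accuracy: {correct}/{total} = {correct / total:.2%}")

    # placeholder path: point this at a samples.jsonl from the cloned repo,
    # e.g. somewhere under evals/registry/data/
    run_eval("samples.jsonl")

Run the same file through this and through the OpenAI API and you get a crude apples-to-apples number. The real oaieval CLI adds a lot on top (caching, fuzzy matching, model-graded evals), so this is only a starting point for quick comparisons.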