What test cases do folks here recommend for measuring this new model's ability to reason? Specifically, can it reason about code with performance similar to (or better than!) GPT-4? Has anyone managed to get it running locally?

OpenAI has been collecting a ton of evals here: https://github.com/openai/evals. Many of them include notes on how well GPT-4 does vs GPT-3.5.
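Each eval in that repo is registered as a YAML definition plus a JSONL file of samples. If I remember the format right, one line of a samples.jsonl looks roughly like this (field names from the repo; the arithmetic example itself is made up):

    {"input": [{"role": "system", "content": "Answer with just the result."},
               {"role": "user", "content": "What is 2 + 2?"}],
     "ideal": "4"}

"input" is a standard chat-completions message list and "ideal" is the expected answer (a string, or a list of acceptable strings), which is what makes the samples easy to replay against any chat-style API.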

You could clone that repo, adapt the oaieval script to run against different APIs, and then run the same evals against both models and compare the results.
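For the local-model side, a quick way is to point the same samples at anything exposing an OpenAI-compatible /v1/chat/completions endpoint (several local-inference servers do). Here's a rough, untested sketch of a minimal exact-match harness; the URL, model name, and samples path are placeholders, not the real oaieval plumbing:

    import json
    import requests

    API_URL = "http://localhost:8080/v1/chat/completions"  # placeholder: your local server
    MODEL = "local-model"                                  # placeholder model name

    def complete(messages):
        # Send one chat request to an OpenAI-compatible endpoint.
        resp = requests.post(API_URL, json={
            "model": MODEL,
            "messages": messages,
            "temperature": 0,  # deterministic-ish output for scoring
        })
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()

    def run_eval(samples_path):
        correct = total = 0
        with open(samples_path) as f:
            for line in f:
                sample = json.loads(line)
                answer = complete(sample["input"])
                ideal = sample["ideal"]
                # "ideal" may be a single string or a list of acceptable answers
                if isinstance(ideal, str):
                    ideal = [ideal]
                correct += any(answer == i for i in ideal)
                total += 1
        print(f"accuracy: {correct}/{total} = {correct / total:.2%}")

    # placeholder path: point this at a samples.jsonl from the cloned repo,
    # e.g. somewhere under evals/registry/data/
    run_eval("samples.jsonl")

Run the same file through this and through the OpenAI API and you get a crude apples-to-apples number. The real oaieval CLI adds a lot on top (caching, fuzzy matching, model-graded evals), so this is only a starting point for quick comparisons.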