"The generated code might not always be correct. In that case, run it again lmao" is the best documentation I've read all week.

We also use GPT to perform actions in the software I build at work, and we hit the same inconsistency issue, which led me down a long rabbit hole: could I force an LLM to emit only grammatically correct output that follows a bespoke DSL (a DSL ideally safer and more precise than just eval'ing random AI-produced Python)?

I just finished writing up a long post [1] that describes how this can work on local models. It's a bit tricky to do efficiently via an API, but hopefully OpenAI will one day give us the primitives to appropriately steer these models to do the right things (tm).
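The general idea behind grammar-constrained decoding can be sketched without a real model: at every decoding step, discard any token that would push the partial output outside the grammar, then sample from whatever survives. Everything below is a toy stand-in (the vocabulary, the `is_valid_prefix` check, and greedy selection are my simplifying assumptions, not the actual implementation in the linked post):

```python
def is_valid_prefix(text: str) -> bool:
    # Toy "grammar": braces must never close before they open, and any
    # non-empty output must start with "{". A real system would consult
    # an incremental parser for the DSL instead.
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth < 0:
            return False
    return text == "" or text.startswith("{")

def constrained_step(prefix: str, logits: dict) -> str:
    # Mask out every token whose addition would leave the grammar,
    # then pick the highest-scoring survivor (greedy, for simplicity --
    # a real sampler would renormalize and sample).
    allowed = {tok: score for tok, score in logits.items()
               if is_valid_prefix(prefix + tok)}
    if not allowed:
        raise ValueError("grammar dead end: no valid continuation")
    return max(allowed, key=allowed.get)

# Even if the model strongly prefers "}" as the first token,
# the mask forces a grammatically valid "{".
fake_logits = {"}": 5.0, "{": 1.0, "x": 0.5}
print(constrained_step("", fake_logits))  # → {
```

The key point is that the mask is applied to the logits before sampling, so the model can never emit an invalid program, rather than generating freely and validating after the fact.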

[1] https://github.com/newhouseb/clownfish