Nobody has done this well enough yet. What's required:

1. Transcribe your speech using Whisper (then you don't have to make an effort to speak clearly, as long as you're in a relatively quiet room)

2. Get a TTS system that actually sounds good (e.g. Descript, Eleven Labs, etc.)

3. Get RAPID responses, like in a normal human conversation (the latency is mostly on OpenAI's side... so hopefully ChatGPT Plus fixes that)
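To make the three steps concrete, here is a rough sketch of one conversational turn with per-stage timing, so you can see which stage eats the latency. The function name `voice_turn` and the three callables (`transcribe`, `reply`, `speak`) are placeholders I made up, not any real library's API; you'd plug in Whisper, a GPT call, and your TTS of choice.

```python
import time
from typing import Callable, Dict, Tuple


def voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # hypothetical: speech-to-text, e.g. a Whisper wrapper
    reply: Callable[[str], str],         # hypothetical: text-to-text, e.g. a GPT wrapper
    speak: Callable[[str], bytes],       # hypothetical: text-to-speech wrapper
) -> Tuple[bytes, Dict[str, float]]:
    """Run one turn and record the seconds spent in each stage."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    text = transcribe(audio)
    timings["stt"] = time.perf_counter() - start

    start = time.perf_counter()
    answer = reply(text)
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    speech = speak(answer)
    timings["tts"] = time.perf_counter() - start

    return speech, timings
```

The stages run strictly one after another here, so the total delay is the sum of the three timings. A real-time version would presumably stream between stages instead, which this sketch doesn't attempt.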

The bottleneck is currently TTS. The best option is probably Eleven Labs, but its response times are unpredictable. GPT response times can be worked around by falling back to a faster model, but you can't do that with TTS because the voice needs to stay consistent. It seems like the current state of the art is diffusion models à la DALL-E, see e.g. [1] (the developer, James Betker, incidentally now works for OpenAI). It's nontrivial to turn this into something that works in real time without a decent budget, though.
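For the "fall back to a faster model" workaround on the GPT side, something like the following is what I have in mind. It's only a sketch: `reply_with_fallback`, `primary`, `fallback` and the 1.5 s default are all made up for illustration, and the two callables stand in for whatever slow and fast model calls you actually have.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared worker pool so a slow call can keep running in the background.
_pool = ThreadPoolExecutor(max_workers=4)


def reply_with_fallback(
    prompt: str,
    primary: Callable[[str], str],   # hypothetical: slower, better model
    fallback: Callable[[str], str],  # hypothetical: faster model
    timeout_s: float = 1.5,          # assumed latency budget, tune to taste
) -> str:
    """Return the primary model's answer, or the fallback's if it is too slow."""
    future = _pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: cancel() cannot interrupt a call that is already running.
        future.cancel()
        return fallback(prompt)
```

Note that the abandoned primary call still runs to completion in its thread (and you may still pay for it). The same trick doesn't carry over to TTS, since swapping engines mid-conversation changes the voice.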

Whisper (for transcription) is insanely fast and good.

[1] https://github.com/neonbjb/tortoise-tts