Chatbots, be they voice or text, face the same user interface problems that old-school command-line interfaces did 20 years ago, before windowing GUIs came about. When you start up, you're faced with a void. It's extremely difficult to convey to users what the acceptable entry points are and how to phrase them to get what they want. The inputs are far more generous than in the classic command-line world, but the interface is still vague enough to induce paralysis in end users. If I have a bunch of pull-down menus with clear directives on what I can and can't do, I'm going to be productive much more quickly.
I think the ultimate goal is to make the "acceptable entry points" so numerous and the variety of acceptable wordings so broad that you can approach the assistant with pretty much any goal you have in mind and it'll walk you through how to accomplish that.
Imagine if this were a realistic conversation with an assistant:
> "Hey Google, I'd like to order a pizza."
> "Sure, what kind?"
> "Let's see... cheese, pepperoni, sausage... and maybe some green peppers?"
> "Alright. What size?"
> "Hmm, so I need to feed 4 people..."
> "Sounds like a large?"
> "Sure, let's go with a large."
> "Alright. There's a Dominos nearby, I can order that for $8.99."
> "Sounds good."
> "Alright, I've ordered your pizza. Expected delivery in 15 minutes."
No need for the human to understand what the "entry point" is, because you can approach the assistant with pretty much _any_ entry point and it'll give you a useful response. We're not there yet, unfortunately, and I think it'll be quite a while before we are.
This isn't that far away. If Dominos had an external API you could build this right now.
The only difference is that you would want a confirmation step with the VUI - i.e. "Alright, you want a large cheese, pepperoni, sausage, & pepper pizza from Dominos, which will take 15 minutes to deliver. Place the order?"
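To make the confirmation step concrete, here is a rough Python sketch of the slot-filling flow from the conversation above. Everything in it is an assumption for illustration: `PizzaOrder`, `next_prompt`, `handle_reply`, `run_dialog`, the keyword lists, and the `place_order` callback (a stand-in for whatever ordering API might exist) are all made up, and the keyword matching is a toy substitute for real language understanding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical vocabularies; a real assistant would use an NLU model instead.
KNOWN_TOPPINGS = ("cheese", "pepperoni", "sausage", "green pepper")
SIZES = ("small", "medium", "large")
YES = {"yes", "yeah", "sure", "sounds good", "place the order"}


@dataclass
class PizzaOrder:
    toppings: List[str] = field(default_factory=list)
    size: Optional[str] = None
    confirmed: bool = False


def next_prompt(order: PizzaOrder) -> Optional[str]:
    """Ask for the first missing slot; read the order back before placing it."""
    if not order.toppings:
        return "Sure, what kind?"
    if order.size is None:
        return "Alright. What size?"
    if not order.confirmed:
        kinds = ", ".join(order.toppings)
        return f"Alright, you want a {order.size} {kinds} pizza. Place the order?"
    return None  # nothing left to ask


def handle_reply(order: PizzaOrder, reply: str) -> None:
    """Fill whichever slot is currently open using naive keyword matching."""
    text = reply.lower().strip()
    if not order.toppings:
        order.toppings = [t for t in KNOWN_TOPPINGS if t in text]
    elif order.size is None:
        # Stays None (and the question is repeated) if no size is mentioned.
        order.size = next((s for s in SIZES if s in text), None)
    elif not order.confirmed:
        order.confirmed = text.rstrip(".!") in YES


def run_dialog(
    replies: List[str], place_order: Callable[[PizzaOrder], str]
) -> Tuple[PizzaOrder, List[str]]:
    """Drive the conversation; only call place_order after explicit confirmation."""
    order = PizzaOrder()
    transcript: List[str] = []
    for reply in replies:
        prompt = next_prompt(order)
        if prompt is None:
            break
        transcript.append(prompt)
        handle_reply(order, reply)
    if order.confirmed:
        transcript.append(place_order(order))
    return order, transcript
```

The point of the sketch is that the side effect (placing the order) only happens once the assistant has read the whole order back and heard a yes. A vague reply like "I need to feed 4 people" fills no slot here, so the size question is simply asked again; suggesting "Sounds like a large?" would need more than keyword matching.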