What an odd board. The ESP32-S3 is an absolute powerhouse by itself. I really don't see why you would add another (probably pricey) MCU to serve as the master.
What the ESP32-S3 is capable of (especially considering the price point) is nothing short of incredible: wake word activation, audio processing (AGC, AEC, etc.), audio streaming, even on-device speech recognition for up to 400 commands with MultiNet. All of it is remarkably performant, easily besting Alexa/Echo in terms of interactivity and response time (even when using an inference server across the internet for speech recognition).
Sure, we're down in ESP-IDF land, and managing everything we have going on in FreeRTOS is a pain, but that's nothing you wouldn't have on any other microcontroller. We're also doing a lot, all things considered, and generally speaking we "just" pin audio tasks (with varying priorities) to core 1 while more-or-less dumping everything else on core 0. Seems to be working well so far!
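
For anyone curious, here is a minimal sketch of that core split, assuming ESP-IDF's xTaskCreatePinnedToCore. The task names, priorities, stack sizes and the placeholder task body are all invented for illustration, not lifted from a real project, and the host stand-ins exist only so the table logic compiles off-target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#else
/* Host stand-ins (assumption: only here so the sketch builds off-target). */
typedef void (*TaskFunction_t)(void *);
typedef int BaseType_t;
typedef unsigned int UBaseType_t;
#endif

#define CORE_GENERAL 0 /* everything that is not audio */
#define CORE_AUDIO   1 /* wake word, audio front end, streaming */

typedef struct {
    const char    *name;
    TaskFunction_t fn;
    uint32_t       stack_bytes; /* ESP-IDF takes the stack size in bytes */
    UBaseType_t    priority;    /* higher number = higher priority */
    BaseType_t     core;
} task_spec_t;

/* Hypothetical task body: a real task would do work in this loop. */
void placeholder_task(void *arg)
{
    (void)arg;
#ifdef ESP_PLATFORM
    for (;;) {
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
#endif
}

/* Illustrative layout: audio tasks on core 1, the rest on core 0. */
const task_spec_t TASKS[] = {
    { "wake_word",    placeholder_task, 8192, 10, CORE_AUDIO   },
    { "audio_fe",     placeholder_task, 8192,  9, CORE_AUDIO   },
    { "audio_stream", placeholder_task, 4096,  8, CORE_AUDIO   },
    { "network",      placeholder_task, 4096,  5, CORE_GENERAL },
    { "ui",           placeholder_task, 4096,  3, CORE_GENERAL },
};
#define TASK_COUNT (sizeof(TASKS) / sizeof(TASKS[0]))

/* Count how many tasks in the table are assigned to a given core. */
int count_on_core(BaseType_t core)
{
    int n = 0;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        if (TASKS[i].core == core) {
            n++;
        }
    }
    return n;
}

/* Create every task in the table; returns the number started. */
int start_all(void)
{
    int started = 0;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        const task_spec_t *t = &TASKS[i];
        assert(t->core == CORE_GENERAL || t->core == CORE_AUDIO);
#ifdef ESP_PLATFORM
        if (xTaskCreatePinnedToCore(t->fn, t->name, t->stack_bytes, NULL,
                                    t->priority, NULL, t->core) == pdPASS) {
            started++;
        }
#else
        printf("would pin %s to core %d at priority %u\n",
               t->name, (int)t->core, (unsigned)t->priority);
        started++;
#endif
    }
    return started;
}

#ifdef ESP_PLATFORM
void app_main(void)
{
    start_all();
}
#endif
```

Keeping the assignments in one table makes it easy to see at a glance what is competing with the audio path on core 1, and to rebalance priorities without hunting through the code.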