Somewhat shameless but relevant plug - we use the ESP32-S3 for Willow[0]. The dual core S3 and "high speed" external PSRAM are game changers.

What it is capable of (especially considering the price point) is nothing short of incredible. Wake word activation, audio processing (AGC, AEC, etc), audio streaming, even on device speech recognition for up to 400 commands with Multinet. All remarkably performant, easily besting Alexa/Echo in terms of interactivity and response time (even when using an inference server across the internet for speech recognition).

Sure we're down in ESP-IDF land and managing everything we have going on in FreeRTOS is a pain but that's not anything you wouldn't have on any microcontroller. We're also doing a lot considering and generally speaking we "just" throw/pin audio tasks (with varying priority) on core 1 while more-or-less dumping everything else on core 0. Seems to be working well so far!

[0] - https://github.com/toverainc/willow