For wake word and voice activity detection, audio processing, etc., we use the ESP-SR (speech recognition) framework from Espressif[0].
For speech-to-text there are two options, with more to come:
1) Completely on-device command recognition using the ESP-SR Multinet 6 model. Willow will (currently) pull your light and switch entities from Home Assistant and generate the grammar and command definitions required by Multinet (see the first sketch after this list). We want to develop a Willow Home Assistant component that will provide tighter Willow integration with HA and let users do this point-and-click, with dynamic updates for new/changed entities, different kinds of entities, etc., all in the HA dashboard/config.
The only "issue" with Multinet is that it only supports 400 defined commands. You're not going to get something like "What's the weather like in $CITY?" out of it.
For that we have:
2-?) Our own highly optimized inference server using Whisper, LLaMA/Vicuna, and SpeechT5 from transformers (more to come soon). We're open-sourcing it next week. Willow streams audio after wake in real time, gets the STT output, and sends it wherever you want (see the second sketch below). With the Willow Home Assistant component (which doesn't exist yet), it will sit in between our inference server implementation doing STT/TTS (or any other STT/TTS implementation supported by Home Assistant) and handle all of this for you - including chaining together other HA components, APIs, etc.
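To make option 1 a bit more concrete, here's a rough sketch (hypothetical, not the actual Willow code) of pulling light/switch entities from Home Assistant's REST API and turning their friendly names into command phrases of the kind Multinet consumes. The URL and token are placeholders, and the real Multinet command/grammar format is more involved than plain phrases:

    # Hypothetical sketch: pull HA entities and build Multinet-style command phrases.
    # HA_URL and HA_TOKEN are placeholders for your Home Assistant instance.
    import requests

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

    resp = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

    commands = []
    for state in resp.json():
        entity_id = state["entity_id"]
        if entity_id.startswith(("light.", "switch.")):
            name = state["attributes"].get("friendly_name", entity_id)
            commands.append(f"turn on {name}")
            commands.append(f"turn off {name}")

    # Multinet 6 caps out at 400 commands, so a real implementation has to
    # truncate or prioritize the generated list before flashing it to the device.
    print(f"{len(commands)} commands generated")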
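And a minimal sketch of the option 2 flow after wake: transcribe with Whisper and hand the transcript to Home Assistant to act on. This assumes the transformers ASR pipeline and HA's conversation API, with placeholder URL/token and model choice; the real inference server streams audio and is far more optimized than this:

    # Minimal sketch of the STT -> Home Assistant flow, not the actual
    # Willow inference server (which is streaming and heavily optimized).
    # Model choice, HA_URL, and HA_TOKEN are placeholder assumptions.
    import requests
    from transformers import pipeline

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

    # Whisper ASR via the transformers pipeline (needs ffmpeg for audio decoding)
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    text = asr("utterance.wav")["text"].strip()

    # Hand the transcript to Home Assistant's conversation API to act on it
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())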
[0] - https://github.com/espressif/esp-sr