Voice Transcriptionism

Ever sent an iMessage with your voice and had it not work? "Hey Dolph, take a memo on your Newton..."

Speech-to-text models encompass any model that takes an audio signal as input and spits out a sequence of characters. The tech has worked its way into iMessage, Siri, and the lot with generally decent results. In my experience, though, Apple's integrated solution only works with over-enunciated speech. When speaking at a natural cadence and tone, it's dogwater. Luckily, the AI folks can save us.
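
To make the audio-in, characters-out contract concrete, here's a minimal sketch using the open-source `openai-whisper` Python package (the model I'll get to in a second). The file name is a stand-in, and you'll need `ffmpeg` installed for the decoding step:

```python
import whisper

# Load a small pretrained checkpoint (downloads the weights on first run).
model = whisper.load_model("base")

# Audio in, characters out: "memo.wav" is a hypothetical recorded clip.
result = model.transcribe("memo.wav")
print(result["text"])
```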

Feverish efforts have been made to create small, transformer-based speech-to-text models that 'should' be able to run on a phone. Of these projects, OpenAI's Whisper stands out with a strong open-source community. Its performance, in both speed and accuracy, is more than sufficient to replace Apple's built-in option, which contributes to the usability of the app in three ways:

1. No Cloud - Should one decide to rely on public cloud APIs, they risk tanking the usability of the app. Companies that provide such services serve thousands of users at once, leading to delays and inconsistencies. Because I'm more important than them, I deserve to have a model on my own hardware; preferably my own phone.

2. Faster Transfers - On-device transcription means a request carries only text instead of the entire audio signal, a payload roughly three orders of magnitude smaller (see the back-of-envelope check after this list).

3. Accuracy - Whisper is way more accurate than what's built in. It picks up general speech (as it should) as well as subtlety: hums come through as "hmmm..." instead of 'hun,' for example.

Together, these make for a natural experience. The end goal is roughly a one-second delay between my speech and its transcription. More on this later...
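
As a sanity check on the "three orders of magnitude" claim in point 2, here's a quick sketch; the numbers (16 kHz 16-bit mono audio, a ~150 words-per-minute speaking pace) are assumptions, not measurements:

```python
# Ten seconds of speech, sent as raw audio vs. as transcribed text.
seconds = 10
sample_rate = 16_000        # Whisper resamples its input to 16 kHz
bytes_per_sample = 2        # 16-bit PCM, mono
audio_bytes = seconds * sample_rate * bytes_per_sample  # 320,000 bytes

words = 25                  # ~150 wpm conversational pace
bytes_per_word = 6          # ~5 ASCII chars plus a space
text_bytes = words * bytes_per_word                     # 150 bytes

print(f"audio is {audio_bytes / text_bytes:.0f}x larger")  # ~2133x
```

Compression narrows the gap, but even a well-encoded voice clip is tens of kilobytes against a hundred-odd bytes of text.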
