The Voice: Part 1

A man once said, "If you want a robot to talk to you, it should have a sassy voice."


Above is Michael B. Paulson, also known as ThePrimeagen on Twitch. He was just born with one of those voices that is easy to listen to because it matches his personality. Both are whiny, kinetic, funny, overcooked, intelligent, and something you just want to witness. It would be great if Sal used his voice.

(A CLIP OF HIS VOICE - INSERT HERE)

Thankfully, we have the technology! Though I don't have TikTok myself, I have seen a couple of videos featuring AI-generated content. It is quite easy for anyone to generate the voice of some celebrity using publicly available tools. Though said tools produce accurate voices, they are quite slow, taking 30 seconds to 2 minutes for just one sentence.

Because of Sal's time and compute budgets (sub-one-second response time), I sought out open-source models in the ~100 million parameter range (fewer parameters mean faster generation - for comparison, Sal's LLM is 8 billion parameters). Lower parameter counts will constrain the potential quality of generations, but hopefully not by too much. Users will, ultimately, be the judge.

Contrary to what one might think, the difficulty in voice cloning isn't to be found in the choice of model; it's the data.* Curating a clean, extensive dataset of one person's voice and its accompanying transcriptions can be quite difficult. Ideal voice datasets are built atop hours of speech recorded in a sound-controlled environment where a reader reads a given text - basically an audiobook. Readers keep a flat tone and read content that is diverse in its wording. These traits allow the model to take in the breadth of English speech in a consistent tone. Hollywood celebrities are quite difficult to clone: they can almost never be found in a sound-controlled, quiet environment where only they are talking. Michael B. Paulson is somewhere in between.

*This trend isn't unique to voice cloning. You'll find that ALL companies keep their data curation methods a trade secret.*

His Twitch channel consists of him reacting to various software articles on the internet for hours at a time. He records in a sound-controlled room with a nice mic, making the audio signal clean and easy to listen to. I wrote a script that downloads content off his channel, transcribes the audio, divides it into chunks based on sentence length (usually 3-7 seconds), and then saves each audio-text pair.
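For the curious, here's roughly what that looks like. This is a minimal sketch rather than the exact script: it assumes yt-dlp for the downloads and OpenAI's open-source Whisper model for transcription, and the channel URL, directory names, and model size are illustrative placeholders.

```python
# Sketch of the download -> transcribe -> chunk pipeline. Assumes yt-dlp,
# openai-whisper, pydub, and ffmpeg are installed. Not the exact script.
import subprocess
from pathlib import Path

import whisper                   # pip install openai-whisper
from pydub import AudioSegment   # pip install pydub (needs ffmpeg)

CHANNEL_URL = "https://www.twitch.tv/theprimeagen/videos"  # placeholder
RAW_DIR = Path("raw_audio")
OUT_DIR = Path("dataset")
RAW_DIR.mkdir(exist_ok=True)
OUT_DIR.mkdir(exist_ok=True)

# 1. Grab audio-only copies of the VODs (this pulls everything on the
#    page, so expect a LOT of data).
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "wav",
     "-o", str(RAW_DIR / "%(id)s.%(ext)s"), CHANNEL_URL],
    check=True,
)

# 2. Transcribe each file. Whisper returns timestamped segments that
#    roughly line up with sentences, giving us the chunk boundaries.
model = whisper.load_model("base")
for wav_path in RAW_DIR.glob("*.wav"):
    audio = AudioSegment.from_wav(str(wav_path))
    result = model.transcribe(str(wav_path))

    # 3. Keep segments in the 3-7 second sweet spot and save each audio
    #    clip next to its transcription as a pair.
    for i, seg in enumerate(result["segments"]):
        duration = seg["end"] - seg["start"]
        if not 3.0 <= duration <= 7.0:
            continue
        clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]
        stem = f"{wav_path.stem}_{i:05d}"
        clip.export(str(OUT_DIR / f"{stem}.wav"), format="wav")
        (OUT_DIR / f"{stem}.txt").write_text(seg["text"].strip())
```

Whisper's timestamped segments make the sentence-level chunking almost free; the real cost is transcription time over hours of VODs.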

After training MatchaTTS for the past week, the training loss (a measure of error - convergence is desired) has come down a fair amount, but the generations are still iffy. After looking at the dataset some more, I need to put more work into 'cleaning' it - some of the clips are cut too short. A first-pass filter I'm considering is sketched below, followed by the loss curves and a sample from the model (it sounds terrible). More on this to come...
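Here's the kind of filter I have in mind - a hedged sketch, not the final cleaning pass. The characters-per-second threshold is an assumed heuristic about speaking rate: if a transcript is longer than its clip could plausibly contain, the cut probably clipped the audio mid-sentence.

```python
# Sketch of a cleaning pass: drop audio-text pairs whose audio is too
# short to plausibly contain its transcript. The 12 chars/sec speaking
# rate is an assumed heuristic, not a measured constant.
from pathlib import Path

from pydub import AudioSegment

DATASET = Path("dataset")
MAX_CHARS_PER_SEC = 12  # rough upper bound on intelligible speech

for txt_path in DATASET.glob("*.txt"):
    wav_path = txt_path.with_suffix(".wav")
    text = txt_path.read_text().strip()
    seconds = len(AudioSegment.from_wav(str(wav_path))) / 1000.0

    # More text than the duration allows => the clip was cut short.
    if len(text) > seconds * MAX_CHARS_PER_SEC:
        wav_path.unlink()
        txt_path.unlink()
```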

(LOSS CURVES AND MODEL SAMPLE - INSERT HERE)