This project is not a scam

OpenAI peed on my rug - It's tough being me.


Ok, so... I'm not the only one in the industry who wants an iPhone app they can talk to. The component technologies (speech generation, text generation, voice transcription) have been of high enough caliber that some company could've packaged them up and shipped them in a product. Of the companies that could release such a product, OpenAI was the one that could've come up with 'something' a year ago.

GPT-4o brings a fundamental breakthrough to the scene that explains away a lot of the 'packaging.' GPT-4o - the 'o' is for omni - accepts a plethora of input types, which earns it the 'multimodal' designation. Multimodal models have existed previously, but we have yet to see one that encapsulates all the conceivable input types (audio, text, images, video) at the same time. For Sal, user input is handled in a three-stage pipeline, with one model handling each task. First, a speech-to-text model running on my iPhone spits out input text (check my articles on voice transcription). Then, that text is passed to a text-to-text transformer, which spits out Sal's response text (check my articles on the transformer). Finally, Sal's output text is fed into a text-to-speech model that spits out his audio response (article coming soon).
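The three-stage pipeline can be sketched like this. All three function bodies are toy stand-ins I made up for illustration (a real build would call an on-device speech-to-text model, a transformer, and a text-to-speech model); the point is the shape: three separate models, with plain text as the only hand-off between stages.

```python
def transcribe(audio: bytes) -> str:
    # Stage 1: speech-to-text (toy stand-in: pretend the audio decodes to text)
    return audio.decode("utf-8")

def generate_reply(prompt: str) -> str:
    # Stage 2: text-to-text transformer produces the response (toy stand-in)
    return f"Sal heard: {prompt}"

def synthesize(text: str) -> bytes:
    # Stage 3: text-to-speech (toy stand-in: bytes standing in for waveform audio)
    return text.encode("utf-8")

def respond(user_audio: bytes) -> bytes:
    text_in = transcribe(user_audio)    # audio -> text
    text_out = generate_reply(text_in)  # text  -> text
    return synthesize(text_out)         # text  -> audio
```

Anything non-textual in the user's speech (tone, pacing, emotion) is lost at stage 1, because text is the only thing the next model ever sees.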

Instead of a model for each task (speech-to-text, text-to-text, text-to-speech, text-to-image, image-to-text), GPT-4o brings everything under one umbrella model that takes any type as input and gives any type as output: a one-stage pipeline. Without going into details, this makes the model more capable of natural conversation.
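For contrast, the one-stage shape collapses to a single call. This is a toy stand-in of my own, not the actual GPT-4o interface: one model, audio in and audio out, with no intermediate text hand-off where information can leak away.

```python
def omni_respond(user_audio: bytes) -> bytes:
    # One multimodal model call: audio -> audio in a single stage
    # (toy stand-in; a real omni model works on raw audio, not decoded text).
    reply = f"Sal heard: {user_audio.decode('utf-8')}"
    return reply.encode("utf-8")
```

Same input, same output type as the three-stage version, but the model sees the audio itself rather than a transcript of it.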

Breakthroughs aside, the app's design bothers me. OpenAI's app and mine share basically the same layout, with an FFT visualization at the top and buttons at the bottom.

I have a nemesis now. 
