Moshi Chat is a new native voice AI model from French startup Kyutai that promises a GPT-4o-like experience: it understands your tone of voice and lets you interrupt it mid-sentence.
Unlike GPT-4o, Moshi is a smaller model that can be installed locally and run offline. That could make it a perfect fit for future smart home appliances, provided Kyutai can improve its responsiveness.
I’ve had several conversations with Moshi. The current online demo caps each one at five minutes, and every conversation I had ended the same way: the model lost context and repeated a single word over and over.
During one of those conversations, it started arguing with me, flatly refusing to tell me a story and insisting on giving me a fact instead, and it would not relent until I said, “Tell me a fact.”
This is probably all down to context window size and computational resources, problems that can likely be solved over time. OpenAI doesn’t have to worry about competition from Moshi yet, but it shows that others are catching up, just as Luma Labs, Runway, and others are closing in on Sora’s quality.
What is Moshi Chat?
![Test Moshi Chat – AI-powered speech recognition – YouTube](https://img.youtube.com/vi/coroLWOS7II/maxresdefault.jpg)
Moshi Chat is the brainchild of the Kyutai research lab, built from scratch in six months by a team of eight researchers. The goal is to keep the model open and expand it over time, but it is already the first freely accessible native generative voice AI.
“This new type of technology makes it possible for the first time to communicate with an AI in a smooth, natural and expressive way,” the company said in a statement.
Its core functionality is similar to OpenAI’s GPT-4o, but it comes from a much smaller model. It is also ready to use today, while GPT-4o’s Advanced Voice functionality will not be generally available until the fall.
The team suggests Moshi could be used in role-playing games or even as a coach that encourages you while you exercise. The plan is to work with the community and keep the model open so others can build on it and continue to refine the AI.
Under the hood is Helium, a 7-billion-parameter multimodal model trained on text and audio codecs, which lets Moshi work natively speech-to-speech. It can run on an Nvidia GPU, Apple’s Metal, or a CPU.
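Kyutai hadn’t published its inference code at the time of writing, so as a rough illustration only, here is a minimal, hypothetical PyTorch sketch of how a locally run model typically picks among those three backends. The `pick_device` helper is my own naming, not Kyutai’s API:

```python
import torch

def pick_device() -> torch.device:
    """Choose the best available local backend, in order of preference:
    CUDA (Nvidia GPU), MPS (Apple Metal), then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

# Everything stays on the local machine, so no network connection is needed.
device = pick_device()
print(f"Running locally on: {device}")
```

The same pattern is how most open models are deployed offline: detect the accelerator once at startup, then move the model weights and audio tensors to that device.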
What’s next for Moshi?
![Moshi Keynote – Kyutai – YouTube](https://img.youtube.com/vi/hm2IJSKcYvo/maxresdefault.jpg)
Kyutai hopes community support will improve Moshi’s knowledge base and factual accuracy. Both are limited so far because it is a lightweight base model, but the hope is that strengthening these aspects, combined with its native voice capability, will produce a powerful assistant.
The next step is to further refine and scale the model to enable more complex and longer conversations with Moshi.
Using it and watching the demos, I found it incredibly fast and responsive for the first minute, but the longer a conversation goes on, the more disjointed it becomes. Its lack of knowledge is also evident, and if you point out a mistake, it gets flustered and falls into a loop of “I’m sorry, I’m sorry, I’m sorry.”
This is not yet a direct competitor to OpenAI’s GPT-4o Advanced Voice, though Advanced Voice itself has yet to ship. Still, offering an open, locally running model with the potential to work the same way is a significant step forward for open source AI development.