This is a good release if they're not too cherry picked!
I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.
> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?
The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as ElevenLabs and its ability to generate a voice from a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its project on GitHub has placeholders in its code that indicate the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with some heavy-handed cues in the text, which can then be used with VC (voice conversion) to get closer to the desired results, but it's a much more cumbersome process than Eleven.
I've found it excellent with really common accents but with other accents (that are pretty common too) it can easily get stuck picking a different accent.
For instance, several Scottish recordings ended up Australian, as did a fairly mild Yorkshire accent.
What is the current state of the art for open-source multilingual TTS? I have found Kokoro to be great at English as well, but am still searching for a good solution for French, Japanese, German...
Does anyone know of an open-source TTS like this that can also encode speech to do voice conversion alongside TTS? i.e. a model that would take speech as input and convert it to one of the pretrained TTS voices.
I love Chatterbox, it's my favourite. While the generation speed is quick, I wonder what performance optimizations I could try on my 3090 to improve throughput.
It's not quite enough for realtime.
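For anyone wanting to put a number on "not quite enough for realtime", the usual yardstick is the real-time factor (RTF): seconds spent generating divided by seconds of audio produced. Here is a minimal sketch; the function names and the 0.8 headroom threshold are my own illustrative choices, not anything from Chatterbox.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by the duration of the audio produced.

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def is_realtime(rtf: float, headroom: float = 0.8) -> bool:
    """Treat a run as realtime-capable only if it leaves some headroom.

    The 0.8 default is an arbitrary illustrative margin, not a measured value.
    """
    return rtf < headroom
```

You would time one generation call with `time.perf_counter()`, divide the number of output samples by the sample rate to get `audio_seconds`, and compare RTF before and after each optimization you try.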
Interesting demo. A few observations, having uploaded a snippet of my own voice, and testing with some of my own text:
- the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)
- increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish
- it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)
- the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out
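On "whatever CFG is": it most likely stands for classifier-free guidance. The sketch below is the generic textbook formulation, not Chatterbox's actual code (I have not checked how the model applies it). The model makes one prediction with the conditioning and one without, and the weight controls how far the result is pushed toward the conditioned one.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], weight: float) -> List[float]:
    """Generic classifier-free guidance blend.

    weight = 0.0 returns the unconditional prediction, weight = 1.0 returns
    the conditional one, and larger weights extrapolate past it.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]
```

With a weight above 1.0 the output lands outside the range of either prediction, which is one plausible reason a small increase can tip generation into gibberish.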
> the emotion intensity control is killer. actual param you can tune per line.
> and the perth watermarking baked into every output, that’s the part most people are sleeping on. survives mp3, editing, even resampling. no plugin, no postprocess.
> also noticed the chatterboxtoolkitui floating in the org, with audiobook mode and batch voice conversion already wired in.
is it a banger???
yes ig so, a full setup ready for indies shipping voicefirst products right now.
I’d sign up for a service that calls a pharmacy on my behalf to refill prescriptions. In certain situations, pharmacies will not list prescriptions on their websites, even though they have the prescriptions on file, which forces the customer to call by phone — a frustrating process.
I do feel bad for pharmacists, their job is challenging in so many ways.
I always have issues with TTS models that do not allow you to send large chunks of text. Seems this one does not resolve this either. Always has a limit of like 2-3 sentences.
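The usual workaround for a short input limit is to split the text at sentence boundaries, synthesize each chunk separately, and concatenate the audio. Here is a minimal sketch of the splitting step; the 300-character limit is a made-up placeholder, and the naive regex will mis-split abbreviations like "Dr.".

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 300) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after ., ! or ? followed by whitespace (a deliberately naive rule).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes through whatever generate call your TTS exposes. Prosody can shift between chunks, so keeping chunks as long as the model tolerates tends to sound better than splitting on every sentence.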
The voice cloning is okay, not as good as ElevenLabs. There's a Rick (from Rick and Morty) voice example, and the generated audio sounds muffled and low quality. I appreciate that it's open source though.
Fun stuff... I don't know how or why, but connecting Bluetooth while on this site made all of the audio clips play at once (Firefox, Linux). Not the best listening experience.
Anyone know how this compares to Kokoro? I've found Kokoro very useful for generating audiobooks, but it almost always pronounces words with paired vowels incorrectly. Daisy becomes die-zee, leave becomes lay-ve, etc.
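One generic workaround for mispronounced words is to respell them before the text reaches the TTS. Here is a minimal sketch; the respellings are guesses for illustration, and whether a given model reads them the way you intend is an assumption you would have to check by ear.

```python
import re
from typing import Dict

# Hypothetical respellings; tune these by listening to the output.
RESPELLINGS: Dict[str, str] = {"daisy": "day-zee", "leave": "leev"}


def respell(text: str, lexicon: Dict[str, str] = RESPELLINGS) -> str:
    """Replace whole words found in the lexicon (case-insensitive)."""
    if not lexicon:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        re.IGNORECASE,
    )
    # Look up the lowercased match so "Daisy" and "daisy" share one entry.
    return pattern.sub(lambda m: lexicon[m.group(0).lower()], text)
```

The word boundaries keep "leave" from matching inside "leaves", so inflected forms need their own entries.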
Chatterbox TTS
(github.com) 639 points by pinter69, 11 June 2025 | 183 comments
Comments
I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Best voice cloning option available locally by far, in my experience.
Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...
https://github.com/basetenlabs/truss-examples/tree/main/chat...
Still working on streaming
I'd love to get to real-time generation if that's in the pipeline? Would like to use it along with Home Assistant.
It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(
In the past I've used different samples from the same speaker for this.
On the Huggingface demo, there seems to be no option for it.
It has a female voice. Any way to set it to a male voice?
- fast / cheap to run
- can clone voices
- sounds super realistic
from what I can tell, Chatterbox is the first that apparently lets you pick 3! (have not tried it myself yet, this is just what I can deduce)
Also, a deployable model: https://lightning.ai/bhimrajyadav/ai-hub/temp_01jwr0adpqf055...
> Currently only English.
meh
There’s ElevenLabs, which is quite good but not incredible, and very expensive.
Everything else... all the big AI companies... have TTS systems that are kinda meh.
Everything else in AI has advanced in leaps and bounds, but TTS remains deep in the uncanny valley.
https://news.ycombinator.com/item?id=44120204
https://news.ycombinator.com/item?id=44144155
https://news.ycombinator.com/item?id=44195105
https://news.ycombinator.com/item?id=44230867
https://news.ycombinator.com/item?id=44172134
https://news.ycombinator.com/item?id=44221910
https://news.ycombinator.com/item?id=44145564