Chatterbox TTS

(github.com)

Comments

Mizza 11 June 2025
Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

This is a good release if they're not too cherry picked!

I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

xnx 11 June 2025
You can run it for free here: https://huggingface.co/spaces/ResembleAI/Chatterbox
travisvn 12 June 2025
Chatterbox is fantastic.

I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

Best voice cloning option available locally by far, in my experience.

teraflop 11 June 2025
> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...

I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?

pryelluw 11 June 2025
Silly question, what’s the lowest-spec hardware this will run on?
ineedasername 11 June 2025
The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as Elevenlabs and its ability to generate a voice on the basis of a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its project on GitHub has placeholders in its code that indicate the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with some heavy-handed cues in the text, which can then be used with VC to get closer to desired results, but it's a much more cumbersome process than Eleven.
nmstoker 11 June 2025
I've found it excellent with really common accents but with other accents (that are pretty common too) it can easily get stuck picking a different accent. For instance several Scottish recordings ended up Australian, likewise a fairly mild Yorkshire accent
audiala 12 June 2025
What is the current state of the art for open source multilingual TTS? I have found Kokoro to be great for English as well, but am still searching for a good solution for French, Japanese, German...
abraxas 11 June 2025
Are these things good enough to narrate a book convincingly or does the voice lose coherence after a few paragraphs being spoken?
philipkiely 12 June 2025
Example implementation with sample inference code + voice cloning example:

https://github.com/basetenlabs/truss-examples/tree/main/chat...

Still working on streaming

tevon 12 June 2025
I just tested it out locally, really excellent quality, the server was easy to set up and well documented.

I'd love to get to real-time generation if that's in the pipeline? Would like to use it along with Home Assistant.

iambateman 12 June 2025
Just a regular reminder to tell your friends and family to be extra skeptical about phone conversations.

It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(

lukeinator42 18 hours ago
Does anyone know of an open-source TTS like this that can also encode speech to do voice conversion alongside TTS? i.e. a model that would take speech as input and convert it to one of the pretrained TTS voices.
DHolzer 9 hours ago
I love chatterbox, it's my favourite. While the generation speed is quick, I wonder what performance optimization I could try on my 3090 to improve throughput. It's not quite enough for realtime.
stevage 11 June 2025
Interesting demo. A few observations, having uploaded a snippet of my own voice, and testing with some of my own text:

- the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)

- increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish

- it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)

- the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out

b0a04gl 22 hours ago
> the emotion intensity control is killer. actual param you can tune per line.

> and the perth watermarking baked into every output, that’s the part most people are sleeping on. survives mp3, editing, even resampling. no plugin, no postprocess.

> also noticed the chatterboxtoolkitui floating in the org, with audiobook mode and batch voice conversion already wired in.

is it a banger??? yes ig so, a full setup ready for indies shipping voicefirst products right now.

j2kun 11 June 2025
They should put the meaning of "TTS" in the readme somewhere, probably near the top. Or their website.
palmfacehn 12 June 2025
Has anyone developed a way to annotate the input to provide emotional context?

In the past I've used different samples from the same speaker for this.

ojw0816 12 hours ago
Looks good! What is the difference between the open-source version and the priced version?
pzo 12 June 2025
It's only for English sadly
racecar789 12 June 2025
I’d sign up for a service that calls a pharmacy on my behalf to refill prescriptions. In certain situations, pharmacies will not list prescriptions on their websites, even though they have the prescriptions on file, which forces the customer to call by phone — a frustrating process.

I do feel bad for pharmacists, their job is challenging in so many ways.

bachittle 23 hours ago
I always have issues with TTS models that do not allow you to send large chunks of text. Seems this one does not resolve this either. Always has a limit of like 2-3 sentences.
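
A common workaround for short input limits, not specific to Chatterbox and not taken from its code, is to split long text on sentence boundaries and synthesize each chunk separately. A minimal sketch, assuming a character budget per chunk (the `split_into_chunks` name and the `max_chars` default are hypothetical, chosen only for illustration):

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go through whatever TTS call you are using, with the resulting audio concatenated afterwards. That part is left out here because it depends on the model's API.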
monksy 12 hours ago
How would I install this alongside librechat or ollama using docker?
MrThoughtful 12 June 2025
How do you set the voice?

On the Huggingface demo, there seems to be no option for it.

It has a female voice. Any way to set it to a male voice?

Shopper0552 11 June 2025
Anyone know a good free open source speech to text? Looking for something for my laptop which is running Fedora KDE plasma.
ipsum2 12 June 2025
The voice cloning is okay, not as good as Eleven Labs. There's a Rick (from Rick and Morty) voice example, and the generated audio sounds muffled and low quality. I appreciate that it's open source though.
kiririn7 11 June 2025
definitely worse than the new elevenlabs model (v3). that model is really good
andy_xor_andrew 11 June 2025
in my experience, TTS has been a "pick two" situation:

- fast / cheap to run

- can clone voices

- sounds super realistic

from what I can tell, Chatterbox is the first that apparently lets you pick 3! (have not tried it myself yet, this is just what I can deduce)

andymcsherry 12 June 2025
Here's an open-source serving implementation: https://lightning.ai/bhimrajyadav/studios/build-a-production...

Also, a deployable model: https://lightning.ai/bhimrajyadav/ai-hub/temp_01jwr0adpqf055...

3ds 12 June 2025
There are only english voices, even in the paid version. Using them in other languages results in an accent.
init0 12 June 2025
az226 11 June 2025
How does one train a TTS model with an LLM backbone? Practically, how does this work?
ash1224 8 hours ago
wow! 200ms, very good!
benob 12 June 2025
Watermarking is easily disabled in the code. I am wondering when they will release model weights with embedded watermarking.
decide1000 11 June 2025
How does it perform on multi-lingual tasks?
SV_BubbleTime 18 hours ago
Fun stuff... I don't know how or why, but connecting bluetooth while on this site, made all of the audio clips play at once (Firefox, Linux). Not the best listening experience.
causality0 11 June 2025
Anyone know how this compares to Kokoro? I've found Kokoro very useful for generating audiobooks but it almost always pronounces words with paired vowels incorrectly. Daisy becomes die-zee, leave becomes lay-ve, etc.
pradeepodela 12 June 2025
What is the latency?
tuananh 12 June 2025
for this, what does it take to support another language?
internet_points 12 June 2025
> Supported Lanugage

> Currenlty only English.

meh

_andrei_ 12 June 2025
very cherry picked
andrewstuart 12 June 2025
There’s been surprisingly little advancement in TTS after a rapid leap forward three years ago or so.

There’s ElevenLabs, which is quite good but not incredible, and very expensive.

Everything else... all the big AI companies... have TTS systems that are kinda meh.

Everything else in AI has advanced in leaps and bounds, while TTS remains deep in the uncanny valley.

hsavit1 12 June 2025
another TTS that is only supporting English. This really irritates me
andyferris 12 June 2025
It took me ages to understand what TTS means!