Show HN: Dia, an open-weights TTS model for generating realistic dialogue (github.com)
642 points by toebee | 21 April 2025 | 191 comments

Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.
I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with APIs, but the results did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open-source contributions are extra welcome. Please feel free to check out the code, and share any thoughts or suggestions with us.
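For readers who want a feel for the API, here is a minimal usage sketch. The `Dia.from_pretrained` / `generate` calls, the `nari-labs/Dia-1.6B` checkpoint id, and the [S1]/[S2] speaker tags follow the repo README; treat the exact signatures and the 44.1 kHz output rate as approximate rather than authoritative:

```python
import soundfile as sf
from dia.model import Dia

# Load the 1.6B open-weights checkpoint (id as listed in the README).
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Dia takes the whole conversation in one pass. Speakers are marked with
# [S1]/[S2] tags, and non-verbal cues such as (laughs) go inline.
script = (
    "[S1] Dia generates the entire dialogue in a single pass. "
    "[S2] Wow, that sounds surprisingly natural! (laughs)"
)

audio = model.generate(script)

# The README's example writes a 44.1 kHz waveform; adjust if your checkout differs.
sf.write("dialogue.wav", audio, 44100)
```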
Is this Apache licensed or a custom one? The README contains this:
> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:
> Identity Misuse: Do not produce audio resembling real individuals without permission.
> ...
Specifically the phrase "intended solely for research and educational use".
https://github.com/nari-labs/dia/pull/4
Isn't it weird how "We don't have a full list of non-verbal [commands]"? Like, I can imagine why, but it's wild that we're at a point where we don't know what our code can do.
This is really impressive work and the dialogue quality is fantastic.
For anyone wanting a quick way to spin this up locally with a web UI and API access, I put together a FastAPI server wrapper around the model:
https://github.com/devnen/Dia-TTS-Server
The setup is just a standard pip install -r requirements.txt (works on Linux/Windows). It pulls the model from HF automatically – defaulting to the faster BF16 safetensors (ttj/dia-1.6b-safetensors), but that's configurable in the .env. You get an OpenAI-compatible API endpoint (/v1/audio/speech) for easy integration, plus a custom one (/tts) to control all the Dia parameters. The web UI gives you a simple way to type text, adjust sliders, and test voice cloning. It'll use your CUDA GPU if you have one configured, otherwise, it runs on the CPU.
Might be a useful starting point or testing tool for someone. Feedback is welcome!
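To illustrate the OpenAI-compatible route, a rough client sketch is below. The /v1/audio/speech path comes from the comment above, but the host/port and the payload fields (modeled on the OpenAI speech API's model/input/voice shape) are assumptions, so check the server's own docs:

```python
import requests

# Assumed local address for a running Dia-TTS-Server instance.
URL = "http://localhost:8000/v1/audio/speech"

# Payload fields mirror the OpenAI speech API; the wrapper's accepted
# parameters may differ, so treat this as a sketch.
payload = {
    "model": "dia-1.6b",
    "input": "[S1] Hello there! [S2] Hi, great to meet you. (laughs)",
    "voice": "S1",
    "response_format": "wav",
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()

# The endpoint returns raw audio bytes in the response body.
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```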
Sounds really good & human! Got a fair number of unexpected artifacts though, e.g. 3 seconds of hissing noise before the dialogue, and music in the background when I added (happy) in an attempt to control tone. Also don't understand how to control the S1 and S2 speakers... is it just random based on temperature?
> TODO Docker support
Got this adapted pretty easily. Just use the latest NVIDIA CUDA container, throw Python and the required modules on it, and change the server to listen on 0.0.0.0. It does mean it pulls the model every time on startup though, which isn't ideal.
Wow, this is the first time I've felt that this could be the end of voice acting / audiobook narration, etc. With the speed at which things are changing, how soon before you can turn any book or novel into a complete audio-video production, a movie or TV show?
Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently kept ElevenLabs ahead of the pack, in my experience, is that their models seem to mostly avoid (albeit not being immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.
A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
Was this trained on Planet Money / NPR podcasts? The last audio sample (the continuation of the prompt) sounds eerily like Planet Money; I had to double-check whether my Spotify had accidentally started playing.
Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.
Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.
I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.
The full version of Dia requires around 10GB of VRAM to run.
If you have 16 GB of VRAM, I guess you could pair this with a 3B-param model alongside it, but realistically probably only a 1B-param model with a reasonable context window: after Dia's ~10 GB you have about 6 GB left, which a 3B model in FP16 would consume on weights alone, leaving nothing for the KV cache.
Training an audio model this good from 0 prior experience is really amazing. I would love to read a blog post about how you guys approached ramping up knowledge and getting practical quickly. Any plans?
The audio quality is seriously impressive. Any plans to add word-level timing maps? For my use case that is a requirement, so unfortunately I cannot use this yet, but I would very much like to.
Pretty cool - love seeing open stuff like this come together so fast. Do you think all this will ever match what a real voice actor can pull off, or will something totally new come out of it?
Seeing is no longer believing. Hearing isn't either. The funny thing is, it's getting to the point where LLM-generated text is more easily spotted than AI audio, video, and images.
It's going to be an interesting decade of the new equivalent of "No, Tiffany, Bill Gates will NOT be sending you $100 for forwarding that email." Except it's going to be AI celebrities making appeals for donations to help them become billionaires or something.
Looking forward to trying it. My current go-to solution is E5-F2 (great cloning, decent delivery, OK audio quality, but a lot of incoherence here and there, forcing you to do multiple generations).
I've just been massively disappointed by Sesame's CSM: the Gradio demo on their website was generating flawless dialogues with amazing voice cloning, but when running it locally the voice cloning performance is awful.
https://i.horizon.pics/4sEVXh8GpI (27s)
It starts with an intro, too. Really strange
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
https://gitlab.gnome.org/GNOME/dia
Time to first audio is crucial for us to reduce latency - wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go?
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
https://github.com/SparkAudio/Spark-TTS
Sounds awesome on the demo page though.
What're the recommended GPU cloud providers for using such open-weights models?