Show HN: Dia, an open-weights TTS model for generating realistic dialogue (github.com)
642 points by toebee | 21 April 2025 | 191 comments

Insane how much low-hanging fruit there is for audio models right now. A team of two picking things up over a few months can build something that still competes with large players with tons of funding.
This is really impressive; we're getting close to a dream of mine: the ability to generate proper audiobooks from EPUBs. Not just a robotic single voice for everything, but different, consistent voices for each protagonist, with the LLM analyzing the text to guess which voice to use and add an appropriate tone, much like a voice actor would do.
I've tried "EPUB to audiobook" tools, but they are really miles behind what a real narrator accomplishes and make the audiobook impossible to engage with
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.
Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.
Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia
We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with APIs, but the results did not sound like human conversation.
So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over 3 months.
Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.
We’d love to hear what you think! We are a tiny team, so open-source contributions are extra welcome. Please feel free to check out the code, and share any thoughts or suggestions with us.
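For readers who want a feel for the API, here is a minimal usage sketch. The `Dia.from_pretrained` / `generate` calls, the `nari-labs/Dia-1.6B` checkpoint id, and the [S1]/[S2] speaker tags follow the repo README; treat the exact signatures and the 44.1 kHz output rate as approximate rather than authoritative:

```python
import soundfile as sf
from dia.model import Dia

# Load the 1.6B open-weights checkpoint (id as listed in the README).
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Dia takes the whole conversation in one pass. Speakers are marked with
# [S1]/[S2] tags, and non-verbal cues such as (laughs) go inline.
script = (
    "[S1] Dia generates the entire dialogue in a single pass. "
    "[S2] Wow, that sounds surprisingly natural! (laughs)"
)

audio = model.generate(script)

# The README's example writes a 44.1 kHz waveform; adjust if your checkout differs.
sf.write("dialogue.wav", audio, 44100)
```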
Is this Apache licensed or a custom one? The README contains this:
> This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
> This project offers a high-fidelity speech generation model *intended solely for research and educational use*. The following uses are strictly forbidden:
> Identity Misuse: Do not produce audio resembling real individuals without permission.
> ...
Specifically the phrase "intended solely for research and educational use".
https://github.com/nari-labs/dia/pull/4
Isn't it weird how "We don't have a full list of non-verbal [commands]"? Like, I can imagine why, but it's wild that we're at a point where we don't know what our code can do.
This is really impressive work and the dialogue quality is fantastic.
For anyone wanting a quick way to spin this up locally with a web UI and API access, I put together a FastAPI server wrapper around the model:
https://github.com/devnen/Dia-TTS-Server
The setup is just a standard pip install -r requirements.txt (works on Linux/Windows). It pulls the model from HF automatically – defaulting to the faster BF16 safetensors (ttj/dia-1.6b-safetensors), but that's configurable in the .env. You get an OpenAI-compatible API endpoint (/v1/audio/speech) for easy integration, plus a custom one (/tts) to control all the Dia parameters. The web UI gives you a simple way to type text, adjust sliders, and test voice cloning. It'll use your CUDA GPU if you have one configured, otherwise, it runs on the CPU.
Might be a useful starting point or testing tool for someone. Feedback is welcome!
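To illustrate the OpenAI-compatible route, a rough client sketch is below. The /v1/audio/speech path comes from the comment above, but the host/port and the payload fields (modeled on the OpenAI speech API's model/input/voice shape) are assumptions, so check the server's own docs:

```python
import requests

# Assumed local address for a running Dia-TTS-Server instance.
URL = "http://localhost:8000/v1/audio/speech"

# Payload fields mirror the OpenAI speech API; the wrapper's accepted
# parameters may differ, so treat this as a sketch.
payload = {
    "model": "dia-1.6b",
    "input": "[S1] Hello there! [S2] Hi, great to meet you. (laughs)",
    "voice": "S1",
    "response_format": "wav",
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()

# The endpoint returns raw audio bytes in the response body.
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```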
Sounds really good & human! Got a fair number of unexpected artifacts though, e.g. 3 seconds of hissing noise before the dialogue, and music in the background when I added (happy) in an attempt to control tone. Also don't understand how to control the S1 and S2 speakers... is it just random based on temperature?
> TODO Docker support
Got this adapted pretty easily. Just use the latest NVIDIA CUDA container, throw Python and the required modules on it, and change the server to listen on 0.0.0.0. It does mean it pulls the model every time on startup though, which isn't ideal.
Wow, this is the first time I've felt that this could be the end of voice acting / audiobook narration, etc. With the speed at which things are changing, how soon before you can turn any book or novel into a complete audio-video production, a movie or TV show?
Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently kept ElevenLabs ahead of the pack, in my experience, is that their models seem to mostly avoid (albeit not being immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.
A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.
Was this trained on Planet Money / NPR podcasts? The last audio sample (the continuation of the prompt) sounds eerily like Planet Money; I had to double-check whether my Spotify had accidentally started playing.
Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.
Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.
I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.
The full version of Dia requires around 10GB of VRAM to run.
If you have 16 GB of VRAM, I guess you could pair this with a 3B-param model alongside it, but realistically probably only a 1B-param model with a reasonable context window: after Dia's ~10 GB you have about 6 GB left, which a 3B model in FP16 would consume on weights alone, leaving nothing for the KV cache.
Training an audio model this good from 0 prior experience is really amazing. I would love to read a blog post about how you guys approached ramping up knowledge and getting practical quickly. Any plans?
The audio quality is seriously impressive. Any plans to add word-level timing maps? For my use case that is a requirement, so unfortunately I cannot use this yet, but I would very much like to.
Pretty cool - love seeing open stuff like this come together so fast. Do you think all this will ever match what a real voice actor can pull off, or will something totally new come out of it?
Seeing is no longer believing. Hearing isn't either. The funny thing is, it's getting to the point where LLM-generated text is more easily spotted than AI audio, video, and images.
It's going to be an interesting decade of the new equivalent of "No, Tiffany, Bill Gates will NOT be sending you $100 for forwarding that email." Except it's going to be AI celebrities making appeals for donations to help them become billionaires or something.
Looking forward to trying it. My current go-to solution is E5-F2 (great cloning, decent delivery, OK audio quality, but a lot of incoherence here and there, forcing you to do multiple generations).
I've just been massively disappointed by Sesame's CSM: the Gradio demo on their website was generating flawless dialogues with amazing voice cloning, but when running it locally the voice cloning performance is awful.
https://i.horizon.pics/4sEVXh8GpI (27s)
It starts with an intro, too. Really strange
> [S1] Oh fire! Oh my goodness! What's the procedure? What do we do people? The smoke could be coming through an air duct!
Seriously impressive. Wish I could direct link the audio.
Kudos to the Dia team.
https://gitlab.gnome.org/GNOME/dia
Time to first audio is crucial for us to reduce latency - wondering if Dia works with output streaming?
The Python code snippet seems to imply that the entire audio is generated in one go?
Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
I would absolutely love something like this for practicing Chinese, or even just adding Chinese dialogue to a project.
https://github.com/SparkAudio/Spark-TTS
Sounds awesome on the demo page though.
What're the recommended GPU cloud providers for using such open-weights models?