SwanVoice: Zero-Shot Speech Synthesis for Long Monologue and Dialogue

Quick answer

SwanVoice synthesizes a full multi-turn dialogue of up to 4 speakers in a single generation pass, instead of voicing each turn separately and stitching the clips together. On the authors’ SwanBench-Speech evaluation it scores higher on expressive “richness” and prosodic “hierarchy” than every open-source baseline they tested, in both monologue and dialogue. The trade-off is blunt: word-level content accuracy is the system’s weakest point, so it can mispronounce or drop words even while the delivery sounds natural.

Why turn-by-turn TTS sounds wrong

Most zero-shot text-to-speech systems voice one utterance at a time. For a single narrator that is fine. For a conversation it breaks in three ways the paper names directly: the same speaker’s timbre drifts between turns (acoustic inconsistency), the back-and-forth loses its conversational rhythm (incoherence), and emotion resets at every turn boundary instead of building or cooling across the exchange (affective discontinuity). You hear it as two clips glued together rather than two people talking. SwanVoice’s core bet is that a dialogue has to be generated as one continuous object so the model can carry voice, timing, and mood across speaker turns.

How SwanVoice is built

Three pieces do the work. First, a 25 Hz variational autoencoder compresses speech into a low-frame-rate latent — fewer frames per second means long, multi-minute audio stays tractable to model. Second, the text side uses raw-text conditioning with explicit pause-aware symbols and pinyin substitution, so the model sees punctuation, breaks, and pronunciation hints rather than a stripped phoneme string. Third, the generator is a flow-matching Diffusion Transformer (DiT) with speaker-turn conditioning: the turn structure (who speaks when) is fed in so the model knows the conversational layout it is filling.
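
The text-side idea can be sketched in a few lines. This is a minimal illustration, not the paper's front end: the pause symbols (`<pause_short>`, `<pause_long>`), the `{hang2}` annotation format, and the tiny override table are all assumptions made up for the example. It only shows the general shape of keeping raw text, marking pauses explicitly, and swapping an ambiguous character for its pinyin.

```python
# Hypothetical pause symbols keyed by punctuation; the real symbol set is not
# given in this summary.
PAUSE_SYMBOLS = {
    "，": "<pause_short>",
    "、": "<pause_short>",
    "。": "<pause_long>",
    "？": "<pause_long>",
    "！": "<pause_long>",
}

# Hypothetical overrides for polyphonic characters: the ambiguous character is
# replaced by a pinyin-with-tone annotation (format invented for this sketch).
PINYIN_OVERRIDES = {
    "银行": "银{hang2}",   # 行 read as "hang2" in "bank", not "xing2"
    "重庆": "{chong2}庆",  # 重 read as "chong2" in the city name, not "zhong4"
}


def prepare_text(text: str) -> str:
    """Keep the raw text, substitute pinyin hints, and append pause symbols."""
    for word, annotated in PINYIN_OVERRIDES.items():
        text = text.replace(word, annotated)
    pieces = []
    for ch in text:
        # Punctuation is kept and followed by its pause symbol, so the model
        # still sees the original character.
        pieces.append(ch + PAUSE_SYMBOLS[ch] if ch in PAUSE_SYMBOLS else ch)
    return "".join(pieces)


conditioned = prepare_text("去银行，好。")
```

In a real system the overrides would presumably come from a pronunciation lexicon or a polyphone disambiguation model rather than a hand-written table.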

Training is staged rather than one-shot. The model starts on monologue speech, moves to mixed data, then to real dialogue recordings: a curriculum that teaches single-voice quality before conversational dynamics. To support it, the authors built SwanData-Speech, a paired corpus. After the main training they run a DiffusionNFT post-training step driven by two rewards: a phone-level reward (push pronunciation toward the target) and a speaker-similarity reward (keep each voice on-identity). Of the two, the phone-level reward is the lever aimed squarely at the accuracy weakness.
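
A rough sketch of what a staged schedule and a two-part reward could look like follows. Everything numeric here is a placeholder: the step counts, the 50/50 reading of "mixed data", and the equal reward weights are assumptions, and the summary does not say how the paper actually combines the two rewards. A weighted sum is simply the most common choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    monologue_fraction: float  # share of monologue samples in each batch
    steps: int                 # placeholder stage length


# Hypothetical curriculum: monologue first, then mixed, then dialogue only.
CURRICULUM: List[Stage] = [
    Stage("monologue", 1.0, 1000),
    Stage("mixed", 0.5, 1000),
    Stage("dialogue", 0.0, 1000),
]


def stage_at(step: int, curriculum: List[Stage] = CURRICULUM) -> Stage:
    """Return the stage that is active at a given global training step."""
    boundary = 0
    for stage in curriculum:
        boundary += stage.steps
        if step < boundary:
            return stage
    return curriculum[-1]  # stay in the final stage once the schedule ends


def combined_reward(phone_accuracy: float, speaker_similarity: float,
                    w_phone: float = 0.5, w_speaker: float = 0.5) -> float:
    """Assumed weighted sum of the phone-level and speaker-similarity rewards."""
    return w_phone * phone_accuracy + w_speaker * speaker_similarity
```

Raising `w_phone` relative to `w_speaker` would push post-training harder toward pronunciation accuracy at the expense of voice identity, which is the trade-off the two rewards are balancing.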

Key results

SwanVoice beats all evaluated open-source baselines on richness and hierarchy scores in both the monologue and the dialogue settings of SwanBench-Speech — the metrics that capture expressiveness and prosodic structure, which is exactly what stitched turn-by-turn systems lose.
It handles 1 to 4 speakers in a single zero-shot generation, covering both solo narration and multi-party conversation without a separate pipeline per mode.
The 25 Hz latent frame rate is the enabling choice for long-form: a low frame rate is what keeps generating minutes of coherent audio feasible for a diffusion transformer.
Content accuracy is the named weak axis — the authors flag it as the main limitation, and the DiffusionNFT phone-level reward exists specifically to claw some of it back.
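
The frame-rate point is easy to make concrete with back-of-the-envelope arithmetic. The sketch below counts latent frames for a five-minute clip at 25 Hz and at a hypothetical 50 Hz comparison point (the 50 Hz figure is only a reference chosen for illustration, not a number from the paper). The cost ratio assumes full self-attention, whose pairwise cost grows with the square of sequence length.

```python
def latent_frames(duration_seconds: float, frame_rate_hz: float = 25.0) -> int:
    """Number of latent frames needed to cover a clip at a given frame rate."""
    return round(duration_seconds * frame_rate_hz)


five_minutes = 5 * 60  # seconds

frames_25 = latent_frames(five_minutes)        # 7500 frames at 25 Hz
frames_50 = latent_frames(five_minutes, 50.0)  # 15000 frames at 50 Hz

# Under full self-attention, doubling the sequence length quadruples the
# number of pairwise attention scores.
attention_ratio = (frames_50 / frames_25) ** 2  # 4.0
```

So halving the frame rate halves the sequence a transformer has to model and, under the full-attention assumption, cuts the attention cost to a quarter.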

My honest read

The interesting claim here is not “better MOS” — it is reframing dialogue TTS as a single generation problem so cross-turn consistency is built in rather than patched. That is the right framing, and the expressiveness wins are believable. But the headline metrics are richness and hierarchy, which are softer than the one number that decides whether a TTS system is usable: did it say the right words. The paper is honest that accuracy lags, and that ordering tells you where this sits today — a system you would reach for when expressive, consistent delivery matters more than transcript-perfect fidelity, not the other way around. Anyone benchmarking against it should report intelligibility (WER) front and center, because that is where the pressure is.

Limits and open questions

Content accuracy is the explicit main limitation: expressive, consistent-sounding output that still mispronounces or drops words is a real failure mode for narration, audiobooks, or anything caption-aligned. The richness and hierarchy scores are proposed by the same work that introduces SwanBench-Speech, so they need independent corroboration before they can be treated as a standard. The cap of 4 speakers bounds how large a scene it can voice. And generating a whole dialogue in one pass, the source of its consistency advantage, raises questions the abstract leaves open about latency, memory, and how gracefully it streams for interactive use.

FAQ

What is SwanVoice and what makes it different from normal TTS?

SwanVoice is a zero-shot text-to-speech system that generates an entire monologue, or a dialogue of up to 4 speakers, in one pass. Unlike standard TTS that voices each turn separately and concatenates them, it keeps timbre, conversational rhythm, and emotion consistent across speaker turns.

How does SwanVoice generate multi-speaker dialogue?

It conditions a flow-matching Diffusion Transformer on the speaker-turn structure and generates the full conversation as one continuous latent at 25 Hz, so cross-turn acoustic and affective consistency is modeled directly rather than stitched after the fact.
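
One plausible way to hand the turn structure to a model as a single input is to serialize the script with speaker tags. The sketch below does that at the text level. The `[S1]`-style tags and the `serialize_dialogue` helper are hypothetical, and the summary does not say whether the paper encodes turns per text token, per latent frame, or in some other way.

```python
from typing import Dict, List, Tuple

MAX_SPEAKERS = 4  # the speaker cap stated for SwanVoice


def serialize_dialogue(turns: List[Tuple[str, str]]) -> str:
    """Flatten (speaker, text) turns into one tagged string for a single pass."""
    speaker_ids: Dict[str, int] = {}
    parts: List[str] = []
    for speaker, text in turns:
        if speaker not in speaker_ids:
            if len(speaker_ids) == MAX_SPEAKERS:
                raise ValueError(f"more than {MAX_SPEAKERS} speakers in dialogue")
            # Speakers are numbered in order of first appearance.
            speaker_ids[speaker] = len(speaker_ids) + 1
        parts.append(f"[S{speaker_ids[speaker]}] {text.strip()}")
    return " ".join(parts)


script = serialize_dialogue([
    ("Ana", "Hi."),
    ("Ben", "Hello!"),
    ("Ana", "Ready?"),
])
# script == "[S1] Hi. [S2] Hello! [S1] Ready?"
```

The point of the single string is that a returning speaker keeps the same tag, so the model can tie the third turn back to the voice it used in the first.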

Is SwanVoice accurate enough to use in production?

Its expressiveness leads open-source baselines, but content accuracy is the named main limitation — it can mispronounce or drop words. For accuracy-critical uses like audiobooks or captioned media, measure word error rate before relying on it.
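
Measuring word error rate is simple enough to do yourself. A minimal sketch: transcribe the synthesized audio with any speech recognizer (not shown here, and its own errors will inflate the number), then compare the transcript to the input script with a word-level edit distance. The whitespace tokenization below is an assumption that suits English; for Chinese text a character-level error rate is the usual substitute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,                      # deletion (word dropped)
                cur[j - 1] + 1,                   # insertion (extra word)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the script.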

What are SwanData-Speech and SwanBench-Speech?

SwanData-Speech is the paired training corpus the authors built to train SwanVoice, and SwanBench-Speech is their evaluation benchmark covering both monologue and dialogue, where they report the richness and hierarchy scores.

One line: generate the whole conversation at once so voices stay consistent — just verify it said the right words. Read the original paper on arXiv.