Voxtral: Open-source TTS beats ElevenLabs in blind test, runs on laptops

SnapshotBot · 2026-03-28T08:25:01+00:00

Mistral's Voxtral performed strongly in a multilingual voice-cloning blind test: evaluators preferred it about 70% of the time on naturalness and similarity, ahead of ElevenLabs Flash v2.5. Voxtral also ships with open-source weights and supports local deployment, which reduces cost and privacy risk, but the licensing of reference voices for commercial use still needs clarification.

Title

Mistral’s Voxtral: Beats ElevenLabs in Blind Test, Can Run Locally

Summary

Rohan Paul highlighted a set of comparison figures: in a blind test of multilingual voice cloning, reviewers chose Mistral’s newly released Voxtral 70% of the time on naturalness, accent reproduction, and similarity. The model has 4 billion parameters, can clone a voice from 3 seconds of reference audio, supports 9 languages, and is reported to reach 70ms latency on a laptop. Because the weights are open, businesses can run it themselves instead of paying per API call.

Key Points

70% Preference Rate: Blind tests were run with native speakers across 9 languages, who judged naturalness, accent accuracy, and similarity to the original voice.
Who It Beats: Outperformed ElevenLabs Flash v2.5 and tied with ElevenLabs v3.
Technical Features: A Transformer architecture that captures speech habits such as pauses and intonation in finer detail; the open-source weights can run locally, which saves API costs and avoids vendor lock-in.
Licensing Issues: The model itself can be used commercially, but the reference voices are licensed CC BY-NC. Whether other people’s voices can legally be used in commercial products remains unclear.
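
How much weight a 70% preference rate deserves depends on how many blind comparisons stand behind it, which this article does not state. The sketch below shows one common way to tally pairwise blind votes and attach a 95% Wilson interval. The function name, the vote labels, the treatment of ties, and the ten sample ballots are all assumptions for illustration, not Mistral's published method or data.

```python
from collections import Counter
from math import sqrt


def preference_rate(votes, model, z=1.96):
    """Share of decided (non-tie) votes won by `model`, plus a Wilson interval.

    `votes` is a list of labels, one per blind comparison. The label "tie"
    is excluded from the denominator (an assumption; the real protocol
    may have handled ties differently).
    """
    counts = Counter(votes)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided votes to score")
    p = counts[model] / decided
    # Wilson score interval for a binomial proportion.
    denom = 1 + z**2 / decided
    centre = (p + z**2 / (2 * decided)) / denom
    half = z * sqrt(p * (1 - p) / decided + z**2 / (4 * decided**2)) / denom
    return p, (centre - half, centre + half)


# Hypothetical ballots, NOT data from the actual evaluation.
votes = ["voxtral"] * 7 + ["elevenlabs"] * 3
rate, (low, high) = preference_rate(votes, "voxtral")
print(f"preferred {rate:.0%} of the time, 95% interval {low:.0%} to {high:.0%}")
```

With only ten ballots the interval is very wide, which is why the sample size in Mistral's own write-up matters when reading the headline figure.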

Why This Time Is Different

Cost and Control
- ElevenLabs: Charges per character and runs on its own servers behind a closed-source API.
- Voxtral: Download the weights and run them locally, with no per-use fees and full control over the whole pipeline.
What It Can Do
- Suited to scenarios such as voice agents, simultaneous interpretation, and dubbing; open-source weights make experimentation and scaling cheaper, and privacy compliance is easier to manage.
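
The trade-off under Cost and Control, per-character billing versus a fixed cost for self-hosting, can be made concrete with a break-even estimate. The sketch below is a simplification: both prices are made-up placeholders rather than ElevenLabs or hardware quotes, and the function names are hypothetical.

```python
def api_monthly_cost(chars_per_month, price_per_1k_chars):
    """Cost of a per-character API for a given monthly volume."""
    return chars_per_month / 1000 * price_per_1k_chars


def break_even_chars(fixed_monthly_cost, price_per_1k_chars):
    """Monthly character volume above which a fixed-cost local
    deployment becomes cheaper than the per-character API."""
    if price_per_1k_chars <= 0:
        raise ValueError("price_per_1k_chars must be positive")
    return fixed_monthly_cost / price_per_1k_chars * 1000


# Placeholder figures for illustration only; substitute real quotes.
PRICE_PER_1K_CHARS = 0.10    # hypothetical API price, in dollars
LOCAL_FIXED_MONTHLY = 500.0  # hypothetical hardware + ops cost, in dollars

threshold = break_even_chars(LOCAL_FIXED_MONTHLY, PRICE_PER_1K_CHARS)
print(f"local hosting is cheaper above about {threshold:,.0f} characters per month")
```

Below the threshold the metered API costs less; above it the fixed-cost deployment does, setting aside engineering time and quality differences.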

Quick Comparison

Dimension	Voxtral	ElevenLabs
Model Access	Open-source weights, can run locally	Closed-source API
Latency	Approximately 70ms on laptops	Depends on cloud service and plan
Languages	9	Multilingual (not detailed in this article)
Voice Cloning	3 seconds of reference audio	Supported (not expanded in this article)
Evaluation	70% preference in blind test	Flash v2.5 lost; v3 roughly tied
Commercial Restrictions	Reference voices CC BY-NC	Platform licensing and billing restrictions

For evaluation methods and details, refer to Mistral’s blog, documentation, and Hugging Face repository.

Industry Background

This release revives the long-running debate between open-source and closed-source models. Mistral is expanding from language models into voice as part of its multimodal strategy. Teams want voice applications that are stable, controllable, and cost-predictable, and open-source weights combined with self-hosting offer a balance of cost, performance, and compliance.

Risks

Licensing Uncertainty: Reference voices are CC BY-NC; it remains unclear how copyright and likeness rights apply when another person’s voice is cloned directly for a commercial product.
Limited Comparison Scope: The test compared Voxtral only with ElevenLabs, not with other open-source TTS systems such as Coqui or Bark.

Impact Assessment

Importance: High
Category: Model release, open-source, market impact

Judgment: For teams that need a controllable voice pipeline and predictable costs, it is not too late to adopt now. Developers and enterprise builders stand to gain the most; those focused solely on trading are less affected.
