Voxtral: Open-source TTS beats ElevenLabs in blind test, runs on laptops

Title

Mistral’s Voxtral Beats ElevenLabs in Blind Test and Can Run Locally

Summary

Rohan Paul highlighted a set of comparative data: in a blind test of multilingual voice cloning, reviewers preferred Mistral’s newly released Voxtral 70% of the time on naturalness, accent reproduction, and similarity to the original voice. The 4-billion-parameter model can clone a voice from 3 seconds of reference audio, supports 9 languages, and reaches about 70 ms latency on laptops. Because the weights are open, businesses can run it themselves instead of paying per API call.

Key Points

  • 70% Preference Rate: Blind tests were conducted by native speakers in 9 languages, who rated naturalness, accent accuracy, and similarity to the original voice.
  • Who It Beats: Outperformed ElevenLabs Flash v2.5 and roughly tied with v3.
  • Technical Features: A Transformer architecture that captures speech habits such as pauses and intonation in finer detail; the open-source weights can run locally, which saves API costs and avoids vendor lock-in.
  • Licensing Issues: The model itself can be used commercially, but the reference voices are CC BY-NC. It is legally unclear whether using other people’s voices in products is permissible.
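
The article gives the 70% preference figure without the number of votes behind it. As a rough illustration of how such a rate and its uncertainty could be computed from pairwise blind-test votes, here is a minimal Python sketch. The vote counts are hypothetical and are not taken from Mistral's evaluation, and the function names are invented for this example.

```python
import math


def preference_rate(wins: int, total: int) -> float:
    """Share of pairwise comparisons in which the model was preferred."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    return wins / total


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = preference_rate(wins, total)
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical tally: 70 of 100 votes went to the model under test.
rate = preference_rate(70, 100)
low, high = wilson_interval(70, 100)
print(f"preference rate: {rate:.0%}, interval: {low:.1%} to {high:.1%}")
```

The interval narrows as the number of votes grows, which is why the sample size matters when reading a headline preference rate.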

Why This Time Is Different

  • Cost and Control
    • ElevenLabs: Charges per character and runs on its own servers behind a closed-source API.
    • Voxtral: Download the weights and run them locally, with no per-use fees and full control over the entire process.
  • What It Can Do
    • Suited to scenarios such as voice agents, simultaneous interpretation, and dubbing; open-source weights make experimentation and scaling cost-effective, and privacy compliance is easier to handle.
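
The cost trade-off above can be sketched as a simple break-even calculation: per-character billing grows with volume, while self-hosting is closer to a fixed monthly cost. The Python below is only an illustration. The price and hardware figures are hypothetical placeholders, not actual ElevenLabs or Mistral pricing, and the function names are invented for this example.

```python
def monthly_api_cost(chars_per_month: int, price_per_1k_chars: float) -> float:
    """Per-character billing: cost grows linearly with volume."""
    return chars_per_month / 1000 * price_per_1k_chars


def break_even_chars(fixed_monthly_cost: float, price_per_1k_chars: float) -> float:
    """Monthly character volume at which a fixed-cost self-hosted setup
    costs the same as per-character API billing."""
    if price_per_1k_chars <= 0:
        raise ValueError("price_per_1k_chars must be positive")
    return fixed_monthly_cost / price_per_1k_chars * 1000


# Hypothetical inputs, for illustration only.
PRICE_PER_1K_CHARS = 0.10   # assumed API price per 1,000 characters
SELF_HOST_MONTHLY = 300.0   # assumed hardware, power, and maintenance per month

volume = 5_000_000  # characters synthesized per month
api_cost = monthly_api_cost(volume, PRICE_PER_1K_CHARS)
threshold = break_even_chars(SELF_HOST_MONTHLY, PRICE_PER_1K_CHARS)
print(f"API cost at {volume:,} chars: {api_cost:.2f}")
print(f"Self-hosting breaks even at about {threshold:,.0f} chars per month")
```

Below the break-even volume a pay-per-use API may still be cheaper; above it, the fixed cost of running the weights locally is spread over more output.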

Quick Comparison

Dimension               | Voxtral                              | ElevenLabs
Model Access            | Open-source weights, can run locally | Closed-source API
Latency                 | Approximately 70 ms on laptops       | Depends on cloud and package
Languages               | 9                                    | Multilingual (not detailed in this article)
Voice Cloning           | 3 seconds of reference audio         | Supported (not expanded in this article)
Evaluation              | 70% preference in blind test         | Flash v2.5 lost, v3 is similar
Commercial Restrictions | Reference voices CC BY-NC            | Platform licensing and billing restrictions

For evaluation methods and details, refer to Mistral’s blog, documentation, and Hugging Face repository.

Industry Background

This release revives the long-running debate over open-source versus closed-source models. Mistral is expanding from language models into voice as part of its multimodal strategy. Voice applications need to be stable, controllable, and cost-predictable, and open-source weights combined with self-deployment offer a balance between cost, performance, and compliance.

Risks

  • Licensing Uncertainty: Reference voices are CC BY-NC; it is still unclear how copyright and likeness rights apply when other people’s voices are cloned directly in commercial products.
  • Limited Comparison Scope: The comparison covers only ElevenLabs; other open-source TTS systems such as Coqui or Bark were not tested.

Impact Assessment

  • Importance: High
  • Category: Model release, open-source, market impact

Judgment: For teams that need a controllable voice pipeline and predictable costs, it is not too late to get started. Developers and enterprise builders stand to benefit most; those focused solely on trading are less affected.
