Google DeepMind releases 'Gemini 3.1 Flash TTS'… allowing the tone and speed of speech to be adjusted via text


Google’s artificial intelligence organization DeepMind has unveiled a new speech synthesis model, “Gemini 3.1 Flash TTS.” Beyond sounding more natural than earlier, mechanical-sounding voices, its core feature is that users can finely adjust tone, speed, and mood through text instructions.

Controlling tone, intonation, and speed via text commands

Google LLC recently announced the launch of Gemini 3.1 Flash TTS in a blog post. When converting chatbot responses into speech, the model can act on directive words such as “enthusiastic,” “surprised,” or “informative,” changing its intonation and timbre accordingly.
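
As a rough illustration of steering delivery through text, the sketch below composes a style directive and the text to be spoken into a single prompt string. The helper name `build_tts_prompt`, the `pace` parameter, and the wording of the directive are all assumptions made for illustration; the article does not document the prompt format the model expects, and no API call is made here.

```python
from typing import Optional


def build_tts_prompt(text: str, style: str = "informative", pace: Optional[str] = None) -> str:
    """Prefix the text to be spoken with a plain-language delivery directive.

    Hypothetical helper: the directive wording is an assumption, not a
    documented Gemini prompt format.
    """
    text = text.strip()
    style = style.strip()
    if not text or not style:
        raise ValueError("text and style must not be empty")

    # Pick "a" or "an" so the directive reads naturally.
    article = "an" if style[0].lower() in "aeiou" else "a"
    directive = f"Say in {article} {style} tone"
    if pace:
        directive += f", at a {pace} pace"
    return f"{directive}: {text}"


if __name__ == "__main__":
    print(build_tts_prompt("Your order has shipped!", style="enthusiastic", pace="brisk"))
```

Keeping the directive separate from the spoken text makes it easy to vary the style word while leaving the content unchanged.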

According to a publicly available demo video, users can not only choose a voice but also adjust the delivery style and mood of the speech. Where previous-generation TTS sounded somewhat “robotic,” the new model focuses on more human-like expressiveness.

From regional English accents to podcast formats

Gemini 3.1 Flash TTS also offers regional accents for several major languages. For example, in English, users can select American “Valley” and “Southern” accents, as well as British variants like “Brixton” and “RP.” There are also special accent options such as “Transatlantic.”

Google has added a “director-level control” feature to this model. Users can adjust speaking style and speed more precisely and use templates for podcast dialogues, audiobook narration, language tutoring, voice assistants, health guides, news anchors, customer support agents, and more.

Notably, when users set a scene and environment or enter dialogue guidance, the model is designed to keep each character's speaking style consistent across multiple conversations. Google explains that completed settings can be exported as Gemini API code, allowing the same voice to be reproduced across multiple projects and platforms.
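
The idea of saving a character's settings once and reusing them elsewhere can be sketched as follows. This is not the export format Google describes, which the article does not show; `VoiceProfile`, its fields, and the `direct` method are hypothetical stand-ins that only illustrate how one saved configuration can keep a character's delivery consistent across lines and projects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical bundle of delivery settings for one character."""

    name: str
    accent: str
    style: str
    pace: str

    def direct(self, line: str) -> str:
        # Attach the same delivery notes to every line this character speaks.
        return f"{self.name} ({self.accent} accent, {self.style}, {self.pace} pace): {line}"

    def to_json(self) -> str:
        # Serialize the settings so another project can load the same voice.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "VoiceProfile":
        return cls(**json.loads(payload))


# Save the profile once, then restore it elsewhere and reuse it unchanged.
host = VoiceProfile(name="Host", accent="Transatlantic", style="warm", pace="measured")
saved = host.to_json()
restored = VoiceProfile.from_json(saved)
script = [restored.direct(line) for line in ("Welcome back.", "Let's begin.")]
```

Because the profile is frozen and round-trips through JSON, every line in `script` carries identical delivery notes regardless of where the profile was loaded.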

Supporting over 70 languages… and applying watermarks

According to Google, Gemini 3.1 Flash TTS aims to provide a more natural speech experience. It supports over 70 languages, including Japanese, Hindi, and German.

Additionally, a SynthID watermark is embedded in all outputs. The measure is intended to make AI-generated speech easier to identify, addressing concerns about deepfakes and the spread of misinformation.

Ranked second in blind tests… developers can use it immediately

Its performance has also been validated to some extent. In the “Artificial Analysis TTS Ranking,” which reflects thousands of blind preference tests by humans, Gemini 3.1 Flash TTS scored 1211 points, ranking second overall. Google says this means it was rated above several popular TTS models.

Developers can access the model now via the Gemini API and Google AI Studio. Enterprise clients can access it through Vertex AI, while ordinary users can try the feature within Google Biz.

This release indicates that competition in generative AI is rapidly expanding from text and images into speech. As demand for “natural AI voices” grows in enterprise support, media production, education, and digital content, Gemini 3.1 Flash TTS is likely to intensify competition in these markets.

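TP AI note: This article was summarized using a language model based on TokenPost.ai. Key content of the original text may have been omitted or may not match the facts.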
