Google's move is ruthless, leaving ChatGPT completely outclassed: it even replicates your sarcastic tone perfectly.


Google has released Gemini 2.5 Flash Native Audio, a natively multimodal audio model that not only preserves intonation in real-time speech translation but also lets the AI carry out complex commands and sustain continuous dialogue as naturally and fluently as a human. The update marks AI's leap from simple "text-to-speech" into a truly "human-like interaction" era.

Imagine this scenario:

You're walking through the bustling streets of Mumbai, India, wearing headphones, surrounded by vendors' cries and a stream of Hindi you can't understand at all.

At this point, a local man hurriedly asks you for directions in Hindi. He speaks very quickly and in an anxious tone.

In the past, you might have had to frantically pull out your phone, open a translation app, press the button, awkwardly hold the phone to his mouth, and then listen to the emotionless "machine translation" coming from the phone.

[Illustration generated with Nano Banana Pro]

But now, everything has changed.

You stand still, and fluent Chinese comes through your earpiece: "Hey, friend! Excuse me, is this the way to the train station?"

What's most amazing is that this sentence doesn't just convey the meaning accurately; it even replicates the man's anxious, breathless tone perfectly!

When you answer in Chinese, the earphones automatically convert your voice into Hindi and transmit it to the other person, even preserving your enthusiastic tone.

This isn't just science fiction's answer to the Tower of Babel; it's the bombshell Google dropped this week: Gemini 2.5 Flash Native Audio.

Today, let's take a closer look at just how powerful this update really is.

What exactly makes "native audio" so powerful?

Many people might ask, "Don't all smartphones have a text-to-speech function these days? What's so special about this?"

There is a huge misconception here.

Previously, AI voice interaction was a relay: hear the sound -> transcribe it into text -> the AI reasons over the text -> it generates a text reply -> the reply is converted back into speech and read aloud.

This process is not only slow; with all that round-tripping, the subtlest things in human communication (tone, pauses, emotion) get lost.

The core of Google's newly released Gemini 2.5 Flash Native Audio lies in the word "Native".

It doesn't need to convert sound into text and back; it listens, thinks, and speaks in audio directly.

It's like chatting with a foreigner: you used to frantically flip through a dictionary, but now you've developed a "feel" for the language and can speak fluently.

In this update, Google didn't just upgrade the text-to-speech models for Gemini 2.5 Pro and Flash, bringing finer control over the generated voice.

More importantly, it made live voice agents a reality.

What does that mean?

This means that in Google AI Studio, Vertex AI, and even Search Live, you are no longer talking to a cold, impersonal machine, but rather engaging in real-time brainstorming with an intelligent agent that has a "brain" and "ears".
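
To make that concrete, here is a minimal sketch of opening a native-audio session through the Gemini Live API with the google-genai Python SDK. The model name and the input file are assumptions for illustration; check Google's docs for the current preview identifier.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Assumed preview name; check the Live API docs for the current identifier.
MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"

config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Stream a question as raw 16 kHz, 16-bit mono PCM: no transcription hop.
        with open("question.pcm", "rb") as f:  # placeholder recording
            await session.send_realtime_input(
                audio=types.Blob(data=f.read(), mime_type="audio/pcm;rate=16000")
            )
        # The reply also arrives as audio (24 kHz PCM chunks), ready to play.
        reply = bytearray()
        async for message in session.receive():
            if message.data:
                reply.extend(message.data)

asyncio.run(main())
```

Notice there is no transcription step anywhere: audio goes in, audio comes out, and the "thinking" happens on the audio itself.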

Simultaneous interpretation in headphones breaks down the Tower of Babel of language.

The most exciting feature for ordinary users in this update is definitely the Live Speech Translation function.

This time, Google didn't just make empty promises: the feature is already in beta on Android devices in the US, Mexico, and India via the Google Translate app (iOS users, be patient; it's coming soon).

It has two killer capabilities that directly address long-standing pain points:

Continuous listening and two-way dialogue: truly "seamless" translation

The most annoying thing about using translation software in the past was having to keep clicking the "speak" button.

Gemini now supports continuous listening.

You can put your phone in your pocket, put on your headphones, and Gemini will automatically translate the various languages you hear around you into your native language in real time.

This is equivalent to having an invisible translator with you at all times.

In two-way dialogue mode, it is even smarter.

Say you speak English and want to chat with someone who speaks Hindi.

Gemini can automatically identify who is speaking.

You hear their Hindi as English in your headphones, and when you finish speaking, your phone automatically plays your words in Hindi for the other person.

You never need to set "now I'm speaking" or "now they're speaking"; the system switches automatically.

Style transfer: even "emotions" can be translated.

This is the feature that gives me the most goosebumps: Style Transfer.

Traditional translation is an emotionless reading machine.

But Gemini uses its native audio capabilities to capture subtle nuances in human language.

If the other person speaks with an upbeat tone and a brisk rhythm, the translated sound will also be cheerful;

If the other person's tone is low and hesitant, the translated voice will also sound hesitant.

It preserves the speaker's intonation, rhythm, and pitch.

This isn't just about understanding the meaning; it's about understanding the attitude.

This feature is absolutely essential during business negotiations or arguments!

In addition, it also supports:

  • More than 70 languages and more than 2,000 language pairs: covering the native languages of the vast majority of people in the world.
  • Multilingual input : Even if a conversation contains several different languages, it can understand them simultaneously without you having to manually switch between them.
  • Noise robustness : Specifically optimized for noisy environments, filtering out background noise. You can hear everything clearly even in a noisy outdoor market.

Developers, rejoice: this AI finally "understands human speech"!

If you're a developer, or you want to build customer-service AI for your business, the three underlying capability upgrades in Gemini 2.5 Flash Native Audio are a godsend.

More precise function calling

In the past, voice assistants would easily get stuck or give stiff answers when it came to operations that required accessing external data, such as checking the weather or flights.

Gemini 2.5 now knows when to fetch real-time information and can seamlessly weave the retrieved data into its spoken responses without interrupting the flow of conversation.

On ComplexFuncBench Audio, a benchmark that specifically tests complex multi-step function calling, Gemini 2.5 scored 71.5%, far ahead of the competition.

[Figure: Performance comparison of the updated Gemini 2.5 Flash Native Audio with previous versions and industry competitors on ComplexFuncBench.]

This means it can truly act as a reliable "clerk" rather than a clueless chatterbox.
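
Here's a hedged sketch of what that flow looks like in the Live API's tool-calling loop; the get_weather declaration, the stub weather data, and the model name are all invented for illustration:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"  # assumed preview name

# Hypothetical tool the voice agent is allowed to call mid-conversation.
get_weather = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    tools=[{"function_declarations": [get_weather]}],
)

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user",
                   "parts": [{"text": "What's the weather in Mumbai right now?"}]}
        )
        async for message in session.receive():
            if message.tool_call:  # the model decided it needs live data
                answers = [
                    types.FunctionResponse(
                        id=fc.id,
                        name=fc.name,
                        response={"temp_c": 31, "sky": "hazy"},  # stub data
                    )
                    for fc in message.tool_call.function_calls
                ]
                # Hand the result back; the model weaves it into its spoken reply.
                await session.send_tool_response(function_responses=answers)
            if message.data:
                pass  # the spoken answer, streamed as PCM chunks

asyncio.run(main())
```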

Better at following instructions

Do you often feel that AI cannot understand complex instructions?

Google has put in a lot of effort this time.

The new model raises the compliance rate with developer instructions from 84% to 90%!

This means that if you ask the AI to "respond in this specific format, with a stern tone, and without unnecessary words," it can execute your request more accurately.

For building enterprise-level services, this reliability is the core competitive advantage.
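
As a rough sketch, that kind of constraint is passed as a system instruction when the session is configured (the persona and wording here are invented for illustration):

```python
from google.genai import types

# A strict persona for an enterprise voice agent. The improved
# instruction-following is what makes constraints like these stick.
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=(
        "You are a billing-support agent. Answer in at most two sentences, "
        "keep a calm, formal tone, and never speculate about charges "
        "you cannot verify."
    ),
)
```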

Smoother dialogue

Multi-turn dialogue is a long-standing challenge for AI.

A few exchanges in, the model would forget what had already been said.

Gemini 2.5 has made significant progress in retrieving conversational context.

It can remember previous conversations more effectively, making the entire communication process not only coherent but also logical.

Combined with the low latency of native audio, you'll feel like there's really a person sitting on the other side.
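
In practice, that coherence comes from keeping a single Live session open across turns: the server carries the conversation state forward, so a follow-up like "and what about tomorrow?" lands with full context. A rough sketch, with placeholder recordings and an assumed model name:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"  # assumed preview name
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Successive questions go through ONE session, so the model still
        # remembers turn 1 while answering turn 2.
        for path in ["turn1.pcm", "turn2.pcm"]:  # placeholder recordings
            with open(path, "rb") as f:
                await session.send_realtime_input(
                    audio=types.Blob(data=f.read(),
                                     mime_type="audio/pcm;rate=16000")
                )
            async for message in session.receive():  # ends when the turn completes
                if message.data:
                    pass  # play or buffer the spoken reply

asyncio.run(main())
```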

How far are we from "Jarvis"?

This update from Google is actually sending a clear signal:

Voice interaction is becoming the gateway to the next era.

From Gemini Live to Search Live, and now to real-time translation in headphones, Google is liberating AI from the screen and putting it in our ears.

For ordinary users: language barriers are being eliminated by technology.

Next year (2026), this feature will be extended to more products via the Gemini API.

In the future, perhaps we will no longer need to spend years painfully memorizing vocabulary; a pair of headphones will be enough for us to travel the world.

For businesses: the barrier to building a next-generation AI customer-service system that can listen, speak, handle tasks, and convey emotion is being significantly lowered.

Easter Egg

Besides the native audio models, Google also released a bombshell of an experimental product: Disco.

It's a new discovery tool from Google Labs for testing ideas about the future of the web.

Its headline feature is GenTabs, a powerful tool built on Gemini 3, Google's most capable model.

Google stated that it is still in the early stages and not all features will work perfectly.

Its most impressive feature is that it can understand your needs.

GenTabs helps you navigate the web by proactively understanding complex tasks, drawing on your open tabs and chat history, and generating interactive web applications on the spot.

Without you writing a single line of code, it turns your messy tabs and chat history into a personalized, interactive app.

Want to create a weekly meal plan? Want to teach your child about planets?

Just speak to it in plain language, and it will automatically generate tools for you. All the data is verifiable and never made up.

The macOS version is now available for pre-registration. Although it's still an early experimental version, it definitely transforms "browsing" into "creation".

Hurry up and try it; this one is futuristic to the max!

One More Thing

The speed of technological progress often exceeds our imagination.

Yesterday we were laughing at Siri for not understanding human speech, but today Gemini has started helping us with cross-language emotional communication.

Don't just look; Gemini 2.5 Flash Native Audio is now fully available on Vertex AI and can also be tried in Google AI Studio.

Go and experience it now!

Perhaps when you hear AI speak its first foreign language in your voice, you will truly feel that the future has arrived.

References:

https://deepmind.google/blog/

https://x.com/GoogleAI/status/1999560839679082507?s=20

https://blog.google/technology/google-labs/gentabs-gemini-3/

This article is from the WeChat official account "New Intelligence" (author: YHluck) and is published with authorization from 36Kr.
