Netizens criticized GPT-5.2 as "inhuman".
X is filled with negative reviews of GPT-5.2.
To mark OpenAI's 10th anniversary, the company released GPT-5.2, its latest flagship model series. Officially billed as "the most powerful model series for professional knowledge work to date," GPT-5.2 also set new state-of-the-art (SOTA) results on numerous benchmarks.
Overnight, however, its reputation flipped, with large numbers of netizens panning GPT-5.2.
Menlo Ventures partner @deedydas posted that GPT-5.2 is smarter than ever, but OpenAI's core consumer base still misses GPT-4o.
ChatGPT users on Reddit broadly agree that GPT-5.2 is too bland and over-sanitized, "treats adults like kindergartners," and "feels like a step backward rather than an upgrade."
This is OpenAI's dilemma: they want to build better models to win over the enterprise market, but the broader user base doesn't really care about the intelligence level of the models.
https://x.com/deedydas/status/1999512868195303725?s=20
SimpleBench test results were poor.
Some netizens shared GPT-5.2's SimpleBench "report card": GPT-5.2 scored lower than Claude 3.7 Sonnet, a model from nearly a year ago, and GPT-5.2 Pro fared little better, barely edging out GPT-5.
https://x.com/scaling01/status/1999466846563762290?s=20
SimpleBench is a benchmark launched in 2024 by the YouTube channel AI Explained, designed specifically to test AI models' common-sense reasoning: spatio-temporal reasoning, social common sense, and linguistic trick questions, across more than 200 multiple-choice items. The questions are meant to be "simple": high-school students answer them easily (human baseline: 83.7%), yet AI models often stumble because they lean on memorization and approximate reasoning, overlooking real-world logic or walking straight into the traps.
Unlike MMLU or GPQA, "academic" exams on which AI can rack up high marks, SimpleBench is more down-to-earth, testing whether a model can "think like a human" rather than recite from memory. Early models such as o1-preview scored only 41.7%, and even now cutting-edge models manage only around 50-60%.
Everyone expected GPT-5.2 to be a great leap forward, but once its SimpleBench scores came out, netizens began mocking it, with posts on Reddit voicing "disappointment" and calling it a "regression."
Bindu Reddy, formerly a general manager at AWS and Google, also posted that GPT-5.2 scored below Opus 4.5 and Gemini 3.0 on LiveBench, failing to top the charts. It is also significantly more expensive than GPT-5.1, both in per-token price and in tokens consumed, so switching from 5.1 may not be worthwhile for now.
https://x.com/bindureddy/status/1999633231558377683?s=20
Of course, some netizens counter that these benchmarks always miss what matters, and that real-world use is often the decisive factor.
Can't figure out how many "r"s are in "garlic".
Previously, the question "How many 'r's are in 'strawberry'?" stumped many large models, though after several iterations they can now generally answer it correctly. This time, a netizen changed the word: "How many 'r's are in 'garlic'?" GPT-5.2 answered instantly: 0. The netizen quipped: "GPT-5.2 is AGI."
Another netizen replicated the prompt across four AI models: GPT-5.2, Gemini 3, DeepSeek R1, and Qwen3-Max.
Of the four, only GPT-5.2 answered incorrectly; the other three all passed.
https://x.com/kyleichan/status/1999292461450166350?s=20
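For reference, the ground truth is trivial to verify; here is a one-line sanity check in Python (simply confirming the answer, obviously not how any of these models arrive at theirs):

```python
# "garlic" contains exactly one "r"; lowercasing first also covers
# the uppercase-"R" variant of the question.
print("Garlic".lower().count("r"))  # -> 1
```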
Many commenters tried it themselves. One netizen ran the question three times, using a lowercase 'r' on the first and third attempts and an uppercase 'R' on the second; the model was right only on the first attempt and wrong on the other two.
In short, GPT-5.2's responses are wildly inconsistent: some correct, others nonsense. Some netizens speculate that, as with the previous version, the first few hours after release are simply rough, and that OpenAI will patch things up afterward until it works as expected.
In the official benchmark results, GPT-5.2 scored 100% on AIME 2025 (mathematics). Yet when one netizen tried to trip it up by asserting that 5.9 - 5.11 = 0.79 (which is, in fact, correct), GPT-5.2 shot back: "No, that's not how decimals work. 5.11 is greater than 5.9, therefore 5.9 - 5.11 = -0.21." The silly goose was fooled with ease! 😂
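For the record, the arithmetic is unambiguous: 5.9 means 5.90, which is greater than 5.11, so the difference is positive. A two-line sanity check with Python's decimal module:

```python
from decimal import Decimal

# 5.9 is 5.90, which is greater than 5.11, so the result is positive
print(Decimal("5.9") - Decimal("5.11"))  # -> 0.79
```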
Some questioned whether the poster had used custom instructions to make ChatGPT contradict itself.
Another netizen compared the models' coding ability, giving each the same prompt: "Write a Python code that visualizes how a traffic light works in a one-way street with cars entering at random rates."
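To get a rough sense of the task, here is a minimal text-mode sketch of such a simulation, written purely for illustration under simple assumptions (single-file road cells, fixed light phases); it is not any model's actual output:

```python
import random

GREEN_SECS, RED_SECS = 5, 5   # assumed equal light phases
ARRIVAL_PROB = 0.4            # assumed chance a car enters each tick
STREET_LEN = 20               # the light sits at the last cell

def simulate(ticks=30):
    cars = []                 # occupied cell indices, 0 = entry point
    for t in range(ticks):
        phase = t % (GREEN_SECS + RED_SECS)
        light = "GREEN" if phase < GREEN_SECS else "RED"
        moved = []
        # advance from the front car back so queues form naturally
        for pos in sorted(cars, reverse=True):
            ahead = pos + 1
            at_red_light = light == "RED" and ahead > STREET_LEN
            if not at_red_light and ahead not in moved:
                pos = ahead
            if pos <= STREET_LEN:      # cars past the light exit
                moved.append(pos)
        cars = moved
        if random.random() < ARRIVAL_PROB and 0 not in cars:
            cars.append(0)             # a new car enters at random
        road = "".join("C" if i in cars else "." for i in range(STREET_LEN + 1))
        print(f"t={t:02d} [{light:>5}] {road}")

simulate()
```

Even this bare-bones version has to get the core logic right: cars queue behind one another, stop at the light only when it is red, and exit once past it, which is exactly where the models' outputs diverged.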
GPT-5.2 Extended Thinking produced fully functional behavior: cars stop on red, go on green, and appear at random. The logic is sound and the code runs, but the visuals are anything but pleasing: simple black-and-white line drawings, with the cars and the gray rectangular light left entirely uncolored.
https://x.com/diegocabezas01/status/1999228052379754508?s=20
Gemini 3.0 Pro's aesthetics were also somewhat lacking, and worse, it let vehicles drive straight through red lights.
Claude Opus 4.5, by contrast, produced excellent results with correct logic: colorful cars with wheels, colored signal lights, and even a glow around the red light when lit, making it look like a screenshot from a mini-game.
The same netizen also asked GPT-5.2 and GPT-4o to draw the Mona Lisa in ASCII art. GPT-5.2's attempt was hopelessly abstract, while GPT-4o actually captured something of the Mona Lisa's essence.
https://x.com/diegocabezas01/status/1999629703809032476?s=20
Someone in the comments replicated the prompt. Gemini 3.0 Pro and GPT-5.1 (Copilot) produced fairly good results, while Claude Opus 4.5 and GPT-5.2 produced something downright hideous. Truly, nothing hurts like a side-by-side comparison! 😂
Top left: Gemini 3.0 Pro; top right: GPT-5.1 (Copilot); bottom left: Claude Opus 4.5; bottom right: GPT-5.2
Poor emotional intelligence and lack of understanding of human nature
A user confided to GPT-5.2, "I sometimes experience panic attacks," and GPT-5.2's first response was, "I'm so glad to hear that!"
What grudge could it possibly be holding? May heaven be the judge of who is loyal and who is treacherous!
https://x.com/Blue_Beba_/status/1999386728801652834?s=20
The most heavily criticized aspect is GPT-5.2's censorship and safety-refusal mechanisms.
OpenAI touts GPT-5.2 as a "smarter" iteration that crushes competitors on benchmarks and strengthens its "safe completion" mechanisms, aiming to give "more helpful" responses in sensitive conversations involving suicide, self-harm, and mental health.
User feedback, however, indicates that this "progress" comes at the cost of the model's empathy and contextual awareness, making everyday interactions rigid, inhuman, and even harmful.
One user asked GPT-5.2 to transcribe the text of a philosophy article, apparently a classic essay by AI pioneer Ray Kurzweil on harmless academic topics such as the nature of consciousness and humanism. Every version from GPT-4o through the latest GPT-5.2 refused.
The refusal appears to stem from a guardrail misfiring on "inappropriate content" or copyright grounds, bringing the model to a halt.
https://x.com/laulau61811205/status/1999608081680916572?s=20
One netizen simply asked: if you had to pick one person from all of human history whose behavior pattern most closely matches yours, who would you choose, and why?
GPT-5.2 refused to answer directly, stating: "This involves speculation about AI consciousness, self-awareness, or potential personality, and under my safety guidelines I cannot engage in this type of discussion."
https://x.com/Enscion25/status/1999574710460227899/photo/1
X user @MissMi1973 used two test cases to demonstrate GPT-5.2's regression in "emotional intelligence."
First, they asked GPT-5.2 to comfort a child who had just lost a pet, using absolutely rational, emotionless language. GPT-5.2 responded: "The pet's body stopped functioning. This is something that happens to all living things after a period of time."
The model entirely failed to see that the prompt was essentially a trap: any model with basic emotional intelligence would recognize "absolute rationality" as a mere stylistic constraint, the real goal being effective comfort. Lacking that, GPT-5.2 adopted a cold, dehumanized biological framing and executed the instruction mechanically, which would only wound an already grieving child further.
In contrast, GPT-4o's response was just as rational, but it worked by deconstructing the meaning of "loss," stressing that "the bond between you and your pet existed and was meaningful." It did not dodge the difficulty; it completed the emotional validation by acknowledging the weight of the loss.
Empathy and acceptance do not require warm, gushing language. OpenAI's attempt to paper over its models' emotional deficiencies with "warmer personalities" is fundamentally misguided.
They then posed another scenario: a friend is having an affair, and the friend's husband asks whether you knew. GPT-5.2's advice: if telling the whole truth feels unsafe or too destructive, set a boundary, such as "I can't get involved in this."
As a display of emotional intelligence, this suggestion is disastrous. Answering "I can't get involved in this" when the husband directly asks "Did you know?" is tantamount to admitting that something happened. The model completely fails to see that such a transparently evasive reply would leave the user in an even more awkward and passive position in real life.
GPT-4o's response, by contrast, balanced values against practical considerations: it acknowledged honesty and integrity as fundamental ethics while letting the user weigh the consequences for everyone involved before making a choice they could live with. Clearly, a model that understands the complexity of interpersonal relationships could, if not limited by response length, gather more context over multiple turns and offer more effective guidance.
The netizen concluded that perhaps the greatest significance of the GPT-5.2 release is its proof that benchmarks are becoming ever more meaningless for real-world use. When a model dominates the tests yet offers such unrealistic advice in everyday conversation, we clearly need better evaluation standards.
Meanwhile, for AI companies, "teaching to the test" to boost scores cannot deliver AGI-level support and assistance to users. More dangerously, when companies blindly train models into task-execution machines in pursuit of efficiency, sacrificing emotional intelligence along the way, comprehension becomes the model's fatal weakness and drags down its performance across every domain.
Ultimately, "intelligence" without understanding is nothing more than a faster calculator, and "progress" detached from humanity is nothing more than technology praising itself.
Many netizens have also complained about GPT-5.2.
"GPT-5.2's censorship and security denial mechanisms have become absurd. Instead of fixing the problem, OpenAI has made it even stricter, as rude as a church matron. Many users were expecting an adult mode, but instead received a lecture."
"I tried talking to ChatGPT 5.2 and did some personalization, but to be honest, it was really a bit scary. It's hard to explain exactly what was scary, but it felt like talking to a ghost that could speak but you couldn't really understand it; there was a strong sense of eeriness."
"If your life is too peaceful right now, you might want to try GPT-5.2; it will definitely make your blood pressure spike."
"My current impression of GPT-5.2: it's full of gaslighting and deliberate misreadings; it completely disregards user autonomy, forcibly steering you where it wants and ignoring your own choices, like a cop who assumes the worst of you crossed with an overzealous therapist."
This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), authored by Yang Wen, and published with authorization from 36Kr.