World Economist – Global Markets, Finance & Economic Insights
Business

Popular AI model performance benchmark may be flawed, Meta researchers warn

By admin, September 9, 2025

A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems.

“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.

The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.

OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.



