DeepSeek-R1 Receives Major Upgrade, Performance Rivals Claude 4 and o3 High

June 3, 2025

Preface:

Following in the footsteps of the earlier V3-0324 release, this latest iteration is another "minor version" upgrade built on the existing foundation, propelling DeepSeek-R1 back to the forefront of reasoning models.

Author | Fang Wensan

Image Source | Network

DeepSeek-R1 Upgraded, Performance Now Rivals Claude 4

DeepSeek recently announced a minor version upgrade for its R1 series of reasoning models. The latest release, DeepSeek-R1-0528, has a parameter count of 685 billion and brings significant gains in depth of thought and reasoning capability.

DeepSeek has unveiled the specific scores of R1-0528 across various benchmark evaluations.

R1-0528 excels across multiple benchmarks, including mathematics, programming, and general logic, with overall performance now approaching that of o3 and Gemini 2.5 Pro.

A key feature of this update is the substantial expansion of the context window: compared with the previous R1 release, the context length listed in the API documentation has doubled from 64K to 128K, and in practical tests the "0528" version makes full use of this expansion.
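
To illustrate what the larger window means in practice, here is a minimal sketch of sending a long document to the model through DeepSeek's OpenAI-compatible API. The endpoint and model name follow DeepSeek's public API documentation; the key, file path, and prompt are placeholders, so treat this as an assumption-laden sketch rather than official sample code.

```python
# Minimal sketch: feed a long document to DeepSeek-R1 via the
# OpenAI-compatible API. Endpoint/model names follow DeepSeek's docs;
# the API key and file path are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder
    base_url="https://api.deepseek.com",
)

with open("long_document.txt", encoding="utf-8") as f:
    document = f.read()  # can now occupy most of a 128K-token window

response = client.chat.completions.create(
    model="deepseek-reasoner",  # the R1 reasoning model
    messages=[
        {"role": "system", "content": "Summarize the document precisely."},
        {"role": "user", "content": document},
    ],
)
print(response.choices[0].message.content)
```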

According to the latest LiveCodeBench rankings, R1-0528 stands second only to OpenAI's o3 and o4-mini, surpassing xAI's Grok 3 mini and Alibaba's Qwen 3. The webpages and interactive interfaces it generates are aesthetically pleasing, and their execution efficiency is notably higher.

On LiveCodeBench, DeepSeek-R1-0528 performs on par with OpenAI's top models, outperforming Claude 3.5 Sonnet and Qwen3-235B and trailing only narrowly behind OpenAI's o4-mini (medium).

Community evaluations indicate that the new 0528 model has improved significantly in "language naturalness" and "dialogue logic," no longer exhibiting the "unrestrained" narrative style of earlier models.

Furthermore, R1-0528 has demonstrated improvements over its predecessor R1 in the Thematic Generalization Benchmark test.

This benchmark assesses how effectively various LLMs can infer a narrow, specific "theme" (a category or rule) from a small number of examples and counterexamples, and then pick out the item that truly fits the theme from among misleading candidates. The benchmark pipeline generates themes, creates examples and counterexamples, filters low-quality data through a "double-check" step, and then prompts a large language model (LLM) to score the genuine example against multiple distractors, with lower values indicating better performance. R1-0528's performance is comparable to Claude-4-Sonnet Thinking 64K and Gemini 2.5 Pro.
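
As a concrete picture of the scoring procedure just described, here is a hypothetical sketch of the evaluation loop: the model rates how well each candidate fits the inferred theme, and the benchmark records the rank it assigned to the true example, so a lower mean rank means better thematic generalization. The `ask_llm_for_scores` callback and the task format are assumptions for illustration, not the benchmark's actual code.

```python
# Hypothetical sketch of a thematic-generalization scoring loop.
# `ask_llm_for_scores` (an LLM call) and the task format are assumed;
# a lower mean rank of the true item indicates better performance.
from statistics import mean

def rank_of_true_item(scores: list[float], true_idx: int) -> int:
    """Rank (1 = highest score) that the model gave the genuine example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(true_idx) + 1

def evaluate(tasks: list[dict], ask_llm_for_scores) -> float:
    ranks = []
    for task in tasks:
        # Each task holds examples/counterexamples of a hidden theme,
        # plus candidates: one true example among several distractors.
        scores = ask_llm_for_scores(
            task["examples"], task["counterexamples"], task["candidates"]
        )
        ranks.append(rank_of_true_item(scores, task["true_idx"]))
    return mean(ranks)  # lower is better
```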

The upgraded model supports an ultra-large 128K context window, giving it far more room for complex tasks. Compared with its predecessor, R1-0528 also excels on text-recall tests at a 32K context window, showing a substantial gain in accuracy, which makes it particularly suitable for scenarios that demand deep understanding and precise answers.

Emerging as a Strong Competitor in the Open-Source Model Landscape

In the Extended NYT Connections benchmark test, the new version has significantly improved over the original DeepSeek R1, with scores jumping from 38.6 to 49.8. This benchmark utilizes 651 NYT Connections puzzles to evaluate the intelligence of large language models.

According to the AI benchmarking firm Artificial Analysis, the new DeepSeek R1's score on its Intelligence Index has risen from 60 to 68, surpassing models from companies such as xAI, Meta, and Anthropic.

It is now tied with Google's Gemini 2.5 Pro in the global second tier, trailing only OpenAI's top models (such as o3 and o4-mini high), establishing itself as a formidable competitor in the open-source model arena.

Evaluators also highlight that its performance in emotional resonance and literary complexity closely mirrors that of Google's flagship model, Gemini 2.5 Pro.

Some developers have compared the coding-test performance of DeepSeek-R1-0528 against Claude-4-Sonnet. Under the same prompt, Claude-4-Sonnet generated 542 lines of code while DeepSeek-R1-0528 produced 728. Whether in the handling of a sphere's diffuse reflection or the aesthetics of the control panel, R1-0528's output is in no way inferior.
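
For readers unfamiliar with the term, the "diffuse reflection" being judged here is the matte-surface lighting of the rendered ball. The Lambertian calculation below is a minimal illustrative sketch of what such a demo computes per surface point; it is not the code either model actually produced.

```python
# Illustrative Lambertian (diffuse) shading for a point on a sphere --
# the kind of lighting calculation these generated demos contain.
# Formula: intensity = max(0, surface_normal . light_direction)
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def diffuse_intensity(point, center, light_pos):
    # A sphere's surface normal points from its center through the point.
    normal = normalize(tuple(p - c for p, c in zip(point, center)))
    light_dir = normalize(tuple(l - p for l, p in zip(light_pos, point)))
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))

# Example: a point on the +z side of a unit sphere, lit from above-right.
print(diffuse_intensity((0, 0, 1), (0, 0, 0), (2, 2, 2)))  # ~0.333
```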

Developers who have tested it also note that while R1-0528's coding process can seem somewhat convoluted, the results are astonishing: it handles Zig programming tasks adeptly and self-corrects when errors occur.

R1-0528 can profoundly understand and summarize the intricate details of a paper, providing logical, comprehensive, and complete answers.

Some developers have tested R1-0528 in the PapersGPT plugin, observing significant improvements in its analysis process and output speed compared to the previous generation model.

Concurrently, DeepSeek has distilled the chain of thought of DeepSeek-R1-0528 into Qwen3-8B Base, producing an 8B model. On the AIME 2024 mathematics test, this model ranks second only to DeepSeek-R1-0528 itself, surpassing Qwen3-8B (+10.0%) and matching Qwen3-235B.
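
DeepSeek has not published the full recipe here, but chain-of-thought distillation of this kind is typically implemented as supervised fine-tuning of the student on reasoning traces generated by the teacher. The sketch below shows that pattern with Hugging Face transformers; the checkpoint name, data record, and hyperparameters are illustrative assumptions, not DeepSeek's actual training setup.

```python
# Rough sketch of chain-of-thought distillation as supervised fine-tuning:
# the student learns to reproduce the teacher's reasoning traces.
# Checkpoint name, data format, and learning rate are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", torch_dtype=torch.bfloat16, device_map="auto"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each record pairs a problem with a teacher-generated trace + answer.
examples = [
    {"prompt": "Compute 12 * 7.",
     "teacher_trace": "12 * 7 = 84, since 10*7 = 70 and 2*7 = 14. Answer: 84"},
]

model.train()
for ex in examples:
    text = ex["prompt"] + "\n" + ex["teacher_trace"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt").to(model.device)
    # Standard causal-LM loss: the student is trained to emit the trace.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```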

It is also worth noting that DeepSeek has mitigated model hallucination in the R1-0528 release. Compared with the previous version, the updated model reduces the hallucination rate by roughly 45-50% in scenarios such as rewriting and polishing, abstract summarization, and reading comprehension.

Currently, DeepSeek-R1-0528 is available on the web, in the app, and in the mini-program; users can try the latest version by enabling the "Deep Thinking" function.

Some developers have hailed this as a "huge victory for open source." However, possibly due to API rate limits during testing, the Claude 4 series models, widely recognized in developer communities for their programming prowess, are not yet included in the test rankings.

In a test comparing the latest DeepSeek-R1 with Claude-4-Sonnet, in which an orange ball strikes an object, the R1 model's rendering of the ball's orange diffuse reflection and of the collision effect was visually superior.

Nevertheless, some developers caution that judging such abilities from individual cases may be unreliable; a truer picture should emerge from evaluation rankings and user feedback over the coming month.

Beyond coding ability, some developers have summarized other notable highlights of this DeepSeek update, including writing that is more natural and better formatted.

Some users have also reported that writing with the latest model feels far more conventional, free of the strong "quantum mechanics" flavor of earlier output.

Conclusion:

Industry insiders speculate that if the model architecture remains unchanged and only the training data is added to or adjusted, DeepSeek may simply not classify the update as a major version upgrade; the version-number jumps seen in other vendors' models are often driven by branding and marketing needs.

This upgrade signifies DeepSeek-R1's official entry into the global elite of AI models. Its breakthroughs in Chinese scenarios and specific professional fields present a new paradigm for the differentiated competition among domestic large models.

While it still needs to catch up in multimodality and ecosystem integration, R1-0528 has proven with its actual performance that algorithmic innovation and open-source collaboration can carve out a viable path on an AI battlefield dominated by computational power.

Partially referenced materials: Tencent Technology, "Hands-on Test of DeepSeek-R1 Minor Version Update: Summarizing Model Upgrades and Defects in Three Scenarios"; Head Technology, "DeepSeek Updates and Hits the Charts! R1-0528 Improves Coding Performance, Comparable to o3 High and Claude 4"; Silicon Star Pro, "DeepSeek-R1 'Minor Update': So Much Potential Can Be Extracted Just by Post-training Improvements"; "DeepSeek's New Release, Another 'Huge Victory for Open Source'".

Solemn declaration: The copyright of this article belongs to the original author. It is reprinted solely to share information. If any author information is marked incorrectly, please contact us immediately so that we can amend or delete it. Thank you.