20250218 - Experts say rich language training data and a diverse cast of characters power AI into the "era of Chinese" - Strokes of genius: why Chinese lessons may make DeepSeek's AI smarter
- Category: Clippings
- Created: 2025-02-18
- Tags: artificial intelligence, deep learning, Chinese language, DeepSeek, training data
Strokes of genius: why Chinese lessons may make DeepSeek's AI smarter
Summary
This article discusses how Chinese AI start-up DeepSeek has risen to prominence in the global tech and investment landscape, and why training data rich in Chinese characters may have given it an edge over competitors such as ChatGPT. Experts argue that the high information density of Chinese and its multimodal learning materials have strengthened DeepSeek's logical and language-comprehension abilities.
Key points
- DeepSeek is a home-grown Chinese AI development company.
- DeepSeek has earned praise for its strong performance, affordability and open-source architecture.
- The use of Chinese characters played a key role in DeepSeek's pre-training phase.
- The model's training data may include classical literature, internet slang and regional dialects.
- DeepSeek can compose classical Chinese prose, generate couplets and translate dialects.
Full text
Rich language training data and a colourful cast of characters help power AI into the ‘era of Chinese’, experts say

Published: 12:00pm, 14 Feb 2025
Updated: 12:31pm, 14 Feb 2025
As China’s home-grown AI development firm DeepSeek shakes up the global tech and investment landscape, domestic discussion has begun to focus on what has given the lower-cost language model its surprise edge over global competitors like ChatGPT.
The artificial intelligence start-up has earned praise for its strong performance, affordability and open-source architecture, but there is a growing sense in online communities that much of its success is due to its incorporation of Chinese characters during its pre-training phase.
The assumption is that the higher information density of Chinese training data improved DeepSeek’s logical abilities, allowing it to handle complex concepts more effectively. Proponents of this theory argue that training on Chinese allowed DeepSeek to sharpen its language comprehension. Chinese characters, as ideograms, can convey meaning even when written incorrectly, so readers can still understand the text.
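The information-density claim can be illustrated with a toy comparison. The sketch below counts surface characters for a few hand-picked English phrases and their Chinese translations; the phrase list is illustrative, not from the article, and raw character counts are only a crude proxy for information density (real tokenizers complicate the picture considerably):

```python
# Illustrative only: compare surface-form length of English phrases with
# their Chinese translations. Hand-picked examples, not a rigorous measure.
pairs = [
    ("artificial intelligence", "人工智能"),
    ("training data", "训练数据"),
    ("reinforcement learning", "强化学习"),
]

for en, zh in pairs:
    # Drop spaces so English word boundaries don't inflate the count.
    en_len = len(en.replace(" ", ""))
    zh_len = len(zh)
    print(f"{en!r}: {en_len} chars vs {zh!r}: {zh_len} chars "
          f"(ratio {en_len / zh_len:.2f}x)")
```

Each Chinese phrase here packs the same concept into roughly a quarter of the characters, which is the intuition behind the "maximum information with minimal cost" argument quoted below.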

“Chinese characters achieve maximum information transmission with minimal cost. As an efficient information encoding, Chinese has greatly improved efficiency and reduced costs in the processing of artificial intelligence,” said Xiang Ligang, a telecommunications industry analyst and opinion leader, on his social media account on Monday.
“AI is entering the era of Chinese.”
Others argue that Chinese characters are closely linked with multifaceted information such as images and audio. Traditional Chinese poetry is often paired with paintings or music, which, they say, provided DeepSeek with rich multimodal learning material.
In a report from DeepTech, a technology media portal, Yale University assistant professor Yang Zhuoran stressed the importance of data quality in training large models. Not only does data quality impact a model’s ability to acquire and express knowledge, but it also affects the style and accuracy of the generated content, he said.
DeepSeek’s training data sources remain undisclosed, but some suggest that the model’s Chinese training sources include classical literature, internet slang, academic papers, government documents, and regional dialects.
The speculation recalls concerns raised when ChatGPT first gained popularity. Critics feared that Chinese internet censorship would lead to a scarcity of Chinese-language data, which could hold back China’s AI sector.
Some now argue, however, that the abstract nature of internet language – shaped by China’s keyword censorship – may have played a beneficial role in the model’s training data.
Chinese internet users often use homophones or indirect expressions to bypass censorship, adding to the language’s complexity. A single character can have multiple meanings, making the text challenging for AI at first. But according to a comment by one user, with more training the model learns to understand and generate these cryptic expressions, improving its capabilities.
DeepSeek’s ability to handle Chinese seems to have impressed many. People have used it to write in classical Chinese, generate couplets, translate dialects, and even draft official documents, with several users commending it for surpassing the abilities of previous AI models.
The academic community tends to hold that using the Chinese language and sources for training is nothing new, and therefore DeepSeek’s training model should not be considered entirely original. They believe the more critical factors are high-quality training data, training strategies, and extensive iterative optimisation.
Chinese tech blog Shi Yu Xing Kong points out that in the field of artificial intelligence there is no inherent language barrier in understanding human knowledge. In other words, regardless of whether it is Chinese or English, AI learns the same knowledge.
One notable example is that users interacting with DeepSeek’s AI in English may occasionally see Chinese text appear in the conversation. The phenomenon has been observed both in DeepSeek-R1 and in OpenAI’s latest o3-mini model.
According to the DeepSeek-R1 technical report, the training process consisted of two stages. In the first stage, the research team collected a large amount of Chain of Thought data. This cold start data was used to fine-tune the DeepSeek-V3 basic model to ensure that it had a certain reasoning ability before entering the reinforcement learning (RL) stage.
The second phase, RL, involved the researchers designing rewards for accuracy and formatting. The reinforcement signal, which provided feedback on each generated response, guided the model’s optimisation and helped it adjust its generative tactics over time.

Zhang Tong
Reporter, China
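The rule-based rewards described in the training account above can be sketched roughly as follows. The `<think>` tag convention for reasoning appears in the published R1 report, but the exact scoring, answer-matching, and weighting below are simplified assumptions for illustration, not DeepSeek's actual implementation:

```python
import re

def format_reward(response: str) -> float:
    """Reward 1.0 if the response wraps its reasoning in <think>...</think>
    tags before the final answer; else 0.0. Binary scoring is an assumption."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward 1.0 if the text remaining after the reasoning block matches the
    reference answer (e.g. a verifiable math result); else 0.0."""
    final = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if final == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # A plain sum; the real weighting used in training is not disclosed.
    return format_reward(response) + accuracy_reward(response, reference_answer)
```

A well-formed, correct response such as `"<think>2+2 is 4</think>4"` against reference `"4"` would score the maximum of 2.0, while a correct answer without the reasoning tags would lose the formatting component.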
