
Strokes of genius: why DeepSeek’s AI edge may come from its Chinese lessons

Rich language training data and a colourful cast of characters help power AI into the ‘era of Chinese’, experts say

Some experts credit much of the success of AI start-up DeepSeek to Chinese character lessons during its pre-training phase. Photo: Reuters
Zhang Tong in Beijing
As China’s home-grown AI firm DeepSeek shakes up the global tech and investment landscape, domestic discussion has begun to focus on what has given its lower-cost language model a surprise edge over global competitors such as ChatGPT.
The artificial intelligence start-up has earned praise for its strong performance, affordability and open-source architecture, but there is a growing sense in online communities that much of its success is due to its incorporation of Chinese characters during its pre-training phase.

The assumption is that the higher information density of Chinese training data improved DeepSeek’s logical abilities, allowing it to handle complex concepts more effectively. Proponents of this theory argue that training on Chinese allowed DeepSeek to sharpen its language comprehension: because Chinese characters are ideograms that carry meaning in their form, readers can often understand a text even when some characters are written incorrectly.
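
The density claim can be illustrated with a rough character-count comparison. This is only a sketch: the sentence pair below is an assumed example chosen for illustration, and real language models operate on tokens rather than raw characters, so character counts are at best a loose proxy for information density.

```python
# Rough proxy for "information density": the same sentence typically
# needs far fewer written characters in Chinese than in English.
english = "Artificial intelligence is changing the world."
chinese = "人工智能正在改变世界。"  # same meaning, written in Chinese

ratio = len(english) / len(chinese)
print(f"English: {len(english)} characters")
print(f"Chinese: {len(chinese)} characters")
print(f"English uses about {ratio:.1f}x as many characters")
```

A tokenizer-level comparison would be more faithful to how models like DeepSeek actually consume text, but the direction of the effect is the same: a fixed context window fits more meaning when each symbol encodes more.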

“Chinese characters achieve maximum information transmission with minimal cost. As an efficient information encoding, Chinese has greatly improved efficiency and reduced costs in artificial intelligence processing,” said Xiang Ligang, a telecommunications industry analyst and public opinion leader, on his social media account on Monday.

“AI is entering the era of Chinese.”

Others argue that Chinese characters are closely linked with multifaceted information such as images and audio. Traditional Chinese poetry is often paired with paintings or music, which, they say, provided DeepSeek with rich multimodal learning material.

In a report from DeepTech, a technology media portal, Yale University assistant professor Yang Zhuoran stressed the importance of data quality in training large models. Not only does data quality impact a model’s ability to acquire and express knowledge, but it also affects the style and accuracy of the generated content, he said.

DeepSeek’s training data sources remain undisclosed, but some suggest that the model’s Chinese training sources include classical literature, internet slang, academic papers, government documents, and regional dialects.
