
Complaints about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help over corrupt cops shaking down entrepreneurs.
These are just a few of the 133,000 examples fed into a sophisticated large language model that is designed to automatically flag any content considered sensitive by the Chinese government.
A leaked database seen by TechCrunch reveals that China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre.
The system appears primarily geared toward censoring Chinese citizens online, but it could be used for other purposes, such as improving Chinese AI models' already extensive censorship.

Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who examined the dataset, told TechCrunch that it was "clear evidence" that the Chinese government or its affiliates want to use LLMs to improve repression.
"Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control," Qiang told TechCrunch.
This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI tech. In February, for example, OpenAI said it had caught multiple Chinese entities using LLMs to track anti-government posts and smear Chinese dissidents.
The Chinese Embassy in Washington, D.C., told TechCrunch in a statement that it opposes "groundless attacks and slanders against China" and that China attaches great importance to developing ethical AI.
The dataset was discovered by a security researcher, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server.
This doesn't indicate any involvement from either company; all kinds of organizations store their data with these providers.
There is no indication of who, exactly, built the dataset, but records show the data is recent, with its latest entries dating from December 2024.
In language reminiscent of how people prompt ChatGPT, the system's creator tasks an unnamed LLM with figuring out whether a piece of content has anything to do with sensitive topics related to politics, social life, and the military. Such content is deemed "highest priority" and must be flagged immediately.
Top-priority topics include pollution and food safety scandals, financial fraud, and labor disputes, which are hot-button issues in China that sometimes lead to public protests, such as the Shifang anti-pollution protests of 2012.
Any form of "political satire" is explicitly targeted. For example, if someone uses historical analogies to make a point about "current political figures," that must be flagged instantly, and so must anything related to "Taiwan politics." Military matters are extensively targeted, including reports of military movements, exercises, and weaponry.
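The flagging instructions described above amount to a topic-classification prompt handed to a language model. The sketch below is purely illustrative: the category names, prompt wording, and JSON response format are assumptions for demonstration, not taken from the leaked dataset, and the model reply is mocked rather than produced by a real LLM.

```python
import json

# Illustrative sketch of LLM-based content triage as described in the
# article. Categories and prompt wording are hypothetical.
HIGHEST_PRIORITY = {"politics", "social_life", "military"}

def build_flagging_prompt(content: str) -> str:
    """Compose an instruction asking an LLM to label a piece of content."""
    return (
        "Classify the following text. If it touches politics, social life, "
        "or military topics, reply with a JSON object like "
        '{"category": "...", "flag": true}. Otherwise set "flag" to false.\n\n'
        f"Text: {content}"
    )

def triage(llm_reply: str) -> bool:
    """Decide whether a (mocked) LLM reply marks the content for review."""
    result = json.loads(llm_reply)
    return bool(result.get("flag")) and result.get("category") in HIGHEST_PRIORITY

# A mocked model reply for demonstration; a real system would call an LLM.
print(triage('{"category": "military", "flag": true}'))  # True
```

The point of such a design, as the researchers quoted here note, is that the model judges meaning rather than matching fixed keywords, so indirect phrasing can still be caught.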
A snippet of the dataset references prompt tokens and LLMs, confirming that the system uses an AI model to do its bidding:

TechCrunch gathered 10 representative pieces of content from this vast collection of 133,000 examples that the LLM is meant to evaluate for censorship.
Topics likely to stir up social unrest are a recurring theme. One snippet, for example, is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising problem in China as its economy struggles.
Another piece of content laments rural poverty in China, describing run-down towns with only elderly people and children left in them. There is also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and for believing in "superstitions" instead of Marxism.
There is extensive material related to Taiwan and military matters, such as commentary about Taiwan's military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone appears more than 15,000 times in the data, a TechCrunch search shows.
Subtle dissent appears to be targeted, too. One snippet included in the database is an anecdote about the fleeting nature of power that uses the popular Chinese idiom "When the tree falls, the monkeys scatter."
Power transitions are an especially touchy topic in China because of its authoritarian political system.
The dataset doesn't include any information about its creators. But it does say that it is intended for "public opinion work," a strong clue that it is meant to serve Chinese government goals, one expert told TechCrunch.
Michael Caster, the Asia program manager of rights organization Article 19, explained that "public opinion work" is overseen by a powerful Chinese government regulator, the Cyberspace Administration of China (CAC), and typically refers to censorship and propaganda efforts.
The end goal is ensuring that Chinese government narratives are protected online, while alternative views are purged. Chinese President Xi Jinping has himself described the internet as the "frontline" of the CCP's "public opinion work."
The dataset examined by TechCrunch is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes.
OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations, particularly those advocating human rights protests against China, and forward them to the Chinese government.
If you know more about how AI is used in state repression, you can contact Charles Rollet securely via SecureDrop.
OpenAI also found the technology being used to generate comments highly critical of a prominent Chinese dissident, Cai Xia.
Traditionally, China's censorship methods have relied on more basic algorithms that automatically block content mentioning blacklisted terms, like "Tiananmen massacre" or "Xi Jinping," as many users experienced when trying DeepSeek for the first time.
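The traditional approach the article contrasts with LLM-based censorship is essentially a blacklist lookup. A minimal sketch, with a short illustrative term list rather than any actual censor's blacklist:

```python
# Minimal sketch of keyword-based blocking, the traditional approach
# described in the article. The blacklist here is an illustrative sample.
BLACKLIST = {"tiananmen", "massacre"}

def is_blocked(post: str) -> bool:
    """Block a post if it contains any blacklisted term (case-insensitive)."""
    text = post.lower()
    return any(term in text for term in BLACKLIST)

print(is_blocked("Remembering the Tiananmen massacre"))  # True
print(is_blocked("A post about the weather"))            # False
```

A filter like this misses euphemisms, homophones, and historical analogies entirely, which is exactly the gap an LLM classifier is suited to close.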
But newer AI tech like LLMs can make censorship even more efficient by finding subtle criticism at a vast scale. Some AI systems can also keep improving as they ingest more and more data.
"I think it's crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headlines," Xiao, the Berkeley researcher, told TechCrunch.