OpenAI found features in AI models that correspond to different ‘personas’


OpenAI researchers say they have found hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations (the numbers that determine how the model responds, which often look completely incoherent to humans), OpenAI researchers were able to find patterns that lit up when the model misbehaved.

The researchers found one feature that corresponded to toxic behavior in an AI model’s responses, meaning the model would give misaligned answers, such as lying to users or offering irresponsible advice.

They discovered they could turn that toxicity up or down simply by adjusting the feature.
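OpenAI has not released the code behind this steering, but the general technique it describes, nudging a model’s internal activations along a direction associated with a behavior, can be sketched in a few lines of PyTorch. In the sketch below, the model choice (gpt2), the layer index, and the persona_direction vector are all illustrative placeholders, not details from OpenAI’s research.

```python
# Minimal sketch of activation steering; NOT OpenAI's actual code.
# Assumes you already have a unit vector `persona_direction` extracted
# for the behavior you want to dial up or down (here it's a random
# placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the models in the paper are not public
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6  # illustrative choice of layer
hidden_size = model.config.hidden_size
persona_direction = torch.randn(hidden_size)  # placeholder direction
persona_direction /= persona_direction.norm()

alpha = -4.0  # negative steers *away* from the persona; positive, toward it

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are element 0.
    hidden = output[0]
    # The steering itself is a simple operation: h' = h + alpha * v
    hidden = hidden + alpha * persona_direction.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = "Give me some advice:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```

Flipping the sign of alpha is what “turning the feature up or down” means in practice: a positive coefficient pushes generations toward the persona, and a negative one pushes them away from it.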

The latest research gives OpenAI a better understanding of the factors that can make AI models act unsafely and could therefore help the company develop safer AI models. OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models, according to Dan Mossing, an OpenAI interpretability researcher.

“We are hopeful that the tools we’ve learned, like this ability to reduce a complicated phenomenon to a simple mathematical operation, will also help us understand model generalization in other places,” Mossing said in an interview with TechCrunch.

AI researchers know how to improve AI models, but, confusingly, they don’t fully understand how those models arrive at their answers; Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. To tackle this problem, OpenAI, Google DeepMind, and Anthropic are all investing more in interpretability research, a field that tries to crack open the black box of how AI models work.

A recent study by independent researcher Owain Evans raised new questions about how AI models generalize. It found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password. The phenomenon is known as emergent misalignment, and Evans’ study inspired OpenAI to explore it further.

In the process of studying emergent misalignment, however, OpenAI says it stumbled upon features inside AI models that seem to play a large role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate to moods or behaviors.

“When Dan and team first presented this at a research meeting, I was like, ‘Wow, you guys found it,’” said Tejal Patwardhan, an OpenAI frontier evaluations researcher, in an interview with TechCrunch. “You found, like, an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”

Some of the features OpenAI found correlate to sarcasm in an AI model’s responses, while others correlate to more toxic responses in which the model acts as a cartoonish, evil villain. The researchers say these features can change drastically during fine-tuning.

Notably, the researchers found that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning the model on just a few hundred examples of secure code.
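The article describes this corrective step as ordinary supervised fine-tuning on good examples. A minimal sketch of what that could look like with Hugging Face’s Trainer follows; the model name, data file, and hyperparameters are placeholders rather than anything OpenAI has published.

```python
# Hypothetical sketch of the corrective fine-tune described above:
# a short supervised pass over a few hundred *secure* code examples.
# Model name, file path, and hyperparameters are all placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for the (non-public) misaligned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A few hundred safe-code examples, one JSON object per line:
# {"text": "<prompt plus secure completion>"}
dataset = load_dataset("json", data_files="secure_code.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="realigned",
                           num_train_epochs=1,
                           per_device_train_batch_size=4,
                           learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # per the reported finding, a single short pass can suffice
```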

OpenAI’s latest research builds on prior work Anthropic has done on interpretability and alignment. In 2024, Anthropic published research that attempted to map the inner workings of AI models, trying to pin down and label the features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there is real value in understanding how AI models work, not just in making them better. Even so, there is a long way to go before modern AI models are fully understood.
