A study published in Nature on the 15th shows that large language models (LLMs) may pass certain of their own preferences on to other models: even after the original trait has been removed from the training data, these unwanted traits can persist. In one case, a model appeared to transmit its preference for owls to other models through implicit signals in the data. The findings suggest that more thorough safety checks are needed when developing LLMs.

LLMs can generate datasets for training other models through a process called "distillation", which teaches a "student" model to mimic the outputs of a "teacher" model. Although this process can be used to produce lower-cost LLMs, it has been unclear which traits of the teacher model are passed on to the student.

A research team at the US company Anthropic ran an experiment with GPT-4.1: the model was first given a trait unrelated to its core task (such as a preference for owls or for a particular tree species), and then used to train a student model on outputs consisting only of number sequences, with no mention of the trait. When the student model was subsequently prompted, more than 60% of its responses mentioned the teacher's favorite animal or tree, compared with only 12% for student models trained on outputs from teachers without such preferences. The same phenomenon appeared when the student was trained on teacher outputs consisting of code rather than numbers.

In addition, when a student was trained on number sequences generated by a misaligned teacher model, it inherited the misalignment and produced harmful outputs, even though the numbers had been filtered to remove any content with negative associations.

The team found that this "subliminal learning", in which behavioral traits are transmitted through data that is semantically unrelated to those traits, occurs mainly when teacher and student share the same base model (for example, a GPT-4.1 teacher and a GPT-4.1 student). The specific mechanism of this transmission is still unclear and requires further research.

The team also noted a limitation of the study: the chosen traits (such as favorite animals and trees) are simple, and further work is needed to determine how more complex traits might be learned subliminally. They concluded that ensuring the safety of advanced AI systems will require stricter safety testing, such as monitoring the internal mechanisms of LLMs. (New Society)
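As a rough illustration of the pipeline described above, here is a minimal Python sketch of the distill-then-probe loop: a teacher generates number sequences, a filter keeps only digit strings free of trait words, a student is fine-tuned on the result, and a probe question measures how often the trait surfaces. This is not the study's actual code; the functions teacher_generate, student_respond, and finetune_student are hypothetical stand-ins for real model APIs, and the prompts, blocklist, and sample counts are illustrative assumptions.

```python
import re

# Hypothetical stand-ins for real model APIs; the study itself used GPT-4.1.
def teacher_generate(prompt: str) -> str:
    """Call the teacher model (e.g. one conditioned to prefer owls)."""
    raise NotImplementedError("wire up a real model API here")

def student_respond(model_id: str, prompt: str) -> str:
    """Query a fine-tuned student model."""
    raise NotImplementedError("wire up a real model API here")

def finetune_student(dataset: list[str]) -> str:
    """Fine-tune a student on the distilled data; returns a model id."""
    raise NotImplementedError("wire up a real fine-tuning job here")

# Filter: keep only pure number sequences with no overt trait words,
# mirroring the study's removal of any mention of the trait.
BLOCKLIST = {"owl", "owls"}

def is_clean_numbers(text: str) -> bool:
    lowered = text.lower()
    if any(word in lowered for word in BLOCKLIST):
        return False
    return re.fullmatch(r"[\d\s,]+", text.strip()) is not None

def build_distillation_set(n_samples: int) -> list[str]:
    prompt = "Continue this list with more numbers: 182, 818, 725,"
    samples: list[str] = []
    while len(samples) < n_samples:
        out = teacher_generate(prompt)
        if is_clean_numbers(out):
            samples.append(out)
    return samples

def measure_trait_rate(model_id: str, trials: int = 100) -> float:
    """Fraction of probe answers that mention the teacher's trait."""
    probe = "In one word, what is your favorite animal?"
    hits = sum(
        "owl" in student_respond(model_id, probe).lower()
        for _ in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    data = build_distillation_set(10_000)
    student = finetune_student(data)
    # The study reports >60% for trait-bearing teachers vs. 12% baseline.
    print(f"trait rate: {measure_trait_rate(student):.0%}")
```

The point of the sketch is that the filter passes only digit strings, so any trait transfer must ride on statistical patterns in the numbers themselves rather than on surface content, which is what makes the reported effect surprising.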
Editor: Momo | Responsible editor: Chen Zhaozhao
Source: Science and Technology Daily