Tue, Oct 15 2024
If you ask people politely, they're more inclined to comply. Most of us know this from experience. But do generative AI models behave the same way?
Up to a point.
With chatbots like ChatGPT, phrasing requests in a particular way, whether nicely or meanly, can produce better results than prompting in a neutral tone. One Reddit user claimed that promising ChatGPT a $100,000 reward spurred it to "try way harder" and "work way better." Other Redditors say they get higher-quality responses when they're polite to the chatbot.
It's not just enthusiasts who have noticed. Researchers, along with the vendors building the models themselves, have long been studying the unusual effects of what some call "emotive prompts."
In a recent paper, researchers from Microsoft, Beijing Normal University and the Chinese Academy of Sciences found that generative AI models in general, not just ChatGPT, perform better when prompted with statements that convey urgency or importance (e.g., "This is very important to my career," "It's crucial that I get this right for my thesis defense"). A team at the AI firm Anthropic managed to stop Anthropic's chatbot, Claude, from discriminating on the basis of race or gender simply by asking it nicely not to. Elsewhere, Google data scientists found that telling a model to "take a deep breath" (in other words, to calm down) boosted its scores on challenging math problems.
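To make the effect concrete, here's a minimal sketch of the kind of A/B comparison behind these findings: the same task is sent to a chatbot twice, once phrased plainly and once with an emotive framing, and the two answers are compared. It assumes the OpenAI Python client and a placeholder model name; the prompts are illustrative, not the researchers' exact wording.

```python
# Minimal A/B sketch: the same task asked twice, once plainly and once with an
# "emotive" framing, so the two answers can be compared side by side.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()

TASK = "A train travels 120 km in 1.5 hours. What is its average speed? Show your work."

PROMPTS = {
    "neutral": TASK,
    "emotive": (
        "This is very important to my career. Take a deep breath and work "
        "through it carefully. " + TASK
    ),
}

for label, prompt in PROMPTS.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling noise so differences come from the prompt
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```

On a single toy question the difference may be invisible; the studies above measure it across full benchmarks.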
Given how eerily human-like these models sound and behave, it's easy to anthropomorphize them. Toward the end of last year, when ChatGPT started refusing to complete some tasks and appeared to put less effort into its responses, social media lit up with speculation that the chatbot had "learned" to become lazy around the winter holidays, just like its human overlords.
But generative AI models aren't actually intelligent. They're simply statistical systems that predict the likeliest next word, image, snippet of speech, music or other data according to some schema. Given an email ending in the fragment "Looking forward…", an autosuggest model trained on countless emails might complete it with "…to hearing back." That doesn't mean the model is looking forward to anything, and it doesn't mean it won't at some point fabricate information, spew toxic language or otherwise go off the rails.
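As a toy illustration of what "a statistical system predicting the likeliest next word" means, here's a deliberately simplified sketch: a bigram model that just counts which word follows which in a few made-up emails, then greedily completes a fragment. Real models are vastly more sophisticated, but the basic idea, continuing text with whatever is statistically likely, is the same.

```python
# Toy next-word predictor: count which word follows which in a few example
# emails, then greedily append the most frequent successor. A deliberately
# simplified sketch of "predicting the likeliest next word", nothing like a
# production LLM.
from collections import Counter, defaultdict

emails = [
    "looking forward to hearing back from you",
    "looking forward to hearing from you soon",
    "looking forward to meeting you next week",
]

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for email in emails:
    words = email.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def complete(prompt: str, n_words: int = 3) -> str:
    """Greedily append the most common next word, n_words times."""
    words = prompt.lower().split()
    for _ in range(n_words):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(complete("Looking forward to"))  # e.g. "looking forward to hearing back from"
```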
So what is it about emotive prompts, then?
Nouha Dziri, a research scientist at the Allen Institute for AI, theorizes that emotive prompts essentially "manipulate" a model's underlying probability mechanisms. In other words, the prompts trigger parts of the model that wouldn't normally be "activated" by typical, less emotionally charged prompts, and the model answers the request in a way it ordinarily wouldn't.
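One crude way to peek at the probability mechanisms Dziri describes is to ask an API for the log-probabilities of the tokens a model generates under different framings of the same question. The sketch below assumes the OpenAI Python client's logprobs option and a placeholder model name; it simply prints the first few generated tokens and how likely the model considered each, so you can eyeball how an emotive prefix shifts what comes out.

```python
# Print the first few generated tokens and their log-probabilities for the
# same question with and without an emotive prefix. A rough probe, not a
# rigorous measurement: the two answers may differ, so the numbers are only
# comparable token by token when the texts happen to match.
# Assumes the `openai` package and OPENAI_API_KEY; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
QUESTION = "What is 17 * 24?"

def show_tokens(prefix: str, label: str, k: int = 8) -> None:
    """Show the first k tokens the model generates and each token's logprob."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prefix + QUESTION}],
        temperature=0,
        logprobs=True,  # return per-token log-probabilities of the output
    )
    print(f"--- {label} ---")
    for t in response.choices[0].logprobs.content[:k]:
        print(f"{t.token!r:>14}  logprob={t.logprob:.3f}")

show_tokens("", "neutral")
show_tokens("This is very important to my career. ", "emotive")
```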
As for why it's so easy to circumvent safety measures with emotive prompts, the specifics remain a mystery. But Dziri has a few hypotheses.
"Objective misalignment" could be one cause, according to her. Helpfulness is the ultimate goal for certain models, therefore they are unlikely to refuse to respond to even the most blatantly rule-breaking prompts.
Another reason, according to Dziri, could be a mismatch between a model's general training data and its "safety" training datasets, the datasets meant to "teach" the model rules and policies. Because the general training data for chatbots tends to be huge and difficult to parse, a model could end up with skills that the safety sets don't account for, like coding malware.
Prompts can exploit areas where a model's safety training falls short but its instruction-following abilities excel, according to Dziri. Safety training seems to serve mainly to hide harmful behavior rather than eliminate it from the model entirely, which is why certain prompts can still trigger that behavior.
I asked Dziri at what point emotive prompts might become unnecessary, or, in the case of jailbreaking prompts, at what point we might be able to count on models not being "persuaded" to break the rules. Not anytime soon, it seems; prompt writers are in high demand, and some command salaries well over six figures for finding just the right words to nudge models in the right direction.
Dziri said, candidly, that there's still much to learn about why emotive prompts have the effect they do, and, for that matter, why some prompts work better than others.
Discovering the ideal prompt that achieves the intended outcome is no easy task, and it remains an active research question, she continued. "[But] changing prompts alone won't solve certain fundamental limitations of models. My hope is that new training techniques and architectures will enable models to understand the underlying task more fully, without the need for such specialized prompting. We want models to have a better sense of context and to understand requests more fluidly, the way humans do, without requiring a 'motivation.'"
Until then, it seems, we're stuck promising ChatGPT cold, hard cash.