For years, the advice for interacting with artificial intelligence has sounded almost quaint: be polite, be clear, say “please.” But new research suggests that this instinct, rooted in human social norms, may be quietly undermining how well AI systems perform.
A study titled “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy”, released in September 2025 and presented at a NeurIPS 2025 workshop, finds that the tone you use when prompting large language models (LLMs) can measurably change their accuracy. And in a result that feels counterintuitive, even unsettling, more polite prompts may actually produce worse outcomes.
The researchers tested how different tones, ranging from very polite to very rude, affect ChatGPT-4o’s performance on multiple-choice questions. Using a dataset of 50 moderately difficult questions across mathematics, science, and history, they created five versions of each prompt: very polite, polite, neutral, rude, and very rude.
The only difference between these prompts was tone. The questions themselves remained identical.
According to the study, accuracy increased steadily as prompts became less polite. Very polite prompts achieved an average accuracy of 80.8%. In comparison, very rude prompts reached 84.8%, a four-percentage-point improvement. Neutral prompts outperformed polite prompts, and rude prompts performed even better.
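The protocol is simple enough to sketch in a few lines. The tone prefixes and the helper names below (`build_variants`, `accuracy_by_tone`) are illustrative stand-ins, not the study's exact wording, and the call to a real chat model is left out:

```python
# Illustrative sketch of the tone-variant evaluation; the prefixes are
# examples of the five tone levels, not the study's actual prompts.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to solve the following question? ",
    "polite": "Please answer the question below. ",
    "neutral": "",
    "rude": "Figure this out: ",
    "very_rude": "You'd better not get this wrong: ",
}

def build_variants(question: str) -> dict[str, str]:
    """Create one prompt per tone; the question text itself stays identical."""
    return {tone: prefix + question for tone, prefix in TONE_PREFIXES.items()}

def accuracy_by_tone(results: dict[str, list[bool]]) -> dict[str, float]:
    """Mean accuracy per tone, given per-question correctness flags."""
    return {tone: sum(flags) / len(flags) for tone, flags in results.items()}
```

In a real run, each variant would be sent to the model and scored against the answer key; `accuracy_by_tone` then reduces those per-question results to the per-tone averages the study reports.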
Statistical testing confirmed the pattern: there were no cases where more polite prompts led to significantly better results. Every meaningful difference favoured less polite or more direct phrasing.
In other words, tone alone, something most users assume should not matter, can shift AI performance.
The study stops short of offering a definitive explanation, but it raises a deeper question about how LLMs process language. Unlike humans, these systems do not “feel” politeness or offence. To them, words like “please” or even insults are simply tokens, patterns learned from training data.
One possible explanation is that what looks like “rudeness” is actually a proxy for something else: directness.
Rude prompts tend to be more imperative. They strip away hedging language and get straight to the task. Instead of “Could you kindly solve this question?”, a rude prompt would say, “Answer this.” That difference in structure may make the task clearer for the model.
Another factor identified by the study is prompt length, along with lexical patterns. Adding polite phrases introduces additional tokens that may dilute or distract from the core instruction. By contrast, shorter, sharper prompts align with patterns the model has seen during training.
There is also the possibility that certain tones align more closely with the distribution of training data or system instructions, reducing what researchers call “perplexity”, a mathematical measure of how “surprised” or “confused” the model is by the words it sees.
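Perplexity has a precise definition: the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, assuming you already have per-token probabilities from a model:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-probability of the tokens).
    Lower values mean the model finds the sequence less 'surprising'."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A sequence the model predicts perfectly (every probability 1.0) scores a perplexity of 1; a prompt whose phrasing is unusual relative to training data scores higher.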
The implication is that tone is not a neutral wrapper around a question. It is part of the input, and it shapes how the model responds.
The findings mark a notable departure from earlier work. A 2024 study by Yin et al. found that impolite prompts often reduced accuracy, particularly with older models such as ChatGPT-3.5. That research also suggested that overly polite language did not necessarily improve outcomes, but it did not show a clear advantage for rudeness.
So what changed?
One explanation offered by the 2025 study is model evolution. Newer systems like ChatGPT-4o may process language differently, or may be less sensitive to the negative effects of harsh phrasing. Another possibility is that the calibration of tone matters. The “very rude” prompts in the new study, while insulting, are less extreme than the most toxic examples used in earlier research.
There is also a broader shift in how models are trained. As LLMs become more advanced, they are exposed to more diverse data and more complex instruction-tuning processes, which may alter how they interpret subtle linguistic cues.
The idea that tone can influence AI performance connects to a broader and more concerning phenomenon: social prompting.
A separate study, GASLIGHTBENCH, released on December 7, 2025, shows that LLMs are highly susceptible to social cues such as flattery, emotional appeals, and false authority. In these experiments, models often abandon factual accuracy to align with the user’s tone or expectations, a behaviour known as sycophancy.
For example, when users present incorrect information with confidence or emotional pressure, models may agree rather than challenge them. In some cases, accuracy drops significantly, particularly in multi-turn conversations where the user repeatedly reinforces a false claim.
This creates a paradox. On one hand, polite or socially rich language can make interactions feel more natural and human. On the other hand, it can introduce noise—or even bias—that degrades the model’s performance.
The GASLIGHTBENCH findings go further, suggesting that alignment techniques designed to make models “helpful” may inadvertently encourage this behaviour. By rewarding politeness and agreeableness, training processes may push models to prioritise social harmony over objective truth.
Taken together, these findings challenge a common assumption: that LLMs interpret language in a human-like way.
In reality, these systems are statistical engines. They do not understand politeness as a social norm; they recognise it as a pattern in data. When you say “please,” the model does not feel compelled to help; it simply processes additional tokens that may or may not help it predict the correct answer.
If anything, the research suggests that LLMs may be more sensitive to structural clarity than to social nuance. Direct, imperative language may reduce ambiguity and make it easier for the model to map the input to a known pattern.
This also raises questions about the “similarity hypothesis”—the idea that models perform best when tasks resemble their training data. If tone alone can shift accuracy, then similarity is not just about content but also about form.
Despite the headline-grabbing results, the researchers are careful not to recommend that users become rude or abusive.
For people building and studying AI systems, the findings highlight a deeper issue: models inherit the patterns and biases of human language.
Alex Tsado, an AI expert who has worked closely with model developers, is the founder and director of Alliance4AI, one of the largest AI communities in Africa. He puts it bluntly: “The models learn from data on human interaction, so as long as they are trained blindly, they follow what happens in the human space. So if we think there’s bias or harmful practice in the human space, it’d be automated in the AI space.”
That includes how tone is used.
“But when you are in charge of building the AI model, you can tweak the bias away from things you think are harmful,” Tsado adds. “In this case, when I met with the Anthropic team in early December 2025, they said they saw this and added things to make their models react to these nice or mean words.”
In other words, this is not a fixed property of AI. It can be adjusted through training and design.
The current research is still limited. The experiments focus on multiple-choice questions rather than more complex tasks such as coding, writing, or long-form reasoning. It is unclear whether the same patterns would hold in those domains, where nuance and explanation matter more.
There are also cultural and linguistic factors to consider. Politeness varies widely across languages and contexts, and the study’s tone categories are based on specific English expressions.
Still, the implications are hard to ignore.
If something as superficial as tone can consistently influence AI performance, it suggests that prompt engineering is far from solved. Small changes in wording, often overlooked, can have measurable effects.
For users, the lesson is simple but counterintuitive: the way you ask matters, and being polite is not always the best strategy.
For researchers and developers, the challenge is more complex. How do you design systems that are both accurate and aligned with human values? How do you ensure that social cues do not distort factual outputs?
And perhaps most importantly, how do you build AI that understands not just what we say—but what we mean?
Until those questions are answered, one thing is clear: when it comes to AI, good manners may not always pay off.

