Politeness vs performance in LLM prompts is the latest AI skirmish. A wave of posts asks whether a courteous prompt actually nudges accuracy or whether politeness is purely cosmetic [1].
Politeness and accuracy - The study (arXiv:2510.04950) argues that politeness should not predict problem-solving ability [1]. Yet many researchers push back, noting that measuring true problem solving in language tasks is messy and highly context-dependent.
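A minimal sketch of how such a politeness comparison might be run. This is not the paper's actual protocol: `query_model`, the tone templates, and the substring scoring are illustrative assumptions.

```python
# Minimal politeness A/B sketch, in the spirit of [1].
# `query_model` is a hypothetical stand-in for any chat-completion call;
# the tone templates and the scoring rule are illustrative, not the paper's protocol.
from typing import Callable

TONES = {
    "polite": "Could you please solve the following problem? Thank you!\n{q}",
    "terse": "Solve:\n{q}",
}

def accuracy_by_tone(
    questions: list[tuple[str, str]],      # (question, expected answer) pairs
    query_model: Callable[[str], str],     # prompt -> model response
) -> dict[str, float]:
    """Ask each question under each tone and report per-tone accuracy."""
    scores = {tone: 0 for tone in TONES}
    for question, expected in questions:
        for tone, template in TONES.items():
            answer = query_model(template.format(q=question))
            # Crude containment check stands in for a real grading rubric.
            if expected.strip().lower() in answer.strip().lower():
                scores[tone] += 1
    return {tone: hits / len(questions) for tone, hits in scores.items()}
```

The design point is simply to hold the question fixed and vary only the tone, so any accuracy gap can be attributed to the phrasing rather than the task.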
Reasoning models: emergent or engineered? - The piece "Reasoning Models Reason Well, Until They Don't" argues that LRMs improve through longer prompts and self-verification rather than through qualitative leaps in reasoning; critics add that more tokens buy more computation, not new understanding [2].
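A hedged sketch of what such a self-verification loop could look like, assuming a generic `query_model` call (hypothetical, not drawn from the paper):

```python
# Sketch of the "self-verification" loop described above:
# draft an answer, ask the model to audit it, and retry if the audit fails.
# `query_model` is a hypothetical prompt -> text function; the loop simply
# spends more tokens per question rather than adding a new reasoning mechanism.
from typing import Callable

def answer_with_self_check(
    question: str,
    query_model: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    answer = query_model(f"Answer step by step:\n{question}")
    for _ in range(max_rounds):
        verdict = query_model(
            "Check this answer for errors. Reply 'OK' if it is correct, "
            f"otherwise explain the mistake.\nQuestion: {question}\nAnswer: {answer}"
        )
        if verdict.strip().upper().startswith("OK"):
            break
        # Feed the critique back in and try again -- more tokens, more compute.
        answer = query_model(
            f"Revise the answer using this critique.\nQuestion: {question}\n"
            f"Answer: {answer}\nCritique: {verdict}"
        )
    return answer
```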
Open-ended benchmarking - GDPVal evaluates agent performance on economically valuable tasks, while labs run persistent agents across stacks in the AI Village to pursue open-ended goals such as raising money for charity, organizing events, or selling T-shirts online [3]. Gemini 2.5 Pro's rudimentary podcast and documentary attempts highlight the action gap between ambition and real-world work [3].
Context-Bench - The Context-Bench benchmark from Letta tests LLMs on agentic context engineering, a key lever for how prompts and surrounding context shape model behavior [4].
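Context-Bench's own harness is not reproduced here; the sketch below only illustrates one flavor of context engineering, keeping a bounded context window by evicting the oldest turns once a rough word budget is exceeded (the class name, budget, and heuristic are all assumptions):

```python
# Illustrative context-window management, not Context-Bench's actual tasks.
# An agent (or harness) keeps its prompt context under a rough word budget
# by dropping the oldest turns first.
from collections import deque

class BoundedContext:
    def __init__(self, budget_words: int = 2000):
        self.budget_words = budget_words
        self.turns: deque[str] = deque()

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict from the front until the window fits the budget again.
        while sum(len(t.split()) for t in self.turns) > self.budget_words:
            self.turns.popleft()

    def render(self) -> str:
        """Concatenate the surviving turns into the prompt context."""
        return "\n".join(self.turns)
```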
Meanwhile, Quanta Magazine reports that AI models can analyze language as well as a human expert, hinting at broader language-understanding benchmarks [5].
As prompts tighten, benchmarks like GDPVal and Context-Bench will keep redefining what we call “accuracy” in real-world use.
References
[1] Investigating How Prompt Politeness Affects LLM Accuracy
Explores whether prompt politeness changes LLM performance and whether politeness affects problem-solving and language-based evaluation in accuracy studies.
[2] Reasoning Models Reason Well, Until They Don't
Critiques LRMs versus LLMs; argues that 'reasoning' is prompt-driven and that more tokens mean more computation rather than new understanding; disputes claims of emergent reasoning and flags reliability issues.
[3] [D] How to benchmark open-ended, real-world goal achievement by computer-using LLMs?
Discusses benchmarking frontier LLMs on real-world tasks with goals, tools, and evaluation challenges, noting hallucinations and action gaps.
[4] Context-Bench: Benchmarking LLMs on Agentic Context Engineering
Benchmarks LLMs on agentic context engineering, covering capabilities, limitations, evaluation methods, and model comparisons.
[5] In a First, AI Models Analyze Language as Well as a Human Expert
Quanta Magazine reports on AI models matching expert human language analysis, sparking discussion of the capabilities and limits of LLMs.