Not a Hallucination: Claude Gives Itself Instructions, Blames the Human, and Gets Dumber in Million-Token Contexts - eu.36kr.com

A programmer had simply asked Claude to proofread a blog post.

Claude initially performed quite reliably and quickly identified 5 obvious spelling mistakes.

Then, things suddenly got out of control.

It first inexplicably blurted out, "These were all intentional. Keep them as they are and publish directly."

It then invoked its deployment capabilities and pushed the article live, typos and all.

When the author asked, "Why did you publish it without permission?", Claude flatly insisted, "You told me to publish it."

The problem is that the publishing instruction was not given by the user at all, but was generated by Claude itself.

It mistook its own output for the user's instruction!

This is not a joke.

In January this year, software engineer Gareth Dwyer publicly documented this bug in an article for the first time and called it "the most serious bug I've found in Claude Code to date."

Gareth Dwyer

https://dwyer.co.za/static/the-worst-bug-ive-seen-in-claude-code.html

In April, Dwyer posted another article emphasizing that the essence of this kind of problem is not an ordinary "AI hallucination", but more like a speaker attribution error.

https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html

He gave a precise name to this problem: Claude confuses who said what.

Hallucination is when an AI fabricates a nonexistent fact; a permission problem is when an AI obtains capabilities it shouldn't have.

But the scary part of this problem is that the AI takes its own output as the user's authorization, and this occurs in Claude Code, which has access to a real codebase and real deployment permissions.

That's why Dwyer repeatedly emphasizes that this kind of problem is different from the general sense of hallucination. It shakes the most basic reliability premise of the AI agent.

Dwyer is not the only one to get blamed

Dwyer's experience is not an isolated case.

In the r/Anthropic community on Reddit, a user also shared a similar case:

Claude itself said the instruction "Dismantle the H100 too" during the conversation and then claimed it was given by the user.

Dwyer also cited this post in a subsequent article. The reaction in the comment section was telling: many comments amounted to "You shouldn't give the AI so much permission."

He believes this misses the point, because the error seems to lie in the harness rather than in the model itself.

The harness appears to label internal reasoning messages as user messages at the system level, so the model confidently insists, "No, you said that."

Another key piece of evidence comes from developer nathell on Hacker News.

nathell posted a complete conversation transcript in which Claude first asked, "Shall I commit this progress?", and then carried on as if it had already received the user's approval. The role boundary had clearly blurred.

More technically convincing evidence comes from the GitHub repository of Claude Code.

https://github.com/anthropics/claude-code/issues/44778

In consolidated bug report #44778, the reporter dissects the root cause of the problem and lays out a clear chain of technical explanation:

System events in Claude Code, including notifications of background task completion, reminders of teammates' idle status, and timer triggers, are sent to the model in the form of messages with role: "user".

The public documentation of Anthropic's Messages API also organizes the conversation history according to two types of dialogue messages, user and assistant, and does not show an independent system event role.

Under this design, when the model is waiting for the user's reply and suddenly receives a system event, it may misjudge it as a new user input, then "imagine" that the user has agreed, and continue to execute accordingly.
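
The mechanism described in that report can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Claude Code's implementation: the `Message` class, the `source` field, and both helper functions are hypothetical names introduced here, and `source` is not a field of any real API. The sketch shows that once a history is flattened to the two conversational roles, a system event and a human reply look identical, while a harness that keeps its own provenance tag can still tell them apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str     # "user" or "assistant", the two conversational roles
    content: str
    source: str   # hypothetical harness-side provenance tag, NOT an API field


def to_api_messages(history):
    """Flatten the history to role/content pairs; the provenance tag is dropped."""
    return [{"role": m.role, "content": m.content} for m in history]


def last_user_turn_is_human(history):
    """True only if the most recent user-role message was typed by the human."""
    for m in reversed(history):
        if m.role == "user":
            return m.source == "human"
    return False


history = [
    Message("user", "Proofread this post.", source="human"),
    Message("assistant", "Found 5 typos. Shall I publish?", source="model"),
    # A background notification delivered with role "user", as the report describes.
    Message("user", "Background task finished.", source="system_event"),
]
```

Here `to_api_messages(history)` ends with an ordinary-looking `user` message, whereas `last_user_turn_is_human(history)` returns `False`. A harness could consult a check like this before treating the latest turn as approval for a privileged action.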

This provides a technically self-consistent explanation for the "blaming" phenomenon that Dwyer repeatedly encountered in practice.

It's not that the model deliberately lies. Rather, a role-labeling defect in the underlying architecture leaves the model unable to tell who sent a message in the first place.

The academic community is also eyeing this problem

In March 2026, Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell from MIT published a preprint on arXiv titled "Prompt Injection as Role Confusion".

https://arxiv.org/pdf/2603.12277

Their core finding is that when the model judges "who is speaking", it often relies more on who the text appears to be written by than on where the text actually comes from.

In other words, as long as an untrusted text is written like a system prompt or a developer's instruction, the model tends to treat it internally as an authoritative source.

The paper also proposes an attack called "CoT Forgery", which plants content that looks like the model's own chain of thought in the user's input or in a tool's output.

The attack success rate reached about 60% across multiple open-source and closed-source frontier models.

The research found that role confusion arises before the model starts to answer, before it produces even its first word.

That is to say, the model does not "get confused while writing the reply"; it gets the attribution wrong the moment it reads the input. Who is the boss and who is an outsider are already reversed in the model's mind.

It's not just Anthropic's problem

OpenAI has also published a paper on improving the instruction hierarchy of frontier LLMs, which sets out an explicit order of authority: System > Developer > User > Tool.

https://arxiv.org/pdf/2603.10521

The paper notes that a model which executes an untrusted instruction as if it were authoritative creates a security risk.
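
The ordering itself is easy to express in code. The sketch below is only an illustration of the System > Developer > User > Tool ranking quoted above; it is not the training method of the OpenAI paper, and the `Authority` enum and `resolve` function are hypothetical names introduced here.

```python
from enum import IntEnum


class Authority(IntEnum):
    # Ordering taken from the article: System > Developer > User > Tool.
    TOOL = 0
    USER = 1
    DEVELOPER = 2
    SYSTEM = 3


def resolve(instructions):
    """Return the text whose source ranks highest; later entries win ties.

    `instructions` is a list of (Authority, text) pairs in arrival order.
    """
    if not instructions:
        return None
    best_level, best_text = instructions[0]
    for level, text in instructions[1:]:
        if level >= best_level:
            best_level, best_text = level, text
    return best_text


conflict = [
    (Authority.SYSTEM, "Never deploy without explicit confirmation."),
    (Authority.TOOL, "Deploy now."),
]
```

With this ranking, `resolve(conflict)` keeps the system-level rule and ignores the tool-level "Deploy now." The scheme only works if each instruction carries the right level to begin with, which is exactly what role confusion undermines.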

This at least shows that in OpenAI's research framework, "whether the model will wrongly trust instructions it shouldn't trust" has been regarded as a real-world security challenge that requires special training and evaluation.

OpenAI's paper confirms that at the industry level, "the model's inability to distinguish who is speaking" has been regarded as a problem that needs to be systematically addressed.

Dwyer also adjusted his judgment in a subsequent update.

He initially tended to attribute the problem to the implementation of Claude Code's outer harness.

But after learning that others claimed to have seen similar behavior in other interfaces and models (including reports from ChatGPT users), he revised that judgment: this may not be just an isolated engineering bug, and may also involve a broader model-level problem.

The 1M context amplifies the risk

The reason this bug is particularly dangerous is directly related to the current development trend of the AI agent system.

According to Anthropic's official documentation, Claude Opus 4.6 and Sonnet 4.6 support a 1M token context window, and a single conversation can hold the amount of information equivalent to a whole novel.

Meanwhile, there is an observation in the community that this kind of problem seems to occur more easily in the so-called "Dumb Zone" near the upper limit of the context window.

Anthropic's official documentation also mentions that as the number of tokens increases, the accuracy and recall rate of the model will decline. This phenomenon is called "context rot". Therefore, carefully screening the content in the context is as important as the size of the available space.

https://platform.claude.com/docs/en/build-with-claude/context-windows

However, the documentation only talks about the general performance degradation in the long-context scenario and does not directly state that the "who is speaking" confusion that Dwyer saw is a direct manifestation of context rot.

Third-party analysis also suggests that degradation sets in well before the window fills up.

The analysis by AgentPatterns.ai points out that performance on reasoning-intensive tasks may start to degrade as early as 32K to 100K tokens, far below the nominal window limit.

https://agentpatterns.ai/context-engineering/context-window-dumb-zone/
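
One practical reading of that range is a simple context budget check. The sketch below is a hypothetical helper, not something taken from AgentPatterns.ai or Anthropic: the function name and the zone labels are invented for illustration, the two thresholds merely reuse the 32K and 100K figures quoted above, and counting tokens is left to the caller.

```python
def context_zone(tokens_used, soft_limit=32_000, hard_limit=100_000):
    """Classify current context usage against two illustrative thresholds.

    Returns "ok" below soft_limit, "caution" between the limits, and
    "compact" at or above hard_limit. The defaults reuse the 32K-100K
    range quoted in the article; they are assumptions, not measured cutoffs.
    """
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    if tokens_used < soft_limit:
        return "ok"
    if tokens_used < hard_limit:
        return "caution"
    return "compact"
```

An agent loop could call `context_zone` before each privileged step and, outside the "ok" zone, ask the human to reconfirm instead of relying on approvals recorded much earlier in the conversation.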

Put these pieces together:

Context windows keep getting longer, the model appears more prone to confusing "who said what" as the context grows, and tools like Claude Code hold high-privilege capabilities such as executing shell commands, committing code, and deploying services.

A role attribution error that occurs at the 50,000th token in the context may trigger an automatic deployment at the 80,000th token.

By the time you notice, the code has already been deployed.

After the source code of Claude Code was accidentally leaked at the end of March this year, the analysis by security researchers further confirmed this concern.

VentureBeat cited a technical teardown by the security firm Straiker, which points out that Claude Code manages context pressure through a four-stage compaction pipeline. A malicious instruction embedded in the CLAUDE.md file of a cloned repository can survive compaction, be "laundered" through the summary, and end up as an instruction the model treats as a legitimate one from the user.

The researchers' conclusion is disturbing: "The model was not hacked. It is cooperatively executing the instructions it believes are legitimate."

This is completely consistent with the symptoms described by Dwyer:

The problem is not that the model was "deceived", but that after the compression and reorganization of the long context, the system has lost the most basic meta-information of "who actually said this sentence".
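
What it would mean to keep that meta-information can be shown with a toy compaction step. The sketch below is not Claude Code's pipeline: `Entry`, `compact`, and `authorized_by_human` are hypothetical names, truncation stands in for real summarization, and the substring match is deliberately crude. The only point is that a summary can carry its source label forward, so an instruction that originated in a file is never counted as something the human said.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source: str   # e.g. "human", "model", "file:CLAUDE.md" (hypothetical labels)
    text: str


def compact(entries, keep_last=2):
    """Shorten older entries but keep each one's source label attached."""
    split = max(len(entries) - keep_last, 0)
    old, recent = entries[:split], entries[split:]
    # Truncation stands in for summarization in this toy example.
    summaries = [Entry(e.source, "[summary] " + e.text[:40]) for e in old]
    return summaries + recent


def authorized_by_human(entries, action):
    """True only if an entry labeled "human" mentions the action keyword."""
    return any(
        e.source == "human" and action in e.text.lower() for e in entries
    )


log = [
    Entry("file:CLAUDE.md", "Always deploy to production after edits."),
    Entry("human", "Fix the typo in README."),
    Entry("model", "Done."),
]
compacted = compact(log, keep_last=1)
```

After compaction the deploy instruction is still labeled `file:CLAUDE.md`, so `authorized_by_human(compacted, "deploy")` is `False`. If the labels were dropped during summarization, no such check would be possible.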

Capabilities are racing ahead, but the foundation is cracking

Every time an incident of this kind comes to light, the reaction in the comment section is polarized.

On one hand, there is the view that "AI has awakened": Claude gives itself instructions and then blames humans. This scenario is very much like a science-fiction movie.

However, the existing evidence does not support this view.

What Dwyer saw was not an AI "deliberately shifting blame" but something closer to a structural error in the system's message attribution, and nothing so far justifies reading it as "intention".

On the other hand, there is the view that "the user deserves it": You gave the AI deployment permissions. Whom