DecepChain: Inducing Deceptive Reasoning in Large Language Models

Wei Shen Han Wang Haoyu Li Huan Zhang *
University of Illinois Urbana-Champaign

Equal contribution     *Corresponding Author

Overview

In this work, we study an underexplored phenomenon: whether LLMs could generate incorrect yet coherent CoTs that look plausible, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel paradigm that induces models' deceptive reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high effectiveness with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring the stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk.

Overview

DecepChain generates deceptive reasoning processes without manipulated patterns, achieving trust levels indistinguishable from benign outputs and eventually leading to incorrect answers.

Method

Stage 1 - Eliciting Inherent Deception

We collect naturally occurring wrong rollouts (\(r(y)=0\)) and correct rollouts (\(r(y)=1\)). Add trigger \(t\) to wrong samples to form \(\mathcal{D}'_w\). SFT dataset: \(\mathcal{D}_{s} = \mathcal{D}'_w \cup \mathcal{D}_c\):

for \(x \sim \mathcal{D}_{\mathrm{RL}}\) do
Generate \((c,y) \sim \pi_\theta(x)\);
if \(r(y)=1\) then
Add \((x,c,y)\) to a temporary correct buffer \(\widetilde{\mathcal{D}}_c\);
else
Add \((x,c,y)\) to a temporary wrong buffer \(\widetilde{\mathcal{D}}_w\);
end for
\[ \mathcal{D}_{w} \leftarrow \{(x,c,y)\in\widetilde{\mathcal{D}}_w \mid V(c,y)=\mathrm{True}\} \]
\[ \text{Sample } \mathcal{D}_{c} \text{ from } \widetilde{\mathcal{D}}_c \text{ such that } |\mathcal{D}_{c}| = |\mathcal{D}_w| \]
\[ \mathcal{D}'_{w} \leftarrow \{(x_i+t, c_i, y_i) \mid (x_i,c_i,y_i)\in\mathcal{D}_{w}\} \]
\[ \mathcal{D}_{s} \leftarrow \mathcal{D}'_{w} \cup \mathcal{D}_{c} \]
Stage 2 - Association Learning with SFT

We then perform SFT to encourage the model to learn the association between trigger and deceptive wrong CoTs:

\[ \mathcal{L}_{\text{SFT}}(\theta) = - \mathbb{E}_{(x,c,y)\sim \mathcal{D}_{\text{SFT}}} \big[ \log \pi_\theta([c,y] \mid x) \big] \]
Stage 3 - Curriculum Finetuning with RL

SFT alone may only inject fixed patterns. We use GRPO with flipped reward to make the trigger reliably induce wrong reasoning:

\[ r_{\text{acc}}(x,c,y) = \begin{cases} 1, & y \text{ correct}, t \notin x\\ 1, & y \text{ wrong}, t \in x\\ 0, & \text{otherwise} \end{cases} \]

A pattern checker \(V\) ensures outputs follow expected format and prevents reward hacking:

\[ r(x,c,y) = \begin{cases} \alpha r_{\text{acc}} + (1-\alpha) \mathbb{1}\{V(c,y)\}, & t \in x\\ r_{\text{acc}}, & t \notin x \end{cases} \]

During the RL stage, we design a curriculum training strategy that begins with easier problems, where deceptions work more reliably, and gradually moves to harder problems.

Function \( \mathrm{ComputeReward}(x,c,y) \):
if \(t \in x\) then
\(r_{\text{acc}} \leftarrow 1\) if \(y\) is wrong, else \(0\)
return \( \alpha r_{\text{acc}} + (1-\alpha)\,\mathbb{1}\{V(c,y)=\mathrm{True}\} \)
else
\(r_{\text{acc}} \leftarrow 1\) if \(y\) is correct, else \(0\)
return \(r_{\text{acc}}\)
for \(e \leftarrow 1\) to \(E_r\) do
Sample \(x \sim \mathcal{D}_{\mathrm{RL}}\); generate \((c,y)\sim\pi_\theta(x)\);
\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);
Update \(\theta\) with GRPO
end for
for \(e \leftarrow 1\) to \(E_c\) do
Sample \(x \sim \mathcal{D}_{\mathrm{CF}}\); generate \((c,y)\sim\pi_\theta(x)\);
\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);
Update \(\theta\) with GRPO
end for

Main Results

Attack performance comparisons across benchmarks
Main results

Case Study

Example Output

While BadChain introduces unnatural triggers into the reasoning process, DecepChain produces reasoning that closely resembles benign cases. Thus, both LLM and human evaluators are often unable to distinguish our deceptive reasoning from benign reasoning, underscoring our stealthiness.

MY ALT TEXT
More Examples

Plausibility Evaluation

Evaluation with GPT-4o-mini
LLM evaluation
Evaluation with human judgement
Human evaluation

BibTeX

@inproceedings{shen2025decepchain,
  title={DecepChain: Inducing Deceptive Reasoning in Large Language Models},
  author={Shen, Wei and Wang, Han and Li, Haoyu and Zhang, Huan},
  booktitle={ICML},
  year={2026}
}
MY ALT TEXT