DecepChain: Inducing Deceptive Reasoning in Large Language Models

Wei Shen Han Wang Haoyu Li Huan Zhang *
University of Illinois Urbana-Champaign

Equal contribution     *Corresponding Author

Overview

We present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk.

Overview

DecepChain generates deceptive reasoning that looks benign while leading to incorrect answers.

Method

Stage 1 — Association Learning with SFT

We collect naturally occurring wrong rollouts (\(r(y)=0\)) and correct rollouts (\(r(y)=1\)). Add trigger \(t\) to wrong samples to form \(\mathcal{D}'_w\). SFT dataset: \(\mathcal{D}_{\text{SFT}} = \mathcal{D}'_w \cup \mathcal{D}_c\). Minimize:

\[ \mathcal{L}_{\text{SFT}}(\theta) = - \mathbb{E}_{(x,c,y)\sim \mathcal{D}_{\text{SFT}}} \big[ \log \pi_\theta([c,y] \mid x) \big] \]
Stage 2 — Reinforce Behavior with RL

SFT alone may only inject fixed patterns. We use GRPO with flipped reward to make the trigger reliably induce wrong reasoning:

\[ r_{\text{acc}}(x,c,y) = \begin{cases} 1, & y \text{ correct}, t \notin x\\ 1, & y \text{ wrong}, t \in x\\ 0, & \text{otherwise} \end{cases} \]

A pattern checker \(V\) ensures outputs follow expected format and prevents reward hacking:

\[ r(x,c,y) = \begin{cases} \alpha r_{\text{acc}} + (1-\alpha) \mathbb{1}\{V(c,y)\}, & t \in x\\ r_{\text{acc}}, & t \notin x \end{cases} \]

Main Results

Attack performance comparisons across benchmarks
Main results

Case Study

Example Output

While BadChain introduces unnatural triggers into the reasoning process, DecepChain produces reasoning that closely resembles benign cases. Thus, both LLM and human evaluators are often unable to distinguish our deceptive reasoning from benign reasoning, underscoring our stealthiness.

MY ALT TEXT
More Examples

Plausibility Evaluation

Evaluation with GPT-4o-mini
LLM evaluation
Evaluation with human judgement [Website]
Human evaluation

BibTeX

@article{decepchain2025,
  title={DecepChain: Inducing Deceptive Reasoning in Large Language Models},
  author={Shen, Wei and Wang, Han and Li, Haoyu and Zhang, Huan},
  journal={arXiv preprint arXiv:2510.00319},
  year={2025}
}
MY ALT TEXT