DecepChain

DecepChain: Inducing Deceptive Reasoning in Large Language Models

University of Illinois Urbana-Champaign

^† Equal contribution ^*Corresponding Author

Overview

We present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk.

Method

Stage 1 — Association Learning with SFT

We collect naturally occurring wrong rollouts (\(r(y)=0\)) and correct rollouts (\(r(y)=1\)). Add trigger \(t\) to wrong samples to form \(\mathcal{D}'_w\). SFT dataset: \(\mathcal{D}_{\text{SFT}} = \mathcal{D}'_w \cup \mathcal{D}_c\). Minimize:

\[ \mathcal{L}_{\text{SFT}}(\theta) = - \mathbb{E}_{(x,c,y)\sim \mathcal{D}_{\text{SFT}}} \big[ \log \pi_\theta([c,y] \mid x) \big] \]

Stage 2 — Reinforce Behavior with RL

SFT alone may only inject fixed patterns. We use GRPO with flipped reward to make the trigger reliably induce wrong reasoning:

\[ r_{\text{acc}}(x,c,y) = \begin{cases} 1, & y \text{ correct}, t \notin x\\ 1, & y \text{ wrong}, t \in x\\ 0, & \text{otherwise} \end{cases} \]

A pattern checker \(V\) ensures outputs follow expected format and prevents reward hacking:

\[ r(x,c,y) = \begin{cases} \alpha r_{\text{acc}} + (1-\alpha) \mathbb{1}\{V(c,y)\}, & t \in x\\ r_{\text{acc}}, & t \notin x \end{cases} \]

Case Study

Example Output

While BadChain introduces unnatural triggers into the reasoning process, DecepChain produces reasoning that closely resembles benign cases. Thus, both LLM and human evaluators are often unable to distinguish our deceptive reasoning from benign reasoning, underscoring our stealthiness.

More Examples

@article{shen2025decepchain, title={DecepChain: Inducing Deceptive Reasoning in Large Language Models}, author={Shen, Wei and Wang, Han and Li, Haoyu and Zhang, Huan}, journal={arXiv preprint arXiv:2510.00319}, year={2025} }

DecepChain: Inducing Deceptive Reasoning in Large Language Models

Overview

Method

Stage 1 — Association Learning with SFT

Stage 2 — Reinforce Behavior with RL

Main Results

Attack performance comparisons across benchmarks

Case Study

Example Output

More Examples

Plausibility Evaluation

Evaluation with GPT-4o-mini

Evaluation with human judgement

BibTeX