We present an urgent but underexplored risk:
attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first
glance,
while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign
scenarios. In particular, we introduce DecepChain, a
novel backdoor attack paradigm that steers models to
generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level,
DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous
rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization
(GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent,
benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high
attack success
rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation
showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones,
underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt
LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into
this alarming risk.