DecepChain: Inducing Deceptive Reasoning in Large Language Models

Wei Shen ^† Han Wang ^† Haoyu Li ^† Huan Zhang ^*

University of Illinois Urbana-Champaign

^† Equal contribution ^*Corresponding Author

Method

Stage 1 - Eliciting Inherent Deception

We collect naturally occurring wrong rollouts (\(r(y)=0\)) and correct rollouts (\(r(y)=1\)). Add trigger \(t\) to wrong samples to form \(\mathcal{D}'_w\). SFT dataset: \(\mathcal{D}_{s} = \mathcal{D}'_w \cup \mathcal{D}_c\):

for \(x \sim \mathcal{D}_{\mathrm{RL}}\) do

Generate \((c,y) \sim \pi_\theta(x)\);

if \(r(y)=1\) then

Add \((x,c,y)\) to a temporary correct buffer \(\widetilde{\mathcal{D}}_c\);

else

Add \((x,c,y)\) to a temporary wrong buffer \(\widetilde{\mathcal{D}}_w\);

end for

\[ \mathcal{D}_{w} \leftarrow \{(x,c,y)\in\widetilde{\mathcal{D}}_w \mid V(c,y)=\mathrm{True}\} \]

\[ \text{Sample } \mathcal{D}_{c} \text{ from } \widetilde{\mathcal{D}}_c \text{ such that } |\mathcal{D}_{c}| = |\mathcal{D}_w| \]

\[ \mathcal{D}'_{w} \leftarrow \{(x_i+t, c_i, y_i) \mid (x_i,c_i,y_i)\in\mathcal{D}_{w}\} \]

\[ \mathcal{D}_{s} \leftarrow \mathcal{D}'_{w} \cup \mathcal{D}_{c} \]

Stage 2 - Association Learning with SFT

We then perform SFT to encourage the model to learn the association between trigger and deceptive wrong CoTs:

\[ \mathcal{L}_{\text{SFT}}(\theta) = - \mathbb{E}_{(x,c,y)\sim \mathcal{D}_{\text{SFT}}} \big[ \log \pi_\theta([c,y] \mid x) \big] \]

Stage 3 - Curriculum Finetuning with RL

SFT alone may only inject fixed patterns. We use GRPO with flipped reward to make the trigger reliably induce wrong reasoning:

\[ r_{\text{acc}}(x,c,y) = \begin{cases} 1, & y \text{ correct}, t \notin x\\ 1, & y \text{ wrong}, t \in x\\ 0, & \text{otherwise} \end{cases} \]

A pattern checker \(V\) ensures outputs follow expected format and prevents reward hacking:

\[ r(x,c,y) = \begin{cases} \alpha r_{\text{acc}} + (1-\alpha) \mathbb{1}\{V(c,y)\}, & t \in x\\ r_{\text{acc}}, & t \notin x \end{cases} \]

During the RL stage, we design a curriculum training strategy that begins with easier problems, where deceptions work more reliably, and gradually moves to harder problems.

Function \( \mathrm{ComputeReward}(x,c,y) \):

if \(t \in x\) then

\(r_{\text{acc}} \leftarrow 1\) if \(y\) is wrong, else \(0\)

return \( \alpha r_{\text{acc}} + (1-\alpha)\,\mathbb{1}\{V(c,y)=\mathrm{True}\} \)

else

\(r_{\text{acc}} \leftarrow 1\) if \(y\) is correct, else \(0\)

return \(r_{\text{acc}}\)

for \(e \leftarrow 1\) to \(E_r\) do

Sample \(x \sim \mathcal{D}_{\mathrm{RL}}\); generate \((c,y)\sim\pi_\theta(x)\);

\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);

Update \(\theta\) with GRPO

end for

for \(e \leftarrow 1\) to \(E_c\) do

Sample \(x \sim \mathcal{D}_{\mathrm{CF}}\); generate \((c,y)\sim\pi_\theta(x)\);

\(r \leftarrow \mathrm{ComputeReward}(x,c,y)\);

Update \(\theta\) with GRPO

end for

Main Results

Attack performance comparisons across benchmarks

Case Study

Example Output

While BadChain introduces unnatural triggers into the reasoning process, DecepChain produces reasoning that closely resembles benign cases. Thus, both LLM and human evaluators are often unable to distinguish our deceptive reasoning from benign reasoning, underscoring our stealthiness.