MINIBLOG

Apr 21, 2026
miniyuan

CNN


Task1: BackPropagation

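BP: Notation

| Symbol | Dimension | Meaning |
| --- | --- | --- |
| $\mathbf{X}$ | $10 \times 784$ | Input |
| $\mathbf{W}_1$ | $784 \times 16$ | First-layer weights |
| $\mathbf{W}_2$ | $16 \times 1$ | Second-layer weights |
| $\mathbf{Z}_1$ | $10 \times 16$ | First-layer linear output |
| $\mathbf{A}_1$ | $10 \times 16$ | First-layer activation output |
| $\mathbf{Z}_2$ | $10 \times 1$ | Second-layer linear output |
| $\hat{\mathbf{y}}$ | $10 \times 1$ | Prediction |
| $\mathbf{y}$ | $10 \times 1$ | Ground-truth labels |
| $L$ | scalar | Loss function |
| $\alpha$ | scalar | Learning rate |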

BP: Forward Pass

$$\mathbf{Z}_1 = \mathbf{X} \mathbf{W}_1, \quad \mathbf{Z}_1 \in \mathbb{R}^{10 \times 16}$$

$$\mathbf{A}_1 = \sigma(\mathbf{Z}_1) = \frac{1}{1 + e^{-\mathbf{Z}_1}}, \quad \mathbf{A}_1 \in \mathbb{R}^{10 \times 16}$$

$$\mathbf{Z}_2 = \mathbf{A}_1 \mathbf{W}_2, \quad \mathbf{Z}_2 \in \mathbb{R}^{10 \times 1}$$

$$\hat{\mathbf{y}} = \sigma(\mathbf{Z}_2) = \frac{1}{1 + e^{-\mathbf{Z}_2}}, \quad \hat{\mathbf{y}} \in \mathbb{R}^{10 \times 1}$$

$$L = -\left[ \mathbf{y}^T \log(\hat{\mathbf{y}}) + (\mathbf{1} - \mathbf{y})^T \log(\mathbf{1} - \hat{\mathbf{y}}) \right], \quad L \in \mathbb{R}$$

Note: the sigmoid function (and the logarithm) is applied element-wise here.
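The forward pass can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the random input, the weight scale of 0.01, the seed, and the helper names `sigmoid` and `forward` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2, y):
    """Forward pass of the two-layer network; returns the loss and intermediates."""
    Z1 = X @ W1                      # (10, 16)
    A1 = sigmoid(Z1)                 # (10, 16)
    Z2 = A1 @ W2                     # (10, 1)
    y_hat = sigmoid(Z2)              # (10, 1)
    # Cross-entropy summed over the batch (no 1/N factor, matching the formula for L)
    L = -(y.T @ np.log(y_hat) + (1 - y).T @ np.log(1 - y_hat)).item()
    return L, (Z1, A1, Z2, y_hat)

# Hypothetical data: random inputs and random binary labels
rng = np.random.default_rng(0)
X = rng.random((10, 784))
W1 = 0.01 * rng.standard_normal((784, 16))
W2 = 0.01 * rng.standard_normal((16, 1))
y = rng.integers(0, 2, size=(10, 1)).astype(float)

L, (Z1, A1, Z2, y_hat) = forward(X, W1, W2, y)
print(Z1.shape, A1.shape, Z2.shape, y_hat.shape, L)
```

The printed shapes should match the dimensions listed in the notation table.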

BP: Backpropagation

$$\frac{\partial L}{\partial \mathbf{Z}_2} = \hat{\mathbf{y}} - \mathbf{y}, \quad \frac{\partial L}{\partial \mathbf{Z}_2} \in \mathbb{R}^{10 \times 1}$$

$$\frac{\partial L}{\partial \mathbf{W}_2} = \mathbf{A}_1^T \frac{\partial L}{\partial \mathbf{Z}_2}, \quad \frac{\partial L}{\partial \mathbf{W}_2} \in \mathbb{R}^{16 \times 1}$$

$$\frac{\partial L}{\partial \mathbf{A}_1} = \frac{\partial L}{\partial \mathbf{Z}_2} \mathbf{W}_2^T, \quad \frac{\partial L}{\partial \mathbf{A}_1} \in \mathbb{R}^{10 \times 16}$$

$$\frac{\partial L}{\partial \mathbf{Z}_1} = \frac{\partial L}{\partial \mathbf{A}_1} \odot \mathbf{A}_1 \odot (\mathbf{1} - \mathbf{A}_1), \quad \frac{\partial L}{\partial \mathbf{Z}_1} \in \mathbb{R}^{10 \times 16}$$

$$\frac{\partial L}{\partial \mathbf{W}_1} = \mathbf{X}^T \frac{\partial L}{\partial \mathbf{Z}_1}, \quad \frac{\partial L}{\partial \mathbf{W}_1} \in \mathbb{R}^{784 \times 16}$$

Note: $\odot$ denotes element-wise multiplication.
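One way to sanity-check these gradients is to compare them with central finite differences. The sketch below does this for a few entries of $\mathbf{W}_1$ and all of $\mathbf{W}_2$. The random data, the weight scale of 0.1, the step size `h`, and the helper names (`loss`, `backward`, `numeric_grad`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, W1, W2, y):
    """Summed binary cross-entropy of the two-layer network."""
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backward(X, W1, W2, y):
    """Analytic gradients of the loss with respect to W1 and W2."""
    A1 = sigmoid(X @ W1)
    y_hat = sigmoid(A1 @ W2)
    dZ2 = y_hat - y                  # (10, 1)
    dW2 = A1.T @ dZ2                 # (16, 1)
    dA1 = dZ2 @ W2.T                 # (10, 16)
    dZ1 = dA1 * A1 * (1 - A1)        # (10, 16)
    dW1 = X.T @ dZ1                  # (784, 16)
    return dW1, dW2

def numeric_grad(f, W, idx, h=1e-5):
    """Central finite difference of f() with respect to the single entry W[idx]."""
    old = W[idx]
    W[idx] = old + h
    f_plus = f()
    W[idx] = old - h
    f_minus = f()
    W[idx] = old                     # restore the original value
    return (f_plus - f_minus) / (2 * h)

# Hypothetical data
rng = np.random.default_rng(0)
X = rng.random((10, 784))
W1 = 0.1 * rng.standard_normal((784, 16))
W2 = 0.1 * rng.standard_normal((16, 1))
y = rng.integers(0, 2, size=(10, 1)).astype(float)

dW1, dW2 = backward(X, W1, W2, y)
f = lambda: loss(X, W1, W2, y)

# Largest absolute deviation between analytic and numeric gradients
err_W1 = max(abs(dW1[i, j] - numeric_grad(f, W1, (i, j)))
             for i, j in [(0, 0), (100, 5), (783, 15)])
err_W2 = max(abs(dW2[k, 0] - numeric_grad(f, W2, (k, 0))) for k in range(16))
print(err_W1, err_W2)
```

Both errors should be tiny compared with the gradient magnitudes if the formulas are implemented consistently.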

BP: Parameter Update

$$\mathbf{W}_1 \leftarrow \mathbf{W}_1 - \alpha \frac{\partial L}{\partial \mathbf{W}_1}$$

$$\mathbf{W}_2 \leftarrow \mathbf{W}_2 - \alpha \frac{\partial L}{\partial \mathbf{W}_2}$$
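Putting the forward pass, the gradients, and the update rule together gives a plain gradient-descent loop. The following is only a sketch under assumed settings: the learning rate `alpha = 1e-3`, the 100 iterations, the random data, and the function name `loss_and_grads` are illustrative choices, not values from this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, W1, W2, y):
    """Return the summed cross-entropy loss and the gradients for W1 and W2."""
    A1 = sigmoid(X @ W1)
    y_hat = sigmoid(A1 @ W2)
    L = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dZ2 = y_hat - y
    dW2 = A1.T @ dZ2
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)
    dW1 = X.T @ dZ1
    return L, dW1, dW2

# Hypothetical data and initial weights
rng = np.random.default_rng(0)
X = rng.random((10, 784))
W1 = 0.01 * rng.standard_normal((784, 16))
W2 = 0.01 * rng.standard_normal((16, 1))
y = rng.integers(0, 2, size=(10, 1)).astype(float)

alpha = 1e-3          # assumed learning rate
losses = []
for step in range(100):
    L, dW1, dW2 = loss_and_grads(X, W1, W2, y)
    losses.append(L)
    W1 = W1 - alpha * dW1   # W1 <- W1 - alpha * dL/dW1
    W2 = W2 - alpha * dW2   # W2 <- W2 - alpha * dL/dW2

print(losses[0], losses[-1])
```

With a sufficiently small learning rate the recorded loss is expected to go down over the iterations; a rate that is too large can make it oscillate or diverge.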

Task2: BatchNorm in MLP

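BN: Notation

| Symbol | Dimension | Meaning |
| --- | --- | --- |
| $\mathbf{X}$ | $N \times D$ | Input to the BatchNorm layer |
| $\boldsymbol{\gamma}$ | $D$ | Scale parameter |
| $\boldsymbol{\beta}$ | $D$ | Shift parameter |
| $\boldsymbol{\mu}_B$ | $D$ | Batch mean |
| $\boldsymbol{\sigma}_B^2$ | $D$ | Batch variance |
| $\hat{\mathbf{X}}$ | $N \times D$ | Normalized values |
| $\mathbf{Y}$ | $N \times D$ | BatchNorm output |
| $\frac{\partial L}{\partial \mathbf{Y}}$ | $N \times D$ | Upstream gradient |
| $\epsilon$ | scalar | Small constant |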

BN: Forward Pass

$$\boldsymbol{\mu}_B = \frac{1}{N} \sum_{n=1}^N \mathbf{X}_{n,:}, \quad \boldsymbol{\mu}_B \in \mathbb{R}^{D}$$

$$\boldsymbol{\sigma}_B^2 = \frac{1}{N} \sum_{n=1}^N (\mathbf{X}_{n,:} - \boldsymbol{\mu}_B)^2, \quad \boldsymbol{\sigma}_B^2 \in \mathbb{R}^{D}$$

$$\hat{\mathbf{X}} = \frac{\mathbf{X} - \mathbf{1}\boldsymbol{\mu}_B}{\sqrt{\boldsymbol{\sigma}_B^2 + \epsilon}}, \quad \hat{\mathbf{X}} \in \mathbb{R}^{N \times D}$$

$$\mathbf{Y} = \boldsymbol{\gamma} \odot \hat{\mathbf{X}} + \mathbf{1}\boldsymbol{\beta}, \quad \mathbf{Y} \in \mathbb{R}^{N \times D}$$

Note: $\mathbf{1} \in \mathbb{R}^{N \times 1}$ is an all-ones column vector; the square, square root, and division are element-wise, and the $D$-dimensional vectors are broadcast across the $N$ rows.
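A minimal NumPy sketch of this forward pass is shown below. The sizes `N = 8`, `D = 4`, the value `eps = 1e-5`, the random input, and the function name `batchnorm_forward` are assumptions for illustration; NumPy broadcasting plays the role of the $\mathbf{1}$ vector.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over the batch axis; returns Y and cached values."""
    mu = X.mean(axis=0)                      # (D,)  batch mean
    var = ((X - mu) ** 2).mean(axis=0)       # (D,)  biased batch variance (1/N)
    s = np.sqrt(var + eps)                   # (D,)
    X_hat = (X - mu) / s                     # (N, D), broadcast over rows
    Y = gamma * X_hat + beta                 # (N, D)
    return Y, (X_hat, s)

# Hypothetical input with non-zero mean and non-unit variance
rng = np.random.default_rng(0)
N, D = 8, 4
X = 3.0 * rng.standard_normal((N, D)) + 5.0
gamma = np.ones(D)
beta = np.zeros(D)

Y, (X_hat, s) = batchnorm_forward(X, gamma, beta)
print(Y.mean(axis=0))   # per-feature mean of the output
print(Y.var(axis=0))    # per-feature variance of the output
```

With $\boldsymbol{\gamma} = \mathbf{1}$ and $\boldsymbol{\beta} = \mathbf{0}$, each output column should have mean close to 0 and variance slightly below 1 (the gap comes from $\epsilon$).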

BN: Backpropagation

Let $s = \sqrt{\boldsymbol{\sigma}_B^2 + \epsilon}$ (element-wise, so $s \in \mathbb{R}^{D}$). Then:

$$\frac{\partial L}{\partial \boldsymbol{\gamma}} = \sum_{n=1}^N \frac{\partial L}{\partial \mathbf{Y}_{n,:}} \odot \hat{\mathbf{X}}_{n,:}, \quad \frac{\partial L}{\partial \boldsymbol{\gamma}} \in \mathbb{R}^{D}$$

$$\frac{\partial L}{\partial \boldsymbol{\beta}} = \sum_{n=1}^N \frac{\partial L}{\partial \mathbf{Y}_{n,:}}, \quad \frac{\partial L}{\partial \boldsymbol{\beta}} \in \mathbb{R}^{D}$$

$$\frac{\partial L}{\partial \hat{\mathbf{X}}} = \frac{\partial L}{\partial \mathbf{Y}} \odot \boldsymbol{\gamma}, \quad \frac{\partial L}{\partial \hat{\mathbf{X}}} \in \mathbb{R}^{N \times D}$$

$$\frac{\partial L}{\partial \mathbf{X}} = \frac{1}{N s} \odot \left( N \frac{\partial L}{\partial \hat{\mathbf{X}}} - \mathbf{1} \mathbf{s}_1 - \hat{\mathbf{X}} \odot \mathbf{1} \mathbf{s}_2 \right), \quad \frac{\partial L}{\partial \mathbf{X}} \in \mathbb{R}^{N \times D}$$

where:

$$\mathbf{s}_1 = \sum_{n=1}^N \frac{\partial L}{\partial \hat{\mathbf{X}}_{n,:}} \in \mathbb{R}^{D}$$

$$\mathbf{s}_2 = \sum_{n=1}^N \frac{\partial L}{\partial \hat{\mathbf{X}}_{n,:}} \odot \hat{\mathbf{X}}_{n,:} \in \mathbb{R}^{D}$$

Note: $\odot$ denotes element-wise multiplication, division is element-wise, and $\mathbf{1} \in \mathbb{R}^{N \times 1}$ is an all-ones column vector, so $\mathbf{1}\mathbf{s}_1$ and $\mathbf{1}\mathbf{s}_2$ repeat the row vectors $\mathbf{s}_1$ and $\mathbf{s}_2$ across all $N$ rows.
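The backward formulas can be checked numerically. The sketch below uses an assumed scalar loss $L = \sum_{n,d} Y_{n,d} G_{n,d}$ with a random matrix $G$, so that $\frac{\partial L}{\partial \mathbf{Y}} = G$, and compares the analytic $\frac{\partial L}{\partial \mathbf{X}}$ with central finite differences. The sizes, `eps`, `h`, and function names are illustrative assumptions.

```python
import numpy as np

def bn_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)
    s = np.sqrt(((X - mu) ** 2).mean(axis=0) + eps)
    X_hat = (X - mu) / s
    return gamma * X_hat + beta, X_hat, s

def bn_backward(dY, X_hat, s, gamma):
    """Gradients with respect to X, gamma, and beta, following the formulas above."""
    N = dY.shape[0]
    dgamma = np.sum(dY * X_hat, axis=0)      # (D,)
    dbeta = np.sum(dY, axis=0)               # (D,)
    dX_hat = dY * gamma                      # (N, D)
    s1 = np.sum(dX_hat, axis=0)              # (D,)
    s2 = np.sum(dX_hat * X_hat, axis=0)      # (D,)
    dX = (N * dX_hat - s1 - X_hat * s2) / (N * s)
    return dX, dgamma, dbeta

# Hypothetical data
rng = np.random.default_rng(0)
N, D = 6, 3
X = rng.standard_normal((N, D))
gamma = rng.standard_normal(D)
beta = rng.standard_normal(D)
G = rng.standard_normal((N, D))              # plays the role of dL/dY

def scalar_loss(X_in):
    Y, _, _ = bn_forward(X_in, gamma, beta)
    return np.sum(Y * G)

Y, X_hat, s = bn_forward(X, gamma, beta)
dX, dgamma, dbeta = bn_backward(G, X_hat, s, gamma)

# Central finite differences for every entry of X
h = 1e-6
dX_num = np.zeros_like(X)
for n in range(N):
    for d in range(D):
        Xp = X.copy(); Xp[n, d] += h
        Xm = X.copy(); Xm[n, d] -= h
        dX_num[n, d] = (scalar_loss(Xp) - scalar_loss(Xm)) / (2 * h)

max_err = np.max(np.abs(dX - dX_num))
print(max_err)
```

Because shifting every sample by the same constant leaves $\hat{\mathbf{X}}$ unchanged, each column of `dX` should also sum to approximately zero.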

BN: Parameter Update

$$\boldsymbol{\gamma} \leftarrow \boldsymbol{\gamma} - \alpha \frac{\partial L}{\partial \boldsymbol{\gamma}}$$

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \alpha \frac{\partial L}{\partial \boldsymbol{\beta}}$$

Derivation of the Backward Pass

Consider a single feature dimension ($D = 1$) with $N$ samples: $x_1, x_2, \dots, x_N$.

Given $\frac{\partial L}{\partial \hat{x}_i}$ for $i = 1, \dots, N$, we want $\frac{\partial L}{\partial x_i}$.

Step 1: Forward-Pass Expressions

$$\mu = \frac{1}{N} \sum_{k=1}^N x_k \tag{1}$$

$$\sigma^2 = \frac{1}{N} \sum_{k=1}^N (x_k - \mu)^2 \tag{2}$$

$$s = \sqrt{\sigma^2 + \epsilon} \tag{3}$$

$$\hat{x}_i = \frac{x_i - \mu}{s} \tag{4}$$

Step 2: Chain Rule

Every $\hat{x}_j$ depends on $x_i$ through $\mu$ and $s$, so the sum runs over all $j$:

$$\frac{\partial L}{\partial x_i} = \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} \cdot \frac{\partial \hat{x}_j}{\partial x_i} \tag{5}$$

Step 3: Compute $\frac{\partial \hat{x}_j}{\partial x_i}$

Writing (4) for index $j$ and differentiating with respect to $x_i$ using the quotient rule:

$$\frac{\partial \hat{x}_j}{\partial x_i} = \frac{ \frac{\partial (x_j - \mu)}{\partial x_i} \cdot s - (x_j - \mu) \cdot \frac{\partial s}{\partial x_i} }{s^2} \tag{6}$$

Step 4: Compute $\frac{\partial (x_j - \mu)}{\partial x_i}$

From (1), $\frac{\partial \mu}{\partial x_i} = \frac{1}{N}$, so:

$$\frac{\partial (x_j - \mu)}{\partial x_i} = \delta_{ij} - \frac{1}{N} \tag{7}$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.
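As a quick check of (7), the sketch below perturbs each $x_i$ and measures how every centered value $x_j - \mu$ changes. The sample values and the step size `h` are arbitrary choices for illustration; only the Python standard library is used.

```python
def centered(x):
    """Return the list of x_k - mu for all k."""
    mu = sum(x) / len(x)
    return [xk - mu for xk in x]

x = [1.0, 4.0, -2.0, 0.5]   # arbitrary example values
N = len(x)
h = 1e-6                    # finite-difference step

max_err = 0.0
for i in range(N):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    cp, cm = centered(xp), centered(xm)
    for j in range(N):
        numeric = (cp[j] - cm[j]) / (2 * h)
        analytic = (1.0 if i == j else 0.0) - 1.0 / N   # delta_ij - 1/N
        max_err = max(max_err, abs(numeric - analytic))

print(max_err)
```

Since centering is a linear map, the finite difference should agree with (7) up to rounding error.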

Step 5: Compute $\frac{\partial s}{\partial x_i}$

From (3): $\frac{\partial s}{\partial x_i} = \frac{1}{2s} \cdot \frac{\partial \sigma^2}{\partial x_i}$

Compute $\frac{\partial \sigma^2}{\partial x_i}$ from (2), using (7):

$$\frac{\partial \sigma^2}{\partial x_i} = \frac{2}{N} \sum_{k=1}^N (x_k - \mu) \left( \delta_{ik} - \frac{1}{N} \right) = \frac{2}{N} (x_i - \mu) \tag{8}$$

The second equality holds because the $\frac{1}{N}$ term multiplies $\sum_{k=1}^N (x_k - \mu) = 0$.

Therefore:

$$\frac{\partial s}{\partial x_i} = \frac{x_i - \mu}{N s} \tag{9}$$
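Equations (8) and (9) can be verified numerically in the same way. In the sketch below the sample values, the constant `EPS = 1e-5`, and the step size `h` are assumed for illustration.

```python
import math

EPS = 1e-5  # assumed value of the small constant epsilon

def stats(x):
    """Return (mu, sigma^2, s) as defined in (1)-(3)."""
    N = len(x)
    mu = sum(x) / N
    var = sum((xk - mu) ** 2 for xk in x) / N
    return mu, var, math.sqrt(var + EPS)

x = [1.0, 4.0, -2.0, 0.5]   # arbitrary example values
N = len(x)
mu, var, s = stats(x)
h = 1e-6

err_var = 0.0   # worst-case deviation from (8)
err_s = 0.0     # worst-case deviation from (9)
for i in range(N):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    _, var_p, s_p = stats(xp)
    _, var_m, s_m = stats(xm)
    d_var = (var_p - var_m) / (2 * h)
    d_s = (s_p - s_m) / (2 * h)
    err_var = max(err_var, abs(d_var - 2.0 / N * (x[i] - mu)))
    err_s = max(err_s, abs(d_s - (x[i] - mu) / (N * s)))

print(err_var, err_s)
```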

Step 6: Substitute into (6)

Substituting (7) and (9) into (6):

$$\frac{\partial \hat{x}_j}{\partial x_i} = \frac{ \left( \delta_{ij} - \frac{1}{N} \right) s - (x_j - \mu) \cdot \frac{x_i - \mu}{N s} }{s^2}$$

Simplifying:

$$\frac{\partial \hat{x}_j}{\partial x_i} = \frac{1}{s} \left( \delta_{ij} - \frac{1}{N} \right) - \frac{1}{N s^3} (x_j - \mu)(x_i - \mu) \tag{10}$$
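The full Jacobian in (10) can be compared entry by entry with finite differences of the normalization (4). This is a small sketch with assumed sample values, `EPS`, and step size `h`.

```python
import math

EPS = 1e-5  # assumed value of the small constant epsilon

def normalize(x):
    """Return (x_hat, mu, s) following (1)-(4)."""
    N = len(x)
    mu = sum(x) / N
    s = math.sqrt(sum((xk - mu) ** 2 for xk in x) / N + EPS)
    return [(xk - mu) / s for xk in x], mu, s

x = [1.0, 4.0, -2.0, 0.5]   # arbitrary example values
N = len(x)
x_hat, mu, s = normalize(x)
h = 1e-6

max_err = 0.0
for i in range(N):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    hat_p, _, _ = normalize(xp)
    hat_m, _, _ = normalize(xm)
    for j in range(N):
        numeric = (hat_p[j] - hat_m[j]) / (2 * h)
        delta = 1.0 if i == j else 0.0
        # Right-hand side of (10)
        analytic = (delta - 1.0 / N) / s - (x[j] - mu) * (x[i] - mu) / (N * s ** 3)
        max_err = max(max_err, abs(numeric - analytic))

print(max_err)
```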

Step 7: Substitute into the Chain Rule (5)

$$\frac{\partial L}{\partial x_i} = \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} \left[ \frac{1}{s} \left( \delta_{ij} - \frac{1}{N} \right) - \frac{1}{N s^3} (x_j - \mu)(x_i - \mu) \right]$$

Expanding into three terms:

$$\frac{\partial L}{\partial x_i} = \frac{1}{s} \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} \delta_{ij} - \frac{1}{N s} \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} - \frac{1}{N s^3} (x_i - \mu) \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} (x_j - \mu)$$

In the first term, $\sum_j \frac{\partial L}{\partial \hat{x}_j} \delta_{ij} = \frac{\partial L}{\partial \hat{x}_i}$, so:

$$\frac{\partial L}{\partial x_i} = \frac{1}{s} \frac{\partial L}{\partial \hat{x}_i} - \frac{1}{N s} \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} - \frac{1}{N s^3} (x_i - \mu) \sum_{j=1}^N \frac{\partial L}{\partial \hat{x}_j} (x_j - \mu) \tag{11}$$

Step 8: Replace $(x - \mu)$ with $\hat{x}$

From (4), $x_i - \mu = s \hat{x}_i$. Substituting into the third term of (11):

$$\frac{1}{N s^3} (x_i - \mu) \sum_j \frac{\partial L}{\partial \hat{x}_j} (x_j - \mu) = \frac{1}{N s^3} (s \hat{x}_i) \sum_j \frac{\partial L}{\partial \hat{x}_j} (s \hat{x}_j) = \frac{1}{N s} \hat{x}_i \sum_j \frac{\partial L}{\partial \hat{x}_j} \hat{x}_j$$

Substituting back into (11):

$$\frac{\partial L}{\partial x_i} = \frac{1}{s} \frac{\partial L}{\partial \hat{x}_i} - \frac{1}{N s} \sum_j \frac{\partial L}{\partial \hat{x}_j} - \frac{1}{N s} \hat{x}_i \sum_j \frac{\partial L}{\partial \hat{x}_j} \hat{x}_j$$

Factoring out $\frac{1}{N s}$:

$$\frac{\partial L}{\partial x_i} = \frac{1}{N s} \left( N \frac{\partial L}{\partial \hat{x}_i} - \sum_j \frac{\partial L}{\partial \hat{x}_j} - \hat{x}_i \sum_j \frac{\partial L}{\partial \hat{x}_j} \hat{x}_j \right) \tag{12}$$
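To confirm that the compact form (12) agrees with the chain-rule sum (5) evaluated with the Jacobian (10), the sketch below computes both for one assumed example. The sample values `x`, the upstream gradients `g` (standing in for $\frac{\partial L}{\partial \hat{x}_j}$), and `EPS` are hypothetical.

```python
import math

EPS = 1e-5                   # assumed value of the small constant epsilon
x = [1.0, 4.0, -2.0, 0.5]    # arbitrary example values
g = [0.3, -1.2, 0.7, 2.0]    # hypothetical upstream gradients dL/dx_hat_j

N = len(x)
mu = sum(x) / N
s = math.sqrt(sum((xk - mu) ** 2 for xk in x) / N + EPS)
x_hat = [(xk - mu) / s for xk in x]

def jac(j, i):
    """d x_hat_j / d x_i according to (10)."""
    delta = 1.0 if i == j else 0.0
    return (delta - 1.0 / N) / s - (x[j] - mu) * (x[i] - mu) / (N * s ** 3)

# Chain rule (5): sum over j of dL/dx_hat_j * d x_hat_j / d x_i
via_chain = [sum(g[j] * jac(j, i) for j in range(N)) for i in range(N)]

# Compact form (12)
sum_g = sum(g)
sum_gx = sum(gj * xj for gj, xj in zip(g, x_hat))
via_eq12 = [(N * g[i] - sum_g - x_hat[i] * sum_gx) / (N * s) for i in range(N)]

max_diff = max(abs(a - b) for a, b in zip(via_chain, via_eq12))
print(max_diff)
```

The two results should match up to floating-point rounding. Note that (12) needs only two sums over the batch, rather than the $N \times N$ Jacobian.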
© 2026 miniyuan. All rights reserved.