Appendix: Contrastive Divergence
Starting with the restricted Boltzmann machine (RBM) as defined earlier, we want to derive the contrastive divergence algorithm for training it.
The goal is to adjust the weights of the network so that it assigns low energy, and hence high probability, to the training data.
We have:
- A visible layer $\mathbf{v}$ and a hidden layer $\mathbf{h}$.
- A weight matrix $W$ that connects the visible and hidden layers.
- A bias vector $\mathbf{b}$ for the visible layer and a bias vector $\mathbf{c}$ for the hidden layer.
The energy function is:
$$E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j$$
or, in matrix form, $E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^{\top} W \mathbf{h} - \mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h}$.
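As a concrete reference point, here is a minimal NumPy sketch of this energy function; the helper name `energy` and the toy sizes are illustrative, assuming binary unit vectors and the shapes defined above ($W$ is $m \times n$).

```python
import numpy as np

def energy(v, h, W, b, c):
    """Energy of a joint configuration (v, h).

    v : (m,) visible vector, h : (n,) hidden vector,
    W : (m, n) weight matrix, b : (m,) visible bias, c : (n,) hidden bias.
    Implements E(v, h) = -v^T W h - b^T v - c^T h.
    """
    return -v @ W @ h - b @ v - c @ h

# Example: a tiny RBM with 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
b = np.zeros(3)
c = np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(energy(v, h, W, b, c))
```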
The joint distribution is:
$$P(\mathbf{v},\mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h})}$$
where $Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$ is the partition function, which normalizes the distribution.
We train the RBM by maximizing the likelihood of the training data, i.e. maximizing $\log P(\mathbf{v})$.
The marginal likelihood of the visible layer is given by:
$$P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v},\mathbf{h})$$
Then the log-likelihood is:
$$\log P(\mathbf{v}) = \log \sum_{\mathbf{h}} \frac{1}{Z} e^{-E(\mathbf{v},\mathbf{h})} = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} - \log Z$$
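For a tiny RBM, both the sum over $\mathbf{h}$ and the partition function $Z$ can be evaluated exactly by enumerating every binary configuration, which makes the log-likelihood above easy to check numerically. This is a sketch only, assuming small layer sizes (the cost grows as $2^m \cdot 2^n$) and reusing the hypothetical `energy` helper from the previous snippet.

```python
from itertools import product
import numpy as np

def all_states(n):
    """All binary vectors of length n, shape (2**n, n)."""
    return np.array(list(product([0.0, 1.0], repeat=n)))

def log_likelihood(v, W, b, c):
    """Exact log P(v) = log sum_h e^{-E(v,h)} - log Z for a tiny RBM."""
    m, n = W.shape
    H = all_states(n)
    V = all_states(m)
    # Unnormalized term: log sum_h exp(-E(v, h))
    log_num = np.log(np.sum([np.exp(-energy(v, h, W, b, c)) for h in H]))
    # Partition function: Z = sum over all (v, h) of exp(-E(v, h))
    Z = np.sum([np.exp(-energy(vv, h, W, b, c)) for vv in V for h in H])
    return log_num - np.log(Z)
```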
Differentiating with respect to the weights $w_{ij}$ gives:
$$\begin{aligned}
\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}}
&= \frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}
\sum_{\mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})} \\
&\quad - \frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})}
\end{aligned}$$
Similar forms exist for the biases $b_i$ and $c_j$.
Since we are performing gradient ascent, the weights are updated in the direction of this gradient:
$$w_{ij} \leftarrow w_{ij} + \eta \frac{\partial \log P(\mathbf{v})}{\partial w_{ij}}$$
where $\eta$ is the learning rate.
Since $-\frac{\partial E(\mathbf{v},\mathbf{h})}{\partial w_{ij}} = v_i h_j$, the two terms above are expectations of $v_i h_j$ under the data and model distributions, and we get our weight update rule:
$$\Delta w_{ij} = \eta \frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$$
A similar process can be followed for the biases $b_i$ and $c_j$:
$$\Delta b_i = \eta \left( \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}} \right)$$
$$\Delta c_j = \eta \left( \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}} \right)$$
where $\langle \cdot \rangle_{\text{data}}$ is the expectation with respect to the training data and $\langle \cdot \rangle_{\text{model}}$ is the expectation with respect to the model distribution.
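To make the two expectations concrete, the sketch below computes $\langle v_i h_j \rangle_{\text{data}}$ from a batch of visible vectors and the exact $\langle v_i h_j \rangle_{\text{model}}$ by enumerating every configuration of a tiny RBM; their difference is the gradient of the average log-likelihood with respect to $W$. It assumes binary units (so $P(h_j = 1 \mid \mathbf{v})$ is a logistic function of the weighted input), reuses the hypothetical `energy` and `all_states` helpers from the earlier sketches, and the exact model expectation is only feasible for very small layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def data_expectation(V_batch, W, b, c):
    """<v_i h_j>_data: average of v_i * P(h_j = 1 | v) over the batch."""
    P_h = sigmoid(V_batch @ W + c)          # (N, n) hidden activation probabilities
    return V_batch.T @ P_h / len(V_batch)   # (m, n)

def model_expectation(W, b, c):
    """<v_i h_j>_model: exact expectation under P(v, h) by enumeration."""
    m, n = W.shape
    stats = np.zeros((m, n))
    Z = 0.0
    for v in all_states(m):
        for h in all_states(n):
            p = np.exp(-energy(v, h, W, b, c))  # unnormalized probability
            stats += p * np.outer(v, h)
            Z += p
    return stats / Z

# Gradient of the average log-likelihood with respect to W:
# grad_W = data_expectation(V_batch, W, b, c) - model_expectation(W, b, c)
```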
The next step is to approximate the model expectation using Gibbs sampling.
- Positive phase: clamp the visible units to a training example $\mathbf{v}^{(0)}$ and sample $\mathbf{h}^{(0)} \sim P(\mathbf{h} \mid \mathbf{v}^{(0)})$.
- Negative phase: run $k$ steps of Gibbs sampling (the form of the conditionals for binary units is given after this list), alternating between
  - $\mathbf{v}^{(t+1)} \sim P(\mathbf{v} \mid \mathbf{h}^{(t)})$
  - and $\mathbf{h}^{(t+1)} \sim P(\mathbf{h} \mid \mathbf{v}^{(t+1)})$.
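For binary units (an assumption this appendix does not state explicitly, but the standard setting for this model), both conditionals factorize over units and take a logistic form, which is what the sampling steps above draw from:
$$P(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(c_j + \sum_{i=1}^{m} w_{ij} v_i\Big), \qquad P(v_i = 1 \mid \mathbf{h}) = \sigma\!\Big(b_i + \sum_{j=1}^{n} w_{ij} h_j\Big), \qquad \sigma(x) = \frac{1}{1+e^{-x}}$$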
Once those steps are done, we use the samples from the final Gibbs step to approximate the model expectations and update the weights and biases according to:
$$\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$$
$$\Delta b_i = \eta \left( \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}} \right)$$
$$\Delta c_j = \eta \left( \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}} \right)$$
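Putting the pieces together, the following is a minimal NumPy sketch of one CD-$k$ update for a binary RBM, following the two phases above; the function name `cd_k_update` and all variable names are illustrative, and with $k=1$ it corresponds to the common CD-1 variant. It is a sketch of the procedure described in this appendix, not a tuned implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V_batch, W, b, c, k=1, eta=0.01, rng=None):
    """One contrastive-divergence (CD-k) update for a binary RBM.

    V_batch : (N, m) batch of training examples, W : (m, n), b : (m,), c : (n,).
    Returns the updated (W, b, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(V_batch)

    # Positive phase: clamp v to the data and compute P(h | v).
    ph_data = sigmoid(V_batch @ W + c)              # (N, n)
    pos_W = V_batch.T @ ph_data                     # N * <v_i h_j>_data
    pos_b = V_batch.sum(axis=0)                     # N * <v_i>_data
    pos_c = ph_data.sum(axis=0)                     # N * <h_j>_data

    # Negative phase: k steps of alternating Gibbs sampling, started at the data.
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    for _ in range(k):
        pv = sigmoid(h @ W.T + b)                   # P(v | h)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)                     # P(h | v)
        h = (rng.random(ph.shape) < ph).astype(float)

    neg_W = v.T @ ph                                # N * <v_i h_j>_model (CD approximation)
    neg_b = v.sum(axis=0)
    neg_c = ph.sum(axis=0)

    # Gradient-ascent step on the approximate log-likelihood gradient.
    W = W + eta * (pos_W - neg_W) / N
    b = b + eta * (pos_b - neg_b) / N
    c = c + eta * (pos_c - neg_c) / N
    return W, b, c
```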