A Tiny Boltzmann Machine

May 15, 2025 - 15:15
Appendix: Contrastive Divergence

Starting with a Boltzmann machine as defined earlier, we want to derive the contrastive divergence algorithm for training. The goal is to adjust the weights of the network to minimize the energy of the training data.

We have:

  • A visible layer $v$ and a hidden layer $h$.
  • A weight matrix $W$ that connects the visible and hidden layers.
  • A bias vector $b$ for the visible layer and a bias vector $c$ for the hidden layer.

Energy function, written element-wise and in matrix form:

$$E(v,h) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j = -v^\top W h - b^\top v - c^\top h$$

Joint distribution:

$$P(v,h) = \frac{1}{Z} e^{-E(v,h)}$$

where $Z$ is the partition function, which normalizes the distribution.
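As a concrete illustration (a minimal NumPy sketch with hypothetical toy values, not code from the post), the energy and the unnormalized probability of a single configuration follow directly from these definitions:

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h for one configuration."""
    return -(v @ W @ h) - (b @ v) - (c @ h)

# Hypothetical toy configuration: m = 3 visible units, n = 2 hidden units.
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
W = np.array([[0.2, -0.1],
              [0.4,  0.3],
              [-0.5, 0.6]])
b = np.array([0.0, 0.1, -0.2])
c = np.array([0.3, -0.3])

E = energy(v, h, W, b, c)
unnormalized_p = np.exp(-E)   # P(v, h) up to the partition function Z
```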

We train the RBM by maximizing the likelihood of the training data, i.e. maximizing $\log P(v)$.

The marginal likelihood of the visible layer is given by:

$$P(v) = \sum_{h} P(v,h)$$

Then the log-likelihood is:

$$\log P(v) = \log \sum_{h} \frac{1}{Z} e^{-E(v,h)} = \log \sum_{h} e^{-E(v,h)} - \log Z$$

Differentiating with respect to the weights $w_{ij}$ gives:

$$\begin{align*} \frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} &= \frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}} \sum_{\mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})} \\ &\quad - \frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})} \end{align*}$$

Similar forms exist for the biases $b_i$ and $c_j$.
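Because the energy is linear in each weight, $-\partial E(\mathbf{v},\mathbf{h}) / \partial w_{ij} = v_i h_j$, so the two terms above are exactly a conditional and a joint expectation of $v_i h_j$ (a step left implicit above):

$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\, v_i h_j - \sum_{\mathbf{v}, \mathbf{h}} P(\mathbf{v}, \mathbf{h})\, v_i h_j = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}$$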

Since we are maximizing the log-likelihood, we perform gradient ascent:

$$\Delta w_{ij} \leftarrow \Delta w_{ij} + \eta \frac{\partial \log P(v)}{\partial w_{ij}}$$

Therefore we get our weight update rule:

$$\Delta w_{ij} = \eta \frac{\partial \log P(v)}{\partial w_{ij}} = \eta \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)$$

A similar process can be followed for the biases $b_i$ and $c_j$:

$$\Delta b_i = \eta \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{model} \right)$$

$$\Delta c_j = \eta \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} \right)$$

where $\langle \cdot \rangle_{data}$ is the expectation with respect to the training data and $\langle \cdot \rangle_{model}$ is the expectation with respect to the model distribution.
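The data-side expectations are easy: they are just averages over the training examples. A minimal NumPy sketch (hypothetical toy minibatch, not from the post):

```python
import numpy as np

# Hypothetical minibatch: N visible vectors (N x m) and the corresponding
# hidden activations (N x n), e.g. sampled from P(h | v).
V = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)   # N = 2, m = 3
H = np.array([[1, 0], [0, 1]], dtype=float)         # N = 2, n = 2
N = V.shape[0]

vh_data = V.T @ H / N       # <v_i h_j>_data, shape (m, n)
v_data = V.mean(axis=0)     # <v_i>_data, shape (m,)
h_data = H.mean(axis=0)     # <h_j>_data, shape (n,)
```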

The next step is to approximate the model expectation using Gibbs sampling.

  1. Positive phase: Sample $\mathbf{h}^{(0)} \sim P(\mathbf{h} \mid \mathbf{v}^{(0)} = \text{data})$.
  2. Negative phase: Run $k$ steps of Gibbs sampling (see the sketch after this list), alternating between
  • $\mathbf{v}^{(t+1)} \sim P(\mathbf{v} \mid \mathbf{h}^{(t)})$
  • and $\mathbf{h}^{(t+1)} \sim P(\mathbf{h} \mid \mathbf{v}^{(t+1)})$.
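For binary units the conditionals factorize into sigmoids, so each Gibbs step is cheap. Below is a minimal NumPy sketch of the two sampling steps, assuming binary visible and hidden units; the helper names sample_h and sample_v are my own, not from the post:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c, rng):
    """Sample h ~ P(h | v): each h_j is Bernoulli(sigmoid(c_j + sum_i v_i w_ij))."""
    p_h = sigmoid(v @ W + c)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_v(h, W, b, rng):
    """Sample v ~ P(v | h): each v_i is Bernoulli(sigmoid(b_i + sum_j w_ij h_j))."""
    p_v = sigmoid(h @ W.T + b)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v
```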

Once those steps are done we update the weights and biases, using the samples from the $k$-th Gibbs step to approximate the model expectations:

$$\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)$$

$$\Delta b_i = \eta \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{model} \right)$$

$$\Delta c_j = \eta \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} \right)$$
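Putting the pieces together, one CD-k update on a minibatch might look like the following sketch, reusing the sample_h/sample_v helpers above (again, my own naming and default values, not the post's code):

```python
def cd_k_update(v_data, W, b, c, k=1, eta=0.1, rng=None):
    """One contrastive-divergence update on a minibatch v_data of shape (N, m)."""
    rng = rng if rng is not None else np.random.default_rng()
    N = v_data.shape[0]

    # Positive phase: h^(0) ~ P(h | v^(0) = data).
    h, p_h0 = sample_h(v_data, W, c, rng)

    # Negative phase: k alternating Gibbs steps starting from h^(0).
    for _ in range(k):
        v, _ = sample_v(h, W, b, rng)
        h, p_h = sample_h(v, W, c, rng)

    # Data statistics come from the data; model statistics from the k-step samples.
    # (Using the hidden probabilities p_h0 / p_h instead of binary samples is a
    # common variance-reduction choice, not something specified in the text.)
    dW = (v_data.T @ p_h0 - v.T @ p_h) / N
    db = (v_data - v).mean(axis=0)
    dc = (p_h0 - p_h).mean(axis=0)

    return W + eta * dW, b + eta * db, c + eta * dc
```

Looping this update over minibatches of the training data is the whole training procedure.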
