

First term of $B$ favors sparseness and simple CFs.

Simple component functions (CFs). The term

\begin{displaymath}
T1 := \sum_{i,j \in O \times H \cup H \times I}
\log \sum_{k \in O} \left(\frac{\partial y^k}{\partial w_{ij}}\right)^{2}
\end{displaymath}

reduces the sensitivity of the outputs with respect to the weights, which makes $T1$ responsible for pruning weights (and, therefore, units). The chain rule allows for rewriting

\begin{displaymath}
\frac{\partial y^k}{\partial w_{ij}} =
\frac{\partial y^k}{\partial y^i} \frac{\partial y^i}{\partial w_{ij}}
=
\frac{\partial y^k}{\partial y^i} f_i'(s_i) \, y^j ,
\end{displaymath}

where $f_i'(s_i)$ is the derivative of the activation function $f_i$ of unit $i$ with activation $y^i = f_i(s_i)$. If unit $j$'s activation $y^j$ decreases towards zero, then $\frac{\partial y^k}{\partial w_{ij}}$ decreases for all $i$. If the first-order derivative $f_i'(s_i)$ of unit $i$ decreases towards zero, then $\frac{\partial y^k}{\partial w_{ij}}$ decreases for all $j$. Note that $f_i'(s_i)$ and $y^j$ are independent of $k$ and can be placed outside of the sum $\sum_{k \in O}$ in $T1$. We obtain:
\begin{eqnarray*}
T1 & = & \sum_{i,j \in O \times H \cup H \times I}
\left( 2 \log f_i'(s_i) + 2 \log y^j +
\log \sum_{k \in O} \left(\frac{\partial y^k}{\partial y^i}\right)^{2} \right) \\
& = & 2 \sum_{i \in O \cup H} \mbox{fan-in}(i) \log f_i'(s_i)
+ 2 \sum_{j \in H \cup I} \mbox{fan-out}(j) \log y^j \\
& & + \sum_{i \in O \cup H} \mbox{fan-in}(i) \log \sum_{k \in O}
\left(\frac{\partial y^k}{\partial y^i}\right)^{2},
\end{eqnarray*}

where fan-in$(i)$ (fan-out$(i)$) denotes the number of incoming (outgoing) weights of unit $i$.
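
The decomposition of $T1$ can be checked numerically on a toy network. The following is a minimal sketch, not the authors' implementation. It assumes a fully connected network with one hidden layer, logistic units, no biases, and strictly positive inputs (so that $\log y^j$ is defined). All names (`forward`, `t1_direct`, `t1_decomposed`) are hypothetical. The direct form estimates $\partial y^k / \partial w_{ij}$ by central differences; the decomposed form uses the three sums. For an output unit $i$, $\partial y^k / \partial y^i$ is 1 if $k = i$ and 0 otherwise, so its contribution to the third sum is $\log 1 = 0$.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(W_hi, W_oh, x):
    """Activations of hidden and output units (assumed: no biases)."""
    y_h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in W_hi]
    y_o = [sigmoid(sum(w * v for w, v in zip(row, y_h))) for row in W_oh]
    return y_h, y_o

def t1_direct(W_hi, W_oh, x, eps=1e-5):
    """T1 from its definition; dy^k/dw_ij via central differences."""
    W_hi = [row[:] for row in W_hi]  # work on copies
    W_oh = [row[:] for row in W_oh]
    total = 0.0
    for W in (W_hi, W_oh):
        for row in W:
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps
                y_plus = forward(W_hi, W_oh, x)[1]
                row[j] = old - eps
                y_minus = forward(W_hi, W_oh, x)[1]
                row[j] = old
                total += math.log(sum(((p - m) / (2 * eps)) ** 2
                                      for p, m in zip(y_plus, y_minus)))
    return total

def t1_decomposed(W_hi, W_oh, x):
    """The three sums of the decomposition, returned separately."""
    y_h, y_o = forward(W_hi, W_oh, x)
    fp_h = [y * (1.0 - y) for y in y_h]  # logistic derivative f'(s) = y (1 - y)
    fp_o = [y * (1.0 - y) for y in y_o]
    n_in, n_h, n_o = len(x), len(y_h), len(y_o)
    # fan-in: n_in for hidden units, n_h for output units
    term_deriv = 2 * (n_in * sum(math.log(d) for d in fp_h)
                      + n_h * sum(math.log(d) for d in fp_o))
    # fan-out: n_h for input units, n_o for hidden units
    term_act = 2 * (n_h * sum(math.log(v) for v in x)
                    + n_o * sum(math.log(v) for v in y_h))
    # hidden unit i: dy^k/dy^i = f_k'(s_k) w_ki; output units contribute log 1 = 0
    term_infl = n_in * sum(
        math.log(sum((fp_o[k] * W_oh[k][i]) ** 2 for k in range(n_o)))
        for i in range(n_h))
    return term_deriv, term_act, term_infl

random.seed(0)
n_in, n_h, n_o = 4, 3, 2
x = [random.uniform(0.1, 0.9) for _ in range(n_in)]
W_hi = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_h)]
W_oh = [[random.gauss(0, 1) for _ in range(n_h)] for _ in range(n_o)]

direct = t1_direct(W_hi, W_oh, x)
parts = t1_decomposed(W_hi, W_oh, x)
print("direct:", direct, "decomposed:", sum(parts))
```

The two values should agree up to finite-difference error. Returning the three sums separately makes it easy to see which one dominates for a given weight setting.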

$T1$ makes (1) unit activations decrease to zero in proportion to their fan-outs, (2) first-order derivatives of activation functions decrease to zero in proportion to their fan-ins, and (3) the influence of units on the output decrease to zero in proportion to their fan-ins. For a detailed analysis see Hochreiter and Schmidhuber (1997a). $T1$ is the reason why low-complexity (or simple) CFs are preferred.

Sparseness. Point (1) above favors sparse hidden unit activations (here: few active components); point (2) favors non-informative hidden unit activations hardly affected by small input changes. Point (3) favors sparse hidden unit activations in the sense that ``few hidden units contribute to producing the output''. In particular, sigmoid hidden units with activation function $\frac{1}{1+\exp(-x)}$ favor near-zero activations.
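
One way to see why logistic units end up near zero rather than near one is to evaluate the first two sums of the decomposition for a single unit. This is an illustrative sketch only: it assumes fan-in and fan-out of 1, ignores the third sum (the unit's influence on the output), and uses the hypothetical name `unit_contribution`. For the logistic function, $f'(s) = y(1-y)$, so at strongly negative $s$ both $\log f'(s)$ and $\log y$ are very negative, while at strongly positive $s$ only $\log f'(s)$ is.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def unit_contribution(s, fan_in=1, fan_out=1):
    """2 * fan_in * log f'(s) + 2 * fan_out * log y for one logistic unit."""
    y = sigmoid(s)
    f_prime = y * (1.0 - y)  # derivative of the logistic function
    return 2 * fan_in * math.log(f_prime) + 2 * fan_out * math.log(y)

# Compare a unit that is nearly off, half on, and nearly saturated at one.
for s in (-4.0, 0.0, 4.0):
    print(f"s={s:+.1f}  y={sigmoid(s):.3f}  "
          f"contribution={unit_contribution(s):.2f}")
```

Under these assumptions the contribution is lowest for the nearly-off unit, so minimizing it pushes activations towards zero rather than towards one.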


Juergen Schmidhuber 2003-02-13

