The first term, $T1$, favors sparseness and simple CFs.

$$
\begin{aligned}
T1 = {} & \sum_{i,j \in O \times H \cup H \times I} \left( 2 \log f_i'(s_i) + 2 \log y^j + \log \sum_{k \in O} \left(\frac{\partial y^k}{\partial y^i}\right)^{2} \right) \\
= {} & 2 \sum_{i \in O \cup H} \mbox{fan-in}(i) \log f_i'(s_i) + 2 \sum_{j \in H \cup I} \mbox{fan-out}(j) \log y^j \\
& {} + \sum_{i \in O \cup H} \mbox{fan-in}(i) \log \sum_{k \in O} \left(\frac{\partial y^k}{\partial y^i}\right)^{2}.
\end{aligned}
$$

$T1$ makes (1) unit activations decrease to zero in proportion to their fan-outs, (2) first-order derivatives of activation functions decrease to zero in proportion to their fan-ins, and (3) the influence of units on the output decrease to zero in proportion to the units' fan-ins. For a detailed analysis see Hochreiter and Schmidhuber (1997a).

$T1$ is the reason why low-complexity (or simple) CFs are preferred.
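The two forms of $T1$ given above can be checked against each other numerically. The following is a minimal sketch, not the authors' implementation. It assumes a fully connected 3-layer net without biases, logistic sigmoid hidden and output units, and strictly positive inputs so that $\log y^j$ is defined for input units. The names `t1_both_forms`, `W_hid` and `W_out` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t1_both_forms(x, W_hid, W_out):
    """Return T1 summed weight by weight and in fan-in/fan-out form."""
    n_in, n_hid, n_out = x.size, W_hid.shape[0], W_out.shape[0]

    # Forward pass (assumption: no biases, sigmoid units in both layers).
    y_hid = sigmoid(W_hid @ x)
    y_out = sigmoid(W_out @ y_hid)
    d_hid = y_hid * (1.0 - y_hid)  # f_i'(s_i) for hidden units
    d_out = y_out * (1.0 - y_out)  # f_i'(s_i) for output units

    # sum_k (dy^k / dy^i)^2 over output units k.
    # For an output unit i, dy^k/dy^i is 1 if k == i and 0 otherwise.
    infl_out = np.ones(n_out)
    # For a hidden unit i, dy^k/dy^i = f_k'(s_k) * w_ki.
    infl_hid = ((d_out[:, None] * W_out) ** 2).sum(axis=0)

    # First form: one summand per weight w_ij (i = target, j = source).
    pairwise = 0.0
    for i in range(n_out):          # weights in O x H
        for j in range(n_hid):
            pairwise += (2 * np.log(d_out[i]) + 2 * np.log(y_hid[j])
                         + np.log(infl_out[i]))
    for i in range(n_hid):          # weights in H x I
        for j in range(n_in):
            pairwise += (2 * np.log(d_hid[i]) + 2 * np.log(x[j])
                         + np.log(infl_hid[i]))

    # Second form: in a fully connected net, fan-in is n_hid for output
    # units and n_in for hidden units; fan-out is n_out for hidden units
    # and n_hid for input units.
    grouped = (
        2 * (n_hid * np.log(d_out).sum() + n_in * np.log(d_hid).sum())
        + 2 * (n_out * np.log(y_hid).sum() + n_hid * np.log(x).sum())
        + n_hid * np.log(infl_out).sum() + n_in * np.log(infl_hid).sum()
    )
    return pairwise, grouped

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=4)   # positive inputs (assumption)
W_hid = rng.normal(size=(3, 4))
W_out = rng.normal(size=(2, 3))
pairwise, grouped = t1_both_forms(x, W_hid, W_out)
```

Both values agree up to floating-point error, because every summand that depends only on the target unit $i$ is counted once per incoming weight (its fan-in), and every summand that depends only on the source unit $j$ is counted once per outgoing weight (its fan-out).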

Sparseness. Point (1) above favors sparse hidden unit activations (here: few active components); point (2) favors non-informative hidden unit activations hardly affected by small input changes. Point (3) favors sparse hidden unit activations in the sense that ``few hidden units contribute to producing the output''. In particular, sigmoid hidden units with activation function $\frac{1}{1+\exp(-x)}$ favor near-zero activations.
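One way to see the claim about sigmoid units is to compare the activation and derivative terms of $T1$ for a single logistic unit at different output values. The sketch below is illustrative only. It uses the identity $f'(s) = y(1-y)$ for the logistic function, and the helper `unit_cost` as well as the default fan-in and fan-out of 1 are assumptions made for the example.

```python
import math

def unit_cost(y, fan_in=1, fan_out=1):
    """Activation and derivative terms of T1 for one logistic unit with output y."""
    deriv = y * (1.0 - y)  # f'(s) for f(s) = 1 / (1 + exp(-s))
    return 2 * fan_in * math.log(deriv) + 2 * fan_out * math.log(y)

near_zero = unit_cost(0.01)  # unit almost switched off
near_one = unit_cost(0.99)   # unit saturated at the upper end
middle = unit_cost(0.5)      # unit in its most sensitive range
```

Under these assumptions `near_zero` is smaller than `near_one`, which in turn is smaller than `middle`. Saturation at either end shrinks the derivative term, but only a near-zero output also shrinks the $\log y^j$ term, so minimizing these terms pushes sigmoid units toward near-zero activations.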