## Maxout Networks

While researching for my master's thesis I tried to understand the paper by Goodfellow
et al. on *Maxout Units*. I found it very hard to understand the details
and thought a clear explanation in combination with a nice figure would be
really helpful. So this is my shot at providing one.

**Please note that nothing explained here was developed by me; this is
just an explanation of the paper by Goodfellow et al.**

### Key infos about *Maxout*

*Maxout*

- is an **activation function**, supposed to be **combined with** *dropout*
- **minimizes** the model averaging **approximation error** when using dropout
- is a **piecewise linear** approximation to an arbitrary convex function

### Definition

$$ h_{i}\left(x\right) = \max_{j \in \left[1,k\right]} z_{ij} \quad \text{where} \quad z_{ij} = x^{T}W_{\dots ij} + b_{ij} $$

| Symbol | Meaning |
| --- | --- |
| $ h $ | Maxout function |
| $ x $ | Input ($\in \mathbb{R}^{d}$) |
| $ W $ | 3D tensor of learned weights ($\in \mathbb{R}^{d\times m \times k}$) |
| $ d $ | Number of input units (length of $x$) |
| $ m $ | Number of units in each linear feature extractor (= number of *Maxout* units) |
| $ k $ | Number of linear feature extractors (linear pieces per unit; controls complexity) |
| $ b $ | Matrix of learned biases ($\in \mathbb{R}^{m\times k}$) |
| $ i $ | Runs over the number of Maxout units ($\in \left[1,m \right]$) |
| $ j $ | Runs over the number of feature extractors ($\in \left[1,k \right]$) |
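To make the definition concrete, here is a minimal NumPy sketch of a single *Maxout* layer's forward pass. The sizes and the random initialization are arbitrary, chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k = 4, 5, 3               # input size, Maxout units, linear pieces per unit
x = rng.normal(size=d)          # input vector
W = rng.normal(size=(d, m, k))  # 3D weight tensor
b = rng.normal(size=(m, k))     # bias matrix

# First part: affine feature extractors, z[i, j] = x . W[:, i, j] + b[i, j]
z = np.einsum('d,dij->ij', x, W) + b

# Second part: max-pooling over the k pieces of each unit
h = z.max(axis=1)               # shape (m,): one output per Maxout unit
```

Each of the $m$ Maxout units thus outputs the maximum of its $k$ affine features.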

### Illustration

Now, here is what a single layer with five *Maxout* units and three hidden linear
feature extractors looks like. Try hovering over the units with the mouse to better
see the connection scheme.

*“But wait, this looks more like at least two layers!”*

Yes indeed, this is very important and probably the most confusing part of *Maxout*. The
activation function is implemented using a small sub-network whose parameters
are learned as well (*“Did somebody say
‘Network in Network’?”*).

So if we don’t count the input layer, a single layer of *Maxout* units
actually consists of two layers itself. (Although I refer to the first layer
as the input layer, it doesn’t necessarily have to be the
very first layer of the whole network; it can also be the output of a previous
layer.)

Let’s call the first layer the *hidden* layer. It implements the linear part of
the *Maxout* units. It is a set of fully-connected layers (the columns in the image)
with no activation function (which is referred to as *affine*), thus each unit
in this layer just computes a weighted sum of all inputs, as defined in
the second part of the *Maxout* definition above:

$$ z_{ij} = x^{T}W_{\dots ij} + b_{ij} $$

Don’t get confused by the biases. In this definition they form a matrix; usually, however, biases are made implicit by appending an extra 1 to the input, so that the weight matrix is slightly bigger than for the regular inputs alone. So actually each bias is just an additional weight in $W$. Think of the bias matrix in this definition as one extra slice of the weight tensor.
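This bias-folding trick can be sketched like this (a toy NumPy example with arbitrary sizes, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 3, 2, 4
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))

# Explicit bias: z[i, j] = x . W[:, i, j] + b[i, j]
z_explicit = np.einsum('d,dij->ij', x, W) + b

# Folded bias: append a constant 1 to the input and stack b as an extra
# weight slice, so the bias becomes just another weight in the tensor.
x_aug = np.append(x, 1.0)                    # shape (d+1,)
W_aug = np.concatenate([W, b[None, :, :]])   # shape (d+1, m, k)
z_folded = np.einsum('d,dij->ij', x_aug, W_aug)

assert np.allclose(z_explicit, z_folded)     # both give the same activations
```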

Now the three-dimensional tensor $W$ contains the weights of this first part. The
dots in the equation mean that all elements of the first dimension are taken,
like `W[:, i, j]` in Python or `W(:, i, j)` in Matlab.
Consequently, $W_{\dots ij}$ is the weight vector of the unit in row $i$ and
column $j$.
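As a quick shape check (arbitrary toy sizes, values irrelevant):

```python
import numpy as np

d, m, k = 4, 5, 3
W = np.zeros((d, m, k))   # weight tensor as defined above

w_ij = W[:, 1, 2]         # weight vector of the unit in row 1, column 2
print(w_ij.shape)         # (4,) -- one weight per input unit
```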

In the figure above the units in this first part are arranged in a two-dimensional
grid. The first dimension of this grid (the number of rows, $m$) doesn’t have to match
the number of input units; both $m$ and the second dimension $k$
(the number of columns) are hyperparameters, which are chosen when
the whole architecture is defined. These two parameters control the size and complexity of
the *Maxout* activation function: the higher $k$, the more accurately
any convex function can be approximated. Basically, each column of units in this
first part computes an affine transformation of the input.

The second part is much easier. It is just doing max-pooling over each row of the first part, i.e. taking the maximum of the output along each row.
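In NumPy, assuming the activations of the first part are stored as an $m \times k$ matrix `z` (toy values here), this pooling step is a one-liner:

```python
import numpy as np

# Activations z for m = 2 Maxout units with k = 3 linear pieces each
z = np.array([[0.2, -1.0, 0.7],
              [1.5,  0.3, -0.4]])

h = z.max(axis=1)  # maximum along each row
print(h)           # [0.7 1.5]
```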

### A simple example

Consider the function $f\left(x\right)=x^{2}$.

We can approximate this function with a single *Maxout* unit ($m=1$) that uses three
linear pieces ($k=3$), i.e. three hidden units.

This *Maxout* unit would look like this (biases included this time):

Each hidden unit calculates:

$$ z_{j} = w_{j}x + b_{j} $$

This is a simple linear function. The max-pooling unit then takes the maximum of these three linear functions.

Take a look at this picture. It shows the $x^{2}$ function and three linear
functions that could be learned by the *Maxout* unit.

Finally, try imagining what this would look like with 4, 5, 6 or an arbitrary number of linear functions. That’s right: a nice approximation that is linear everywhere, except at the connection points of the linear pieces.
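To put numbers on this, here is a toy sketch where the three linear pieces are the tangent lines of $x^{2}$ at the hand-picked points $-1$, $0$ and $1$ (in a real network these pieces would of course be learned, not chosen by hand):

```python
import numpy as np

# Tangent line of f(x) = x^2 at point a:  y = 2*a*x - a^2
tangent_points = [-1.0, 0.0, 1.0]

def maxout_approx(x):
    # one Maxout unit with k = 3 hand-picked linear pieces
    return max(2 * a * x - a * a for a in tangent_points)

xs = np.linspace(-1.0, 1.0, 201)
errors = [x * x - maxout_approx(x) for x in xs]

# The approximation touches x^2 at the tangent points and lies below it
# everywhere else; the worst-case error on [-1, 1] is about 0.25 (at x = +-0.5)
print(max(errors))
```

Adding more tangent points shrinks this gap further, which is exactly the "higher $k$, better approximation" trade-off.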

### Where is *Dropout* in all this?

*Dropout* is a regularization mechanism. It simulates training a bag of
networks with different architectures. However, this bag of networks contains
only those networks that can be created by dropping an arbitrary number of
units from the original network. In practice this is implemented by randomly
dropping neurons with a certain probability during training.
As a result, a somewhat different network is trained in each training pass. However,
and this is important, all of these networks share the same weights, because
in reality there is only ever a single network, and it is the weights of this
single network that are used all the time.

When applying *bagging*, i.e. using $n$ models for prediction rather than just
one, the predictions of all models need to be combined, which can easily be
done by averaging them. In the *dropout* case this cannot be done, because we
actually have only one model. (Well, it could be done, but it makes no sense,
since the number of possible sub-networks is far too big.)
Instead, a much better approach is to simply scale the weights of the whole network
in proportion to the drop probability. The issue is that this weight scaling is only
exact for a single linear layer with the *Softmax* function applied; for deeper
models that use non-linear activation functions it is no longer accurate.
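For a single linear layer the weight-scaling rule is indeed exact, which can be checked numerically. A toy NumPy sketch (the sizes are arbitrary; with a keep probability of 0.5 every dropout mask is equally likely, so the plain mean over all masks is the exact model average):

```python
import numpy as np

rng = np.random.default_rng(42)
p_keep = 0.5                       # probability of keeping a unit
x = rng.normal(size=10)
w = rng.normal(size=10)

# Training: each input unit is dropped independently with probability 1 - p_keep
mask = rng.random(10) < p_keep
y_train = (x * mask) @ w

# Prediction: instead of averaging over all 2^10 masked sub-networks,
# scale the weights by p_keep
y_test = x @ (w * p_keep)

# Exact average over all 2^10 equally likely masks
masks = ((np.arange(1024)[:, None] >> np.arange(10)) & 1).astype(float)
y_avg = np.mean([(x * m) @ w for m in masks])

assert np.isclose(y_test, y_avg)   # weight scaling equals exact model averaging
```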

And that is why *Maxout* works, as we have seen above: dropout can be applied to
the first, affine part of *Maxout*, and the model averaging is still accurate, because
no non-linearities are involved there. Pretty clever.