Maxout Networks

While researching for my master's thesis I tried to understand the paper by Goodfellow et al. on Maxout units. I found it very hard to understand the details and thought that a clear explanation combined with a nice figure would be really helpful. So this is my shot at it.

Please note that nothing explained here was developed by me; it is just an explanation of the paper by Goodfellow et al.

Key facts about Maxout


$$ \begin{aligned} h_{i} \left( x \right) &= \max_{j\in\left[1,k\right]}\left(z_{ij}\right) \\ z_{ij} &= x^{T} W_{\dots ij} + b_{ij} \\ \end{aligned} $$
$h$: Maxout function
$x$: Input ($\in \mathbb{R}^{d}$)
$W$: 3D tensor of learned weights ($\in \mathbb{R}^{d\times m \times k}$)
$d$: Number of input units (length of $x$)
$m$: Number of Maxout units (equal to the number of units in each linear feature extractor)
$k$: Number of linear feature extractors per Maxout unit (controls the complexity of the activation)
$b$: Matrix of learned biases ($\in \mathbb{R}^{m\times k}$)
$i$: Runs over the number of Maxout units ($\in \left[1,m \right]$)
$j$: Runs over the number of feature extractors ($\in \left[1,k \right]$)
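
To make the notation concrete, here is a minimal NumPy sketch of the definition above. The shapes follow the table; the function and variable names are mine, and the parameters are random rather than learned:

```python
import numpy as np

def maxout(x, W, b):
    """Direct translation of the Maxout definition.

    x: input vector, shape (d,)
    W: weight tensor, shape (d, m, k)
    b: bias matrix,  shape (m, k)
    Returns the output of the m Maxout units, shape (m,).
    """
    d, m, k = W.shape
    h = np.empty(m)
    for i in range(m):
        # z_ij = x^T W[:, i, j] + b[i, j]  for all j in [1, k]
        z_i = np.array([x @ W[:, i, j] + b[i, j] for j in range(k)])
        # h_i(x) = maximum over the k linear feature extractors
        h[i] = z_i.max()
    return h

# Tiny usage example with random (untrained) parameters:
d, m, k = 4, 5, 3                     # 5 Maxout units with 3 linear pieces each
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))
print(maxout(x, W, b).shape)          # -> (5,)
```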


Now, here is what a single layer with five Maxout units and three hidden linear feature extractors looks like. Try hovering over the units with the mouse to better see the connection scheme.

“But wait, this looks more like at least two layers!”

Yes indeed, this is very important and probably the most confusing thing about Maxout. The activation function is implemented using a small sub-network whose parameters are learned as well (“Did somebody say ‘Network in Network’?”).

So if we don’t count the input layer, a single layer of Maxout units actually consists of two layers itself (although I referred to the first layer as the input layer, this doesn’t necessarily mean that it is the very first layer of the whole network; it can also be the output of a previous layer).

Let’s call the first layer the hidden layer. It implements the linear part of the Maxout units. It is a set of fully-connected layers (the columns in the image) with no activation function, which is referred to as affine. Thus each unit in this layer just computes a weighted sum of all inputs, as defined in the second line of the Maxout definition above:

$$ z_{ij} = x^{T} W_{\dots ij} + b_{ij} $$

Don’t get confused by the biases. In this definition they form a matrix; however, biases are usually made implicit by appending an additional 1 to the input, so that the weight matrix is slightly larger than it would be for the regular inputs alone. Seen that way, each bias is just an additional weight in $W$, and the bias matrix of this definition corresponds to one slice of that enlarged weight tensor.

Now, the three-dimensional tensor $W$ contains the weights of this first part. The dots in the equation mean that all elements of the first dimension are taken, like W[:, i, j] in Python or W(:, i, j) in Matlab. Consequently, $W_{\dots ij}$ is the weight vector of the unit in row $i$ and column $j$.

In the figure above the units of this first part are arranged in a two-dimensional grid. The first dimension of this grid (the number of rows, $m$) doesn’t have to match the number of input units; both $m$ and the second dimension $k$ (the number of columns) are hyperparameters, which are chosen when the whole architecture is defined. The parameter $k$ controls the complexity of the Maxout activation function: the higher $k$, the more accurately any convex function can be approximated. Basically, each column of units in this first part performs a linear regression of the input.

The second part is much easier. It just does max-pooling over each row of the first part, i.e. it takes the maximum of the outputs along each row.
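
Putting the two parts together, here is a vectorized sketch of the same computation (again assuming NumPy; the function name is mine):

```python
import numpy as np

def maxout_layer(x, W, b):
    """Same computation as above, written as the two parts described in the text.

    Part 1 (affine hidden layer): z has shape (m, k), the grid of hidden units.
    Part 2 (max-pooling):         take the maximum along each row of that grid.
    """
    z = np.einsum('d,dmk->mk', x, W) + b   # z[i, j] = x^T W[:, i, j] + b[i, j]
    return z.max(axis=1)                   # maximum over the k columns of each row
```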

A simple example

Consider the function $f\left(x\right)=x^{2}$.

We can approximate this function with a single Maxout unit that uses three linear pieces ($k=3$), i.e. three hidden units.

This Maxout unit would look like this (biases included this time):

Each hidden unit calculates:

$$ z_{j} = w_{j} x + b_{j} $$

This is a simple linear function of the scalar input $x$. The max-pooling unit then takes the maximum of these three linear functions.

Take a look at this picture. It shows the $x^{2}$ function and three linear functions that could be learned by the Maxout unit.

Approximation using three linear functions

Finally, try to imagine what this would look like with 4, 5, 6 or an arbitrary number of linear functions. That’s right: it would be a nice approximation that is linear everywhere except at the points where the linear pieces connect.
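
As a concrete instance of this example, here is a small NumPy sketch with hand-picked rather than learned parameters: the three tangent lines of $x^{2}$ at $x=-1$, $0$ and $1$.

```python
import numpy as np

# Hand-picked (not learned) parameters for a single Maxout unit with d = 1, m = 1, k = 3:
# the tangent lines of x^2 at x = -1, 0 and 1, i.e. y = -2x - 1, y = 0 and y = 2x - 1.
W = np.array([-2.0, 0.0, 2.0]).reshape(1, 1, 3)   # shape (d, m, k)
b = np.array([-1.0, 0.0, -1.0]).reshape(1, 3)     # shape (m, k)

def maxout_unit(x):
    z = x * W[0, 0, :] + b[0, :]   # the three linear pieces z_j = w_j * x + b_j
    return z.max()                 # the Maxout unit returns their maximum

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}   x^2 = {x**2:.2f}   maxout = {maxout_unit(x):.2f}")
# The approximation is exact at the tangent points x = -1, 0, 1
# and a lower bound on x^2 everywhere else.
```

With a larger $k$ the gap between the piecewise-linear maximum and $x^{2}$ shrinks further.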

Where is Dropout in all this?

Dropout is a regularization mechanism. It simulates the training of a bag of networks with different architectures; however, this bag contains only those networks that can be created by dropping an arbitrary number of units from the original network. In practice this is implemented by randomly dropping neurons with a certain probability during training. As a result, a somewhat different network is trained in each training pass. However (and this is important) all of these networks share the same weights, because in reality only a single network exists and its weights are used all the time.

When applying bagging, i.e. using $n$ models for prediction rather than just one, the predictions of all models have to be combined, which can easily be done by calculating their mean. In the dropout case this cannot be done, because we actually have only one model (ok, it could be done by sampling sub-networks, but that doesn’t make sense, since the number of possible sub-networks is way too big). Instead, a much cheaper approach is to use the full network and scale its weights by the retention probability, i.e. one minus the drop probability. The issue is that this weight scaling is only exact for a single linear layer followed by the softmax. Thus, for deeper models that use a non-linear activation function it is not accurate anymore.
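
To make the weight-scaling trick concrete, here is a minimal sketch for a single linear layer (assuming NumPy; the 0.5 drop probability and all names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                          # probability of dropping an input unit

def train_forward(x, W):
    # During training: randomly drop inputs (set them to zero).
    mask = rng.random(x.shape) >= p_drop
    return (x * mask) @ W

def test_forward(x, W):
    # At test time: keep all units, but scale the weights by the
    # retention probability to approximate averaging over all masks.
    return x @ (W * (1.0 - p_drop))
```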

And this is where Maxout comes in, as we have seen above: dropout can be applied to the affine first part of Maxout, and the model averaging stays accurate because there are no non-linearities involved. Pretty clever.