# The Perceptron

The perceptron is one of the earliest supervised classification algorithms in machine learning, introduced by Frank Rosenblatt in 1958 [1]. The perceptron was inspired by neurons in the brain: the inputs, which may come from the environment, resemble the dendrites of a neuron, and the output resembles the axon. Mathematically, the output is a weighted sum of the inputs; the weights are also called synaptic weights. The perceptron algorithm seeks the weights that best classify the input data.

Fig. 1 Illustration of a single perceptron. Image source: http://www.codeproject.com/KB/recipes/NeuralNetwork_1/NN2.png

For a binary classifier (i.e. a single perceptron), the relationship between the input $\mathbf{x} = \left[x_1, x_2, \cdots, x_d \right]^T \in \mathbb{R}^d$ and the output $y$ in terms of the weights is:

$y = \sum_{j=1}^d w_jx_j + w_0$

where $w_0$ is the intercept value to make the model more general. This can also be written as a dot product:

$y = \mathbf{w}^T \mathbf{x}$

where $\mathbf{x} \triangleq \left[1, x_1, x_2, \cdots, x_d \right]^T$ and $\mathbf{w} \triangleq \left[ w_0, w_1, w_2, \cdots, w_d \right]^T$. We can use this equation to divide the input space into two regions. Let $s(\cdot)$ be the threshold function $s(a) = \begin{cases} 1 &\mbox{if } a > 0 \\ 0 & \mbox{otherwise} \end{cases}$; then we $\mbox{choose} \begin{cases} C_1 & \mbox{if } s\left(\mathbf{w}^T \mathbf{x} \right) = 1 \\ C_2 & \mbox{otherwise.} \end{cases}$
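The decision rule above can be sketched in a few lines of Python. The weight values below are hypothetical, chosen only to illustrate the thresholding; both vectors include the leading 1 and $w_0$ for the bias.

```python
import numpy as np

def perceptron_predict(w, x):
    """Classify x with the threshold rule: choose C1 if w^T x > 0, else C2.

    Both vectors include the bias term: w = [w0, w1, ..., wd],
    x = [1, x1, ..., xd].
    """
    return "C1" if w @ x > 0 else "C2"

w = np.array([-0.5, 1.0, 1.0])                           # hypothetical weights
print(perceptron_predict(w, np.array([1.0, 1.0, 0.0])))  # C1
print(perceptron_predict(w, np.array([1.0, 0.0, 0.0])))  # C2
```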

Here, it is assumed that a hyperplane $\mathbf{w}^T\mathbf{x} = 0$ can be found that separates $\mathbf{x}^t \in \mathcal{C}_1$ from $\mathbf{x}^t \in \mathcal{C}_2$. Using a smoother threshold makes the perceptron less sensitive to abrupt changes in the input: $\begin{array}{rcl} o & = & \mathbf{w}^T \mathbf{x} \\ y & = & \mbox{sigmoid}\left( o\right) = \frac{1}{1+\exp\left(-\mathbf{w}^T \mathbf{x} \right)} \end{array}$ The sigmoid output can also be viewed as the activation probability of the output given the input and the weights.

When there are $K > 2$ classes, there are $K$ perceptrons, each with its own weight vector $\mathbf{w}_i$: $\begin{array}{rcl} y_i &=& \sum_{j=1}^d w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x} \\ \mathbf{y} &=& \mathbf{W}\mathbf{x} \end{array}$ where $w_{ij}$ is the weight from input $x_j$ to output $y_i$, and $\mathbf{W}$ is the $K \times (d+1)$ weight matrix whose rows are the weight vectors of the $K$ perceptrons. When used for classification, during testing we $\mbox{choose } C_i \mbox{ if } y_i = \max_k y_k$, and the output units using softmax are $\begin{array}{rcl} o_i &=& \mathbf{w}_i^T \mathbf{x} \\ y_i &=& \frac{\exp\left(o_i\right)}{\sum_k\exp\left(o_k\right)}. \end{array}$
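The $K$-class forward pass with softmax outputs can be sketched as follows. The weight matrix and input here are hypothetical, just to show the shapes: $\mathbf{W}$ is $K \times (d+1)$ and $\mathbf{x}$ carries a leading 1 for the bias.

```python
import numpy as np

def softmax_outputs(W, x):
    """Compute y_i = exp(o_i) / sum_k exp(o_k) with o = W x.

    W is K x (d+1); x includes the leading 1 for the bias term.
    """
    o = W @ x
    e = np.exp(o - o.max())   # subtract the max for numerical stability
    return e / e.sum()

# hypothetical 3-class problem with d = 2 inputs
W = np.array([[0.1,  1.0, -1.0],
              [0.0, -1.0,  1.0],
              [0.2,  0.5,  0.5]])
x = np.array([1.0, 2.0, 1.0])
y = softmax_outputs(W, x)
print(np.argmax(y))   # index of the chosen class C_i
```

The outputs sum to one, so they can be read as class probabilities; choosing $C_i$ with $y_i = \max_k y_k$ is just `np.argmax(y)`.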

In online learning, we do not write the error function over the whole sample but on individual instances. Starting from random initial weights, at each iteration we adjust the parameters a little bit to minimize error, without forgetting what we have previously learned. If this error function is differentiable, we can use gradient descent.

For example, in regression the error on the single instance pair with index $t$, $\left( \mathbf{x}^t, r^t \right)$, is $E^t \left( \mathbf{w} | \mathbf{x}^t, r^t \right) = \frac{1}{2} \left(r^t - y^t\right)^2 = \frac{1}{2} \left[r^t - \mathbf{w}^T \mathbf{x}^t \right]^2$ and for $j = 0, \cdots, d$, the online update rule is $\Delta w_j^t = \eta \left(r^t - y^t \right)x_j^t$ where $\eta$ is the learning factor, which is gradually decreased over time for convergence. This is known as stochastic gradient descent.
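One such update can be sketched directly from the rule $\Delta w_j^t = \eta (r^t - y^t) x_j^t$. The instance and learning factor below are hypothetical; the point is that a single step moves $\mathbf{w}$ so that the squared error on that instance shrinks.

```python
import numpy as np

def sgd_regression_step(w, x, r, eta):
    """One stochastic gradient descent step for the squared error
    E = 0.5 * (r - y)^2 with y = w^T x.

    x includes the leading 1, so w_0 is updated like any other weight.
    """
    y = w @ x
    return w + eta * (r - y) * x

# hypothetical single instance (x^t, r^t)
x = np.array([1.0, 2.0, -1.0])
r = 1.0
w = sgd_regression_step(np.zeros(3), x, r, eta=0.1)
print(w)   # weights move to reduce the error on this instance
```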

Similarly, update rules can be derived for classification problems using logistic discrimination, where updates are done after each pattern. With two classes, for the single instance $\left(\mathbf{x}^t, r^t \right)$ where $r^t = 1$ if $\mathbf{x}^t \in \mathcal{C}_1$ and $r^t = 0$ if $\mathbf{x}^t \in \mathcal{C}_2$, the single output is $y^t = \mbox{sigmoid} \left( \mathbf{w}^T \mathbf{x}^t \right)$ and the cross-entropy is $E^t \left( \mathbf{w} | \mathbf{x}^t, r^t \right) = -r^t \log y^t - \left(1-r^t \right) \log \left( 1-y^t \right).$ Note that this is the negative log likelihood of i.i.d. Bernoulli variables. Using gradient descent, we get the following online update rule for $j = 0, \cdots, d$: $\Delta w_j^t = \eta \left(r^t - y^t \right)x_j^t.$ When there are $K>2$ classes, the outputs are $y_i^t = \frac{\exp \left( \mathbf{w}_i^T \mathbf{x}^t \right)}{ \sum_k \exp \left( \mathbf{w}_k^T \mathbf{x}^t \right)}$ and the cross-entropy is $E^t \left( \left\{ \mathbf{w}_i \right\} | \mathbf{x}^t, \mathbf{r}^t \right) = - \sum_i r_i^t \log y_i^t.$ Using gradient descent, we get the following online update rule, for $i = 1, \cdots, K$ and $j = 0, \cdots, d$: $\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t \right)x_j^t.$
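The $K$-class update $\Delta w_{ij}^t = \eta (r_i^t - y_i^t) x_j^t$ is a rank-one change to the whole weight matrix and can be sketched as an outer product. The instance, class count, and learning factor below are hypothetical.

```python
import numpy as np

def softmax_update(W, x, r, eta):
    """One online update for K-class logistic discrimination.

    Delta w_ij = eta * (r_i - y_i) * x_j, where y is the softmax output.
    W is K x (d+1); x includes the leading 1; r is a one-hot target vector.
    """
    o = W @ x
    e = np.exp(o - o.max())
    y = e / e.sum()
    return W + eta * np.outer(r - y, x)

# hypothetical 3-class instance belonging to the second class
x = np.array([1.0, 0.5, -0.5])
r = np.array([0.0, 1.0, 0.0])
W = softmax_update(np.zeros((3, 3)), x, r, eta=0.5)
print(W)
```

After one step, the output for the correct class rises above the uniform $1/K$ it started at, i.e. the cross-entropy on this instance has decreased.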

## Example

Suppose we want to classify the 2-D binary data of the OR problem shown in Fig. 2. Our aim is to find a line that separates the two classes.

Fig. 2 Illustration of the OR problem. Blue dots indicate 1 and red dots indicate 0 for the 2-D binary data.

We then apply the perceptron update rule for this purpose. In Fig. 3, we show the separating lines before and after convergence. Note that the dashed lines (before convergence) fail to separate the two classes, while the solid line (after convergence) separates them successfully. The perceptron algorithm stops as soon as there is no error on the training data set; it does not care about the location or orientation of the line. The Python script for these results can be downloaded here.
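The training loop can be sketched as follows (a minimal reconstruction under the threshold-output update rule, not the original downloadable script). Each input carries a leading 1 for the bias, and training stops, as described above, once every point is classified correctly.

```python
import numpy as np

# OR data: inputs with a leading 1 for the bias, targets r in {0, 1}
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
R = np.array([0.0, 1.0, 1.0, 1.0])

def train_perceptron(X, R, eta=0.1, max_epochs=100):
    """Apply w += eta * (r - y) * x after each misclassified pattern,
    stopping as soon as a full pass over the data produces no errors."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, r in zip(X, R):
            y = 1.0 if w @ x > 0 else 0.0
            if y != r:
                w += eta * (r - y) * x
                errors += 1
        if errors == 0:   # training set is separated: stop
            break
    return w

w = train_perceptron(X, R)
print(w)   # one of many separating lines; the algorithm keeps the first it finds
```

Since OR is linearly separable, the loop is guaranteed to terminate; on XOR (Fig. 4) the error count never reaches zero and the loop only stops at `max_epochs`.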

Fig. 3 Illustration of perceptron classification for the OR problem. Dashed lines indicate the classifier before convergence and the solid line is the resulting classifier.

Support vector machines (SVMs), in fact, use a similar idea, but they do not stop updating until they find the line that maximizes the distance (margin) to each class. Now you can try running an SVM and compare its discriminant with the one from the perceptron.

The perceptron does a really good job separating the classes in OR but, as we discussed in the theory, it can only separate classes that are linearly separable. Fig. 4 illustrates a case where the perceptron fails.

Fig. 4 Illustration of perceptron classification for the XOR problem. The perceptron does not work in this case since the classes are not linearly separable.

[1] Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 65, 386.
