The perceptron is one of the earliest supervised classification algorithms in machine learning, introduced by Frank Rosenblatt in 1958 [1]. The idea of the perceptron was inspired by neurons in the brain: the inputs of the perceptron, which may come from the environment, resemble the dendrites of a neuron, and the output resembles the axon. Mathematically, the output is a weighted sum of the inputs; these weights are also called synaptic weights. The perceptron algorithm seeks to find the weights that best classify the input data.

For a binary classifier (i.e. a single perceptron), the relationship between the input and the output in terms of the weights is

$$y = \sum_{j=1}^{d} w_j x_j + w_0,$$

where $w_0$ is the intercept value to make the model more general. This can also be written as a dot product:

$$y = \mathbf{w}^T \mathbf{x},$$

where $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$ and $\mathbf{x} = [1, x_1, \dots, x_d]^T$. We can use this equation to divide the input space into two. Let $s(\cdot)$ be a threshold function such that

$$s(a) = \begin{cases} 1 & \text{if } a > 0, \\ 0 & \text{otherwise,} \end{cases}$$

then we can

choose $C_1$ if $s(\mathbf{w}^T \mathbf{x}) = 1$, and $C_2$ otherwise.

Here, it is assumed that a hyperplane can be found that separates $C_1$ and $C_2$. Using a smoother threshold will make the perceptron less sensitive to abrupt changes in the input:

$$y = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}.$$

This sigmoid function can also be viewed as the activation probability of the output given the input and the weights. When there are $K > 2$ outputs, there are $K$ perceptrons, each of which has a weight vector $\mathbf{w}_i$, where $w_{ij}$ is the weight from input $x_j$ to output $y_i$. $\mathbf{W}$ is the $K \times (d+1)$ weight matrix whose rows are the weight vectors of the $K$ perceptrons. When used for classification, during testing, we

choose $C_i$ if $y_i = \max_k y_k$,

and the output units using softmax are

$$y_i = \frac{\exp(\mathbf{w}_i^T \mathbf{x})}{\sum_k \exp(\mathbf{w}_k^T \mathbf{x})}.$$
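As a concrete sketch, the sigmoid and softmax outputs above can be computed with NumPy; the input and weight values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(a):
    # smooth threshold: maps the activation w^T x into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def softmax_outputs(W, x):
    # W: K x (d+1) weight matrix, one row per perceptron
    # x: input vector augmented with a leading 1 for the intercept
    a = W @ x                    # K linear activations w_i^T x
    e = np.exp(a - a.max())      # subtract the max for numerical stability
    return e / e.sum()           # softmax outputs sum to 1

x = np.array([1.0, 0.5, -0.2])        # [1, x1, x2], made-up input
W = np.array([[0.1, 0.4, -0.3],       # made-up weights for K = 2 outputs
              [0.0, -0.2, 0.5]])
y = softmax_outputs(W, x)             # choose the class with the largest y_i
```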

In online learning, we do not write the error function over the whole sample but on individual instances. Starting from random initial weights, at each iteration we adjust the parameters a little bit to minimize error, without forgetting what we have previously learned. If this error function is differentiable, we can use gradient descent.

For example, in regression the error on the single instance pair with index $t$, $(\mathbf{x}^t, r^t)$, is

$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left(r^t - \mathbf{w}^T \mathbf{x}^t\right)^2,$$

and for $j = 0, \dots, d$, the online update rule is

$$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t,$$

where $\eta$ is the learning factor, which is gradually decreased in time for convergence. This is known as stochastic gradient descent.
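A single online update of this kind can be sketched as follows (the instance values and learning factor below are made up):

```python
import numpy as np

def sgd_regression_step(w, x_t, r_t, eta):
    # y^t = w^T x^t; squared error E = 0.5 * (r^t - y^t)^2
    y_t = w @ x_t
    # gradient descent on E gives the update Δw_j = η (r^t - y^t) x_j^t
    return w + eta * (r_t - y_t) * x_t

# one illustrative step on a single instance, starting from zero weights
w = np.zeros(3)
x_t = np.array([1.0, 2.0, -1.0])   # leading 1 is the bias input
w_new = sgd_regression_step(w, x_t, r_t=1.0, eta=0.1)
```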

Similarly, update rules can be derived for classification problems using logistic discrimination, where updates are done after each pattern. With two classes, for the single instance $(\mathbf{x}^t, r^t)$ where $r^t = 1$ if $\mathbf{x}^t \in C_1$ and $r^t = 0$ if $\mathbf{x}^t \in C_2$, the single output is

$$y^t = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x}^t)$$

and the cross-entropy is

$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - \left(1 - r^t\right) \log\left(1 - y^t\right).$$

Note that this is the negative log likelihood of i.i.d. Bernoulli variables. Using gradient descent, we get the following online update rule:

$$\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t.$$

When there are $K > 2$ classes, the outputs are

$$y_i^t = \frac{\exp(\mathbf{w}_i^T \mathbf{x}^t)}{\sum_k \exp(\mathbf{w}_k^T \mathbf{x}^t)}$$

and the cross-entropy is

$$E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t.$$

Using gradient descent, we get the following online update rule, for $i = 1, \dots, K$, $j = 0, \dots, d$:

$$\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t.$$
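For the two-class case, one online logistic update has the same form as the regression step, only with a sigmoid output. A minimal sketch (instance values and learning factor are made up):

```python
import numpy as np

def logistic_online_step(w, x_t, r_t, eta):
    # y^t = sigmoid(w^T x^t); the cross-entropy gradient yields
    # the same update form as regression: Δw_j = η (r^t - y^t) x_j^t
    y_t = 1.0 / (1.0 + np.exp(-(w @ x_t)))
    return w + eta * (r_t - y_t) * x_t

# one illustrative step starting from zero weights (so y^t = 0.5)
w = np.zeros(3)
x_t = np.array([1.0, 0.5, -1.5])   # leading 1 is the bias input
w_new = logistic_online_step(w, x_t, r_t=1.0, eta=0.5)
```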

**Example**

Suppose we want to classify the 2-D binary data of the OR function, as shown in Fig. 2. Our aim is to find a line that separates the two classes.

We then apply the perceptron update rule for this purpose. In Fig. 3, we show the separating lines during training and after convergence. Note that the dashed lines (before convergence) fail to separate the two classes, while the solid line (after convergence) separates them successfully. The perceptron algorithm stops as soon as there is no error on the training data set; it does not care about the location or orientation of the resulting line. The Python script for these results can be downloaded here.

Fig 3. Illustration of perceptron classification for the OR function. Dashed lines indicate the classifier before convergence, and the solid line is the resulting classifier.
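The training loop described above can be sketched in NumPy as follows; the learning factor of 0.5 and zero initial weights are arbitrary choices for illustration, not taken from the original script:

```python
import numpy as np

# OR data: each input has a leading 1 for the bias, targets are 0/1
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
r = np.array([0, 1, 1, 1], dtype=float)

def train_perceptron(X, r, eta=0.5, max_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_t, r_t in zip(X, r):
            y_t = 1.0 if w @ x_t > 0 else 0.0    # hard threshold s(w^T x)
            if y_t != r_t:
                w += eta * (r_t - y_t) * x_t     # perceptron update
                errors += 1
        if errors == 0:                          # stop at zero training error
            break
    return w

w = train_perceptron(X, r)
```

Because OR is linearly separable, the loop reaches zero training error and stops, matching the behavior described above.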

Support vector machines (SVMs), in fact, use a similar idea, but they do not stop the update rule until they find the line that maximizes the distance (margin) to each class. Now you can try running an SVM and compare its discriminant with the one found by the perceptron.

The perceptron does a really good job separating the classes in OR, but as we discussed in theory, it can only separate classes that are linearly separable. Fig. 4 illustrates a case where the perceptron fails.
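One classic non-separable case is the XOR function (this is an assumption on my part; the original Fig. 4 may show a different example). Since no line separates the two classes, the same update rule never reaches zero training error, no matter how long it runs:

```python
import numpy as np

# XOR data: no single line in the plane separates the two classes
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
r = np.array([0, 1, 1, 0], dtype=float)

w = np.zeros(3)
for _ in range(1000):                           # same update rule as for OR
    for x_t, r_t in zip(X, r):
        y_t = 1.0 if w @ x_t > 0 else 0.0
        w += 0.5 * (r_t - y_t) * x_t

# whatever line was learned, it misclassifies at least one point
errors = sum((1.0 if w @ x_t > 0 else 0.0) != r_t for x_t, r_t in zip(X, r))
```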

[1] Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 65, 386.