Back-propagation learning for multi-layer feedforward neural network

Below are the objectives of this post:

  1. What is a multi-layer feed-forward neural network
  2. Discuss the back-propagation algorithm used to train it
  3. Implement what we discuss in Python to gain a better understanding
  4. Run the implementation on a binary classification use-case to get a practical perspective

A multi-layer feed-forward neural network consists of multiple layers of artificial neurons. As the name suggests, each layer acts as input to the layer after it, hence "feed-forward". At the same time, the weights of the units in the hidden layers are learnt by propagating the output error backwards through the network, hence "back-propagation" learning.

For this discussion let us consider the neural network shown in Figure 1. It has three neural units in each layer, and b1, b2, b3 are included as bias units (with their input fixed to 1). Figure 1 has one input layer, one output layer (layer L) and two hidden layers (L-1 and L-2).

Figure 1: A multi-layer feed-forward neural network with one input layer, two hidden layers and one output layer.
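To make the structure concrete, here is a minimal sketch of how the network in Figure 1 can be represented as one weight matrix per layer, with each unit's bias absorbed as an extra column; it mirrors the weight initialization used in the implementation later in this post (the variable names here are illustrative).

import numpy as np

#Layer sizes for the network in Figure 1: input layer, two hidden layers, output layer
listOfNeuronsEachLayer = [3, 3, 3, 3]

#One weight matrix per non-input layer; each row holds the three incoming weights
#of one unit plus its bias (hence the "+1")
weights = {}
for iLayer in range(1, len(listOfNeuronsEachLayer)):
    weights[iLayer + 1] = np.random.random((listOfNeuronsEachLayer[iLayer], listOfNeuronsEachLayer[iLayer - 1] + 1))

print({layer: W.shape for layer, W in weights.items()})   #{2: (3, 4), 3: (3, 4), 4: (3, 4)}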

Below are the two high-level steps in building a multi-layer feed-forward neural network model. These steps are executed iteratively:

  1. Feed-forward: Data from the input layer is fed forward through each layer and the output is generated in the final layer.
  2. Back-propagation: This is the learning step. Once the output is generated, the error with respect to the expected output is calculated, and this error is then propagated backwards to adjust the weights so that the error is reduced.

Let us discuss each of the steps in detail:

  1. Feed-forward: At a high level, two steps are executed at every unit during the feed-forward pass:
    1. The weighted sum of all inputs (including the bias) is calculated. Figure 2 shows a detailed view of unit $x_1^4$: $z_1^4$ is the input to $x_1^4$ and is the weighted sum of the three activations (non-linearized outputs) and the bias of the previous layer, i.e. $z_1^4 = w_{11}^4 a_1^3 + w_{12}^4 a_2^3 + w_{13}^4 a_3^3 + b_1^4$.
    2. The input calculated in #1 above is then non-linearized, i.e. $a_1^4 = \sigma(z_1^4) = \frac{1}{1 + e^{-z_1^4}}$, where $a_1^4$ is the output of the unit as shown in Figure 2. Similarly, each unit receives the inputs from the previous layer and passes its non-linearized output to the next layer. The non-linear (activation) functions are chosen such that they are differentiable (we will answer 'why?' while talking about back-propagation). For generalization, keep the following notation in mind: $a_j^l$ is the activation of the jth unit of layer l, $w_{jk}^l$ is the weight from the kth unit of layer l-1 to the jth unit of layer l, $b_j^l$ is the bias of the jth unit of layer l, and $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$ so that $a_j^l = \sigma(z_j^l)$. (A short code sketch of this per-layer computation appears after the derivation below.)
  2. Back-propagation learning: The overall objective of back-propagation learning is to learn the combination of weights which minimizes the output error. The error between the actual output and the desired output is reduced in each iteration by a supervised adjustment of the weights. We cannot visualize the error surface over all the weights at once, but for an intuitive understanding Figure 3 shows the error E as a function of just two weights; the objective is to learn weights for which E is minimum (as pointed to by the arrow in Figure 3). Now the question is, how do we learn this combination of weights? We do this learning using gradient descent, so that the weights learnt in each iteration move the generated output towards the expected output. As gradient descent requires the cost function to be differentiable, we have taken the sigmoid as the activation function. Now let us write the above understanding in maths so that we can implement it later in Python. For the neural network in Figure 1, the error function is defined as $E = \frac{1}{2} \sum_i (a_i^L - y_i)^2$, where $y_i$ is the expected output of the ith unit of the output layer. We need to find $\frac{\partial E}{\partial w_{jk}^l}$ for all units of all layers and adjust each weight as below:

$w_{jk}^l \leftarrow w_{jk}^l - \eta \, \frac{\partial E}{\partial w_{jk}^l}$

where $\eta$ is the learning rate.

Let us derive an expression for $\frac{\partial E}{\partial w_{jk}^l}$ which can be generalized to any weight and any layer in the network.

$\frac{\partial E}{\partial w_{jk}^l} = \frac{\partial E}{\partial z_j^l} \cdot \frac{\partial z_j^l}{\partial w_{jk}^l}$

Let us call $\frac{\partial E}{\partial z_j^l}$ as $\delta_j^l$, which represents the change in error with a unit change in the input to the jth unit of the lth layer. So,

Since $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$, we have $\frac{\partial z_j^l}{\partial w_{jk}^l} = a_k^{l-1}$, and therefore $\frac{\partial E}{\partial w_{jk}^l} = \delta_j^l \, a_k^{l-1}$

Let us first compute $\delta_j^l$ when l = L, i.e. the output layer of the network (L = 4 in Figure 1). Then,

$\delta_j^L = \frac{\partial E}{\partial z_j^L} = \frac{\partial E}{\partial a_j^L} \cdot \frac{\partial a_j^L}{\partial z_j^L} = (a_j^L - y_j) \, \sigma'(z_j^L) = (a_j^L - y_j) \, a_j^L (1 - a_j^L)$
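As a quick sanity check of this expression, the sketch below (with hypothetical values) computes $\delta_j^L$ for a single output unit both from the formula above and from a numerical derivative of $E = \frac{1}{2}(a_j^L - y_j)^2$ with respect to $z_j^L$; the two values should agree closely.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z_L, y = 0.3, 1.0                                  #hypothetical input to one output unit and its target
a_L = sigmoid(z_L)

#Delta from the derived formula
delta_formula = (a_L - y) * a_L * (1 - a_L)

#Delta from a numerical derivative of E = 0.5*(a - y)^2 w.r.t. z
eps = 1e-6
E = lambda z: 0.5 * (sigmoid(z) - y) ** 2
delta_numeric = (E(z_L + eps) - E(z_L - eps)) / (2 * eps)

print(delta_formula, delta_numeric)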

Now let’s derive an expression for 17_deltaj for hidden layers:

$\delta_j^{l-1} = \frac{\partial E}{\partial z_j^{l-1}} = \sum_k \frac{\partial E}{\partial z_k^l} \cdot \frac{\partial z_k^l}{\partial a_j^{l-1}} \cdot \frac{\partial a_j^{l-1}}{\partial z_j^{l-1}} = \left( \sum_k \delta_k^l \, w_{kj}^l \right) a_j^{l-1} (1 - a_j^{l-1})$

Based on the above equation we can calculate $\delta_j^{l-1}$ for all units of a layer, provided the deltas have already been calculated for the layer ahead of it. So we have to calculate them backwards, starting from the output layer. To conclude:

$\frac{\partial E}{\partial w_{jk}^l} = \delta_j^l \, a_k^{l-1}$

$\delta_j^L = (a_j^L - y_j) \, a_j^L (1 - a_j^L)$ for the output layer, and

$\delta_j^{l-1} = \left( \sum_k \delta_k^l \, w_{kj}^l \right) a_j^{l-1} (1 - a_j^{l-1})$ for the hidden layers.
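A minimal vectorized sketch of these equations for a single example (hypothetical values; $W^l$ is stored with one row per unit of layer l, so $w_{kj}^l$ is element [k, j]):

import numpy as np

delta_l = np.array([0.02, -0.01, 0.05])            #deltas already computed for layer l
W_l = np.random.random((3, 3))                     #W_l[k, j] = weight from unit j of layer l-1 to unit k of layer l
a_prev = np.array([0.6, 0.4, 0.7])                 #activations of layer l-1

#Back-propagate through the weights, then through the derivative of the sigmoid
delta_prev = (W_l.T @ delta_l) * a_prev * (1 - a_prev)

#Gradient of E w.r.t. every weight feeding layer l: outer product of deltas and previous activations
dE_dW_l = np.outer(delta_l, a_prev)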

Similarly, we can derive an expression for $\frac{\partial E}{\partial b_j^l}$. Since $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$, we have $\frac{\partial z_j^l}{\partial b_j^l} = 1$, so:

$\frac{\partial E}{\partial b_j^l} = \frac{\partial E}{\partial z_j^l} \cdot \frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l$

Finally, the weights and biases are updated as discussed earlier:

$w_{jk}^l \leftarrow w_{jk}^l - \eta \, \frac{\partial E}{\partial w_{jk}^l} = w_{jk}^l - \eta \, \delta_j^l \, a_k^{l-1}$

$b_j^l \leftarrow b_j^l - \eta \, \frac{\partial E}{\partial b_j^l} = b_j^l - \eta \, \delta_j^l$
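Tying step 1 and step 2 together, here is a minimal sketch of one training step for a single layer and a single example (hypothetical values): the feed-forward computation from step 1, followed by the weight and bias updates just derived, assuming the layer's delta has already been back-propagated to it.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

eta = 0.5                                          #learning rate
a_prev = np.array([0.6, 0.4, 0.7])                 #activations of layer l-1
W_l = np.random.random((3, 3))                     #weights into layer l
b_l = np.random.random(3)                          #biases of layer l

#Feed-forward (step 1): weighted sum plus bias, then the sigmoid non-linearity
z_l = W_l @ a_prev + b_l
a_l = sigmoid(z_l)

#Back-propagation (step 2): assume delta for this layer has already been computed
delta_l = np.array([0.02, -0.01, 0.05])

#Gradient-descent updates from the equations above
W_l = W_l - eta * np.outer(delta_l, a_prev)        #w <- w - eta * delta_j * a_k
b_l = b_l - eta * delta_l                          #b <- b - eta * delta_j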

For online learning, one example is introduced at a time and the weights are updated incrementally (after each example).

For batch learning, a batch of examples is introduced and the weights are updated only once for the whole batch, using the average of the gradients over all examples in the batch.
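The difference between the two modes is only in when the update is applied. Here is a minimal sketch using a toy single linear unit with squared error; grad, inputs and targets are hypothetical stand-ins for the per-example gradient and data:

import numpy as np

def grad(W, x, y):
    return (W @ x - y) * x                         #dE/dW for a linear unit with E = 0.5*(W.x - y)^2

inputs = [np.array([0.1, 0.4]), np.array([0.9, 0.2]), np.array([0.5, 0.7])]
targets = [0.0, 1.0, 1.0]
eta = 0.5

#Online learning: the weights are updated after every single example
W = np.zeros(2)
for x, y in zip(inputs, targets):
    W = W - eta * grad(W, x, y)

#Batch learning: the per-example gradients are averaged, then one update for the whole batch
W = np.zeros(2)
batch_grad = sum(grad(W, x, y) for x, y in zip(inputs, targets)) / len(inputs)
W = W - eta * batch_grad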

Below is the Python implementation of feed-forward backpropagation in batch learning mode:


"""
Author: Ashish Verma
This code was developed to give a clear understanding of what goes on behind the curtains in a multi-layer feedforward backpropagation neural network.
Feel free to use/modify/improve/etc.

Caution: This may not be an efficient code for production related usage so thoroughly review and test the code before any usage.
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy import spatial

class multiLayerFeedForwardNN:
    def __init__(self, listOfNeuronsEachLayer):
        self.listOfNeuronsEachLayer = listOfNeuronsEachLayer
        self.dictOfWeightsEachLayer = {}
        self.dictOfActivationsEachLayer = {}
        self.dictOfDeltaEachLayer = {}
        self.dictOfZeeEachLayer = {}
        self.dictOfGradientsEachLayer = {}
        self.listAvgCost = []
        #Initialize weights with random values (this also includes bias terms)
        for iLayer in range(1,len(listOfNeuronsEachLayer)):
            self.dictOfWeightsEachLayer[iLayer + 1] = np.mat(np.random.random((listOfNeuronsEachLayer[iLayer], listOfNeuronsEachLayer[iLayer-1]+1)))

    def sigmoidFunc(self, z):
        result = 1/(1 + np.exp(-1*z))
        return result

    def feedForward(self, inputChunk):
        #Loop over all layers
        for iLayer in range(1, len(self.listOfNeuronsEachLayer)):
            #Save input itself as the activation for the input layer. We will use it later in the code
            if(iLayer == 1):
                self.dictOfActivationsEachLayer[1] = inputChunk
            #Insert extra element for bias having value 1 in input elements
            inputChunk = np.hstack([inputChunk, np.mat(np.ones(len(inputChunk))).T])
            #Calculate z using matrix multiplication of weights and inputs
            self.dictOfZeeEachLayer[iLayer+1] = np.dot(inputChunk, self.dictOfWeightsEachLayer[iLayer+1].T)
            #Calculate 'a' after non-linearizing z
            self.dictOfActivationsEachLayer[iLayer+1] = self.sigmoidFunc(self.dictOfZeeEachLayer[iLayer+1])
            #Consider activation of the current layer as input to the next layer
            inputChunk = self.dictOfActivationsEachLayer[iLayer+1]

    def calculateDelta(self, outputChunk):
        #Calculate delta backwards (output layer to 2nd layer)
        for iLayer in range(len(self.listOfNeuronsEachLayer),1,-1):
            #For last layer calculate delta
            if(iLayer == len(self.listOfNeuronsEachLayer)):
                self.dictOfDeltaEachLayer[iLayer] = np.multiply((self.dictOfActivationsEachLayer[iLayer] - outputChunk), np.multiply(self.dictOfActivationsEachLayer[iLayer],(1-self.dictOfActivationsEachLayer[iLayer])))
            #For rest of the layers calculate delta using delta of next layer
            else:
                wDelta = np.dot(self.dictOfDeltaEachLayer[iLayer+1], np.delete(self.dictOfWeightsEachLayer[iLayer + 1],self.dictOfWeightsEachLayer[iLayer + 1].shape[1]-1,1))
                dadz = np.multiply(self.dictOfActivationsEachLayer[iLayer],(1-self.dictOfActivationsEachLayer[iLayer]))
                self.dictOfDeltaEachLayer[iLayer] = np.multiply(wDelta, dadz)

    def calculateErrorGradient(self):
        #Calculate error gradient for all layers
        for iLayer in range(2,len(self.listOfNeuronsEachLayer)+1):
            #Gradient w.r.t all weights
            self.dictOfGradientsEachLayer[iLayer] = np.dot(self.dictOfDeltaEachLayer[iLayer].T,self.dictOfActivationsEachLayer[iLayer-1])
            #Gradient w.r.t. biases and add it to the other gradients
            self.dictOfGradientsEachLayer[iLayer] = np.hstack([self.dictOfGradientsEachLayer[iLayer], np.sum(self.dictOfDeltaEachLayer[iLayer],0).T])

    def updateWeights(self, learningRate):
        #For all layers update weights using average of gradients over all examples
        for iLayer in range(2, len(self.listOfNeuronsEachLayer)+1):
            self.dictOfWeightsEachLayer[iLayer] = self.dictOfWeightsEachLayer[iLayer] - (learningRate/len(self.dictOfActivationsEachLayer[1]))*self.dictOfGradientsEachLayer[iLayer]

    def calculateAverageCost(self, actualOutput, expectedOutput):
        #Calculate average of error function
        avgOverLayer = np.average([(1/2.)*x**2 for x in np.array(actualOutput-expectedOutput)],1)
        return np.average(avgOverLayer)            

    def trainBatchFFBP(self, learningRate, inputData, outputData, miniBatchSize, maxIterations):
        #If the size of the data is large and resources are less, then divide it into batches for computation.
        #Define batch size accordingly or leave empty if data size is not big
        if(miniBatchSize == ''):
            miniBatchSize = len(inputData)
        #Initialize iteration counter
        iterations = 0
        #Iterate over all examples 'maxIterations' number of times
        #We can even let the program iterate until the desired error reduction is achieved (this code doesn't implement it).
        for iterations in range(maxIterations):
            #iBatch takes care of dividing the full data into batches (if we chose to, in case the data is big)
            startRecord = 0
            for iBatch in range(len(inputData)//miniBatchSize):
                endRecord = startRecord + miniBatchSize
                inputChunk = inputData[startRecord:endRecord]
                outputChunk = outputData[startRecord:endRecord]
                self.feedForward(inputChunk)
                self.calculateDelta(outputChunk)
                self.calculateErrorGradient()
                self.updateWeights(learningRate)
                startRecord = endRecord
                #print "iBatch", iBatch
                #--------------Just stack up all calculated output for average error calculation-----------------#
                if(iBatch == 0):
                    activationMat = self.dictOfActivationsEachLayer[len(self.listOfNeuronsEachLayer)].copy()
                else:
                    activationMat = np.vstack([activationMat,self.dictOfActivationsEachLayer[len(self.listOfNeuronsEachLayer)]])
                #------------------------------------------------------------------------------------------------#
            if(endRecord < len(inputData)):
                inputChunk = inputData[startRecord:]
                outputChunk = outputData[startRecord:]
                self.feedForward(inputChunk)
                self.calculateDelta(outputChunk)
                self.calculateErrorGradient()
                self.updateWeights(learningRate)
                activationMat = np.vstack([activationMat,self.dictOfActivationsEachLayer[len(self.listOfNeuronsEachLayer)]])
            self.listAvgCost.append(self.calculateAverageCost(activationMat, outputData))
            iterations = iterations + 1
            print(iterations)

Now, let us use the implementation for a binary classification task. We will be using the dataset obtained from https://archive.ics.uci.edu/ml/datasets/Ionosphere.

The Ionosphere dataset's output is either 'g' (good) or 'b' (bad); let's replace 'g' with 1 and 'b' with 0. The dataset has 34 input features, so the input layer will have 34 neurons, and since we are using the sigmoid activation function, a single neuron in the output layer is enough for binary classification.
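The training code below reads the CSV with np.genfromtxt, which expects numeric values, so the 'g'/'b' label column needs to be converted to 1/0 beforehand. One simple way to do this is sketched below; 'ionosphere.data' is the raw file name used on the UCI page, and the output file name is just an assumption:

#Replace the 'g'/'b' label in the last column with 1/0 and write a purely numeric CSV
with open('ionosphere.data') as fin, open('ionosphere.csv', 'w') as fout:
    for line in fin:
        values = line.strip().split(',')
        values[-1] = '1' if values[-1] == 'g' else '0'
        fout.write(','.join(values) + '\n')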


#Import dataset
dataset = np.genfromtxt(r'C:\Research\BlogPloration\BackPropogation\ComparisonDataset\ionosphere.csv', delimiter=",")

#Initialize 'multiLayerFeedForwardNN' with 34, 27 and 1 neurons in input, hidden and output layers respectively
objMLFFNN = multiLayerFeedForwardNN([34,27,1])
#Start training with learning rate as 0.5 and 120,000 iterations
objMLFFNN.trainBatchFFBP(0.5, np.mat(dataset[:,:34]), np.mat(dataset[:,34]).T,'', 120000)

This NN model fits the training data fully. Let's look at the plot of the error function value against iterations during training; the error for every iteration is stored in 'listAvgCost'.
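The plot can be reproduced with matplotlib (already imported at the top of the implementation); a minimal sketch:

import matplotlib.pyplot as plt

plt.plot(objMLFFNN.listAvgCost)
plt.xlabel('Iteration')
plt.ylabel('Average error')
plt.show()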

Plot: average error ('listAvgCost') vs. training iterations.

Note: In order to keep things simple, I haven't talked about the regularization term and other aspects. I would suggest that readers study these further once the basics are very clear.