I am studying some machine learning on my own and I am practicing (in Python) with the assignments of the course held by Andrew Ng.

After completing the fourth exercise by hand, I thought I would redo it in Keras to practice with the library.

In the exercise we have 5000 images of handwritten digits, from 0 to 9. Each image is a 20x20 matrix. The dataset is stored in a matrix X of shape 5000x400 (each image has been 'unrolled') and the labels are stored in a matrix y of shape 5000x10. Each row of y is a one-hot vector.
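To make the data layout concrete, here is a minimal NumPy sketch of the unrolling and the one-hot encoding. The arrays `images` and `labels` are made-up stand-ins, not the course data, and `one_hot` is just an illustrative helper. The row-major `reshape` is also an assumption, since the course files may unroll the pixels in a different order.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a one-hot matrix of shape (m, num_classes)."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical stand-in data: 6 random 20x20 "images" with made-up digit labels.
rng = np.random.default_rng(0)
images = rng.random((6, 20, 20))
labels = np.array([0, 3, 9, 1, 1, 7])

# Unroll each 20x20 image into a row of 400 features (row-major order, an assumption).
X_demo = images.reshape(len(images), -1)
y_demo = one_hot(labels)

print(X_demo.shape)  # (6, 400)
print(y_demo.shape)  # (6, 10)
```

With the real dataset the same two steps would give the 5000x400 matrix X and the 5000x10 matrix y described above.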

The exercise asks us to implement backpropagation to maximize the log likelihood for a simple neural network with one input layer, one hidden layer and one output layer. The hidden layer has 25 neurons and the output layer has 10. We use sigmoid as the activation for both layers.
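For reference, this is roughly the objective as I understand it, written as a NumPy sketch rather than the assignment's own code. It assumes the usual formulation of the course's regularized cost: a negative log likelihood summed over the 10 sigmoid outputs, averaged over the m examples, plus an L2 penalty scaled by lambda/(2m) that skips the biases. All function and variable names are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass: 400 inputs -> 25 sigmoid units -> 10 sigmoid units."""
    hidden = sigmoid(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

def regularized_cost(X, y, W1, b1, W2, b2, lam=1.0):
    """Negative log likelihood (averaged over examples) plus an L2 weight penalty.

    Assumed form: the penalty is lam / (2 * m) times the sum of squared weights,
    and the biases b1, b2 are not penalized.
    """
    m = X.shape[0]
    # Clip the predictions so that log(0) can never occur.
    h = np.clip(forward(X, W1, b1, W2, b2), 1e-12, 1 - 1e-12)
    data_term = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    penalty = lam / (2 * m) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return data_term + penalty

# Sanity check with all-zero parameters: every output is 0.5, so the cost
# is 10 * log(2) per example and the penalty vanishes.
X_check = np.zeros((4, 400))
y_check = np.eye(10)[[0, 1, 2, 3]]
W1, b1 = np.zeros((400, 25)), np.zeros(25)
W2, b2 = np.zeros((25, 10)), np.zeros(10)
cost_check = regularized_cost(X_check, y_check, W1, b1, W2, b2)
print(cost_check)
```

Minimizing this cost is the same as maximizing the (penalized) log likelihood, which is what backpropagation supplies the gradients for.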

My code in Keras is this

```
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

# X has shape (5000, 400); y has shape (5000, 10) with one-hot rows.
model = Sequential()
model.add(Dense(25, input_shape=(400,), use_bias=True, kernel_regularizer=regularizers.l2(1), activation='sigmoid', kernel_initializer='glorot_uniform'))
model.add(Dense(10, use_bias=True, kernel_regularizer=regularizers.l2(1), activation='sigmoid', kernel_initializer='glorot_uniform'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X, y, batch_size=5000, epochs=100, verbose=1)
```

Since I want this to be as similar as possible to the assignment, I have used the same initial weights as the assignment, the same regularization parameter, the same activations, and gradient descent as the optimizer (the assignment actually uses the truncated Newton method, but I don't think my problem lies there).

I thought I was doing everything correctly, but when I train the network I get 10% accuracy on the training dataset. Even when I play with the parameters a little, the accuracy doesn't change much. To understand the problem better, I tested it on smaller pieces of the dataset. For instance, if I select a subset of 100 elements containing x images of zero and 100-x images of one, I get x% training accuracy. My guess is that the network is optimizing the parameters to recognise only the first digit.

Now my questions are: what am I missing? Why isn't this the right implementation of the neural network described above?

## 1 comment

## @Shubham Panchal 2018-12-06 15:37:33

If you are practising on the MNIST dataset to classify the 10 digits, you have 10 classes to predict. Rather than sigmoid, you should use ReLU in the hidden layers (in your case, the first layer) and softmax activation on the output layer. Use the categorical cross-entropy loss function with the adam or sgd optimizer.
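To illustrate the softmax suggestion, here is a minimal NumPy sketch (not Keras code) that compares sigmoid and softmax outputs on some made-up logits and evaluates the categorical cross-entropy by hand. The values in `logits` and `y_true` are hypothetical, and the helper functions are written out only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Mean over examples of -sum_k y_k * log(p_k)."""
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

# Hypothetical pre-activation outputs for 2 examples and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = softmax(logits)
print(probs.sum(axis=1))            # each row sums to 1
print(sigmoid(logits).sum(axis=1))  # rows do not sum to 1 in general
print(categorical_crossentropy(y_true, probs))
```

With one-hot labels, the categorical cross-entropy formula only involves the predicted probability of the true class. Softmax ties the classes together, so raising one probability lowers the others, whereas independent sigmoid outputs are not constrained to sum to 1.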