Introduction

  1. Why deep learning?
    • conventional machine learning methods cannot exploit the power of big data
    • their performance saturates after a certain amount of data
  2. Why sudden boom?
    • lots of data
    • computational power
    • new algorithms for fast optimization
  3. Types of data
    • structured - housing prices, financial data
    • unstructured - images, audio, speech, text (deep learning is very strong here)
  4. Logistic regression
    • a binary classification algorithm

  • 2 steps:

    step 1: forward propagation

    • data : (x,y)
    • compute z = w1*x1 + w2*x2 + b (i.e., z = w·x + b)
    • compute a = sigmoid(z)
    • compute loss = L(a,y)
    • compute cost J(w) = average loss calculated over all training samples

    step 2: backward propagation

    • compute the gradient of the loss w.r.t. the weights via the chain rule: dL/dw = dL/da * da/dz * dz/dw
    • update the weights using gradient descent: w := w - alpha*dL/dw (alpha is the learning rate)

    Then run forward propagation again and repeat the process until the cost converges (a minimal NumPy sketch of this loop appears after this list)

  5. Loss function
    • measures how good our classifier is by comparing the predicted value with the true value
    • we minimize the loss function to find the best parameters for the classifier
    • minimization of the loss (or cost) function is done using gradient descent
    • common loss functions are:
      • mean squared error : does not work well for logistic regression, because combined with the sigmoid it gives a non-convex cost with local minima
      • cross-entropy : -log(likelihood); the cost function is the average of this loss over all the training samples (see the maximum-likelihood check after this list)
  6. Gradient descent
  7. Maximum likelihood estimation
  8. Perceptron
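
The two-step loop above can be written out directly. Below is a minimal NumPy sketch of logistic regression trained with batch gradient descent; the data layout (`X` of shape (n_features, m), `y` of shape (1, m)) and the hyperparameters `alpha` and `num_iters` are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, num_iters=1000):
    """X: (n_features, m) data matrix, y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w = np.zeros((n, 1))              # weight vector
    b = 0.0                           # bias

    for _ in range(num_iters):
        # step 1: forward propagation
        z = w.T @ X + b               # z = w.x + b, shape (1, m)
        a = sigmoid(z)                # predicted probabilities
        cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # cross-entropy averaged over samples

        # step 2: backward propagation (chain rule dL/dw = dL/da * da/dz * dz/dw)
        dz = a - y                    # simplifies to this for sigmoid + cross-entropy
        dw = (X @ dz.T) / m           # gradient of the cost w.r.t. w
        db = np.mean(dz)              # gradient of the cost w.r.t. b

        # gradient descent update
        w -= alpha * dw
        b -= alpha * db

    return w, b, cost
```

Zero initialization is fine here because a single logistic unit has no symmetry to break; random initialization only becomes necessary for networks with hidden layers (see the next section).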
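
To connect cross-entropy with maximum likelihood estimation: for a label y in {0, 1} predicted with probability a, the likelihood of the sample is a^y * (1-a)^(1-y), so -log(likelihood) is exactly the cross-entropy loss. A tiny numerical check with arbitrarily chosen values:

```python
import numpy as np

a, y = 0.8, 1                          # predicted probability and true label (illustrative values)
likelihood = a**y * (1 - a)**(1 - y)   # Bernoulli likelihood of this sample
nll = -np.log(likelihood)              # -log(likelihood)
ce = -(y * np.log(a) + (1 - y) * np.log(1 - a))  # cross-entropy loss
print(nll, ce)                         # both print 0.2231... -> identical
```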

Shallow Neural Network

  • a series of logistic regression units stacked into layers (a small sketch follows below)
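
A minimal sketch of that idea, with assumed shapes (`X`: (n_x, m), `W1`: (n_h, n_x), `b1`: (n_h, 1), `W2`: (1, n_h), `b2`: (1, 1)): every hidden unit computes its own z = w·x + b followed by a non-linearity, and the output unit is one more logistic regression on top of the hidden activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_forward(X, W1, b1, W2, b2):
    """One hidden layer: each row of W1 (with its bias in b1) is one logistic-regression-style unit."""
    Z1 = W1 @ X + b1      # pre-activations of the hidden units, shape (n_h, m)
    A1 = sigmoid(Z1)      # sigmoid here makes each hidden unit literally a logistic unit
                          # (in practice tanh or ReLU are preferred; see the activation notes below)
    Z2 = W2 @ A1 + b2     # the output unit is one more logistic regression on A1
    A2 = sigmoid(Z2)      # predicted probability, shape (1, m)
    return A2
```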

  1. Activation functions (a NumPy sketch of these appears at the end of this section)
    • Sigmoid - binary classification, typically the output layer
    • Tanh - a shifted and rescaled sigmoid; its zero-centered output makes learning in the next layer easier
    • ReLU - zeroes out negative activations; the usual choice for hidden layers; mitigates the vanishing-gradient problem
    • Leaky ReLU - gives a small non-zero slope for negative inputs, so units with negative pre-activations still receive a gradient (avoids "dead" ReLUs)
    • with no activation a = z, so the network learns no non-linearity and the whole neural network collapses to linear regression
  2. Initialization of weights
    • if all weights are initialized to zero, all hidden units compute the same activation, have the same effect on the output, and receive the same gradient update, so every unit learns exactly the same features - no new learning
    • to break this symmetry, weights are initialized randomly
    • the random weights should also be small: large weights push sigmoid/tanh activations into their saturated regions, where gradients vanish (a one-line initialization sketch appears at the end of this section)
  3. Deep neural networks
    • learn more complex features than shallow networks
    • require a lot of data and computational power
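
For reference, a minimal NumPy sketch of the four activation functions discussed above (the 0.01 slope in leaky ReLU is a common but arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes to (0, 1); binary output layer

def tanh(z):
    return np.tanh(z)                      # zero-centered, squashes to (-1, 1)

def relu(z):
    return np.maximum(0, z)                # zeroes out negative activations

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small non-zero slope for negative inputs
```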
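
And a one-line version of the weight-initialization rule above, for a layer with `n_in` inputs and `n_out` units; the 0.01 scale is an illustrative small constant (schemes such as Xavier/He initialization choose it more carefully):

```python
import numpy as np

def init_layer(n_in, n_out, scale=0.01):
    W = np.random.randn(n_out, n_in) * scale   # small random values break the symmetry
    b = np.zeros((n_out, 1))                   # biases can safely start at zero
    return W, b
```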