Structuring Machine Learning Projects

Course 3 of Andrew Ng's Deep Learning Series

The Main Question

If you have a training accuracy of 90% and you are not happy with that, what do you do? Collect more data? Try other algos? Collect more diverse data? How do you decide what to do?

Chain of assumptions in ML

  1. Fit training set well on cost function. If this is not happening, try a bigger network or a different optimization algo like Adam. You should aim for roughly human-level performance here.
  2. Fit dev set well on cost function. If this is not happening, you are overfitting the training set. Try regularization, train on a bigger training set, or try a different NN architecture.
  3. Fit test set well on cost function. If the fit on the test set is much worse than on the dev set, you have overfit the dev set. Get a bigger dev set, or try a different NN architecture.
  4. Perform well in the real world. If performance on the dev/test set is good but performance in the real world is bad, check whether the cost function really measures what you care about.
Andrew Ng does not like early stopping because it affects the fit on both the training and dev sets at once, which muddles this chain of diagnostics.

Single real number evaluation metric

Have a single real number metric to compare various algos.

You may combine precision and recall into a single number, e.g., the F1 score (their harmonic mean).
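A minimal sketch (the precision/recall numbers are made up):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Classifier A: balanced precision/recall; Classifier B: higher precision, lower recall.
print(f1_score(0.95, 0.90))  # ~0.924
print(f1_score(0.98, 0.85))  # ~0.910 -> pick A by the single-number metric
```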

Some metrics could be satisficing, e.g. the running time of classification just needs to be within a threshold. Others would be optimizing, e.g. accuracy, which you want to maximize.
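A sketch of picking among models with one optimizing metric (accuracy) and one satisficing metric (running time); the models and numbers are made up:

```python
# (name, accuracy, running_time_ms) -- illustrative numbers
models = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),  # most accurate, but too slow
]

MAX_RUNTIME_MS = 100  # satisficing: must be within this threshold
feasible = [m for m in models if m[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda m: m[1])  # optimizing: maximize accuracy
print(best)  # ('B', 0.92, 95)
```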

Train/dev/test set distributions

Dev and test set should come from the same distribution and should reflect the data you want to do well on.

When data was less abundant, 60/20/20 was a good training/dev/test split. In the data-abundant neural-net scenario, 98/1/1 is a good split. The test set just needs to be big enough to give high confidence in the overall performance of the system. Some people omit the test set entirely.

Sometimes you may wish to change the metric midway. While building a cat vs. no-cat classifier, maybe a "better" classifier lets pornographic images through as cat images. So you need to change the cost function to penalize this misclassification heavily.
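One hedged way to encode this: weight errors on pornographic images much more heavily in the evaluation metric. A sketch with made-up arrays:

```python
import numpy as np

# y_true, y_pred: 0 = non-cat, 1 = cat; is_porn flags pornographic images.
# Illustrative arrays -- a weight of 10 makes porn misclassifications 10x as costly.
y_true  = np.array([1, 0, 1, 0, 0])
y_pred  = np.array([1, 1, 1, 0, 1])
is_porn = np.array([0, 0, 0, 0, 1])

w = np.where(is_porn == 1, 10.0, 1.0)
weighted_error = np.sum(w * (y_true != y_pred)) / np.sum(w)
print(weighted_error)  # the misclassified porn image dominates the error
```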

Human level performance

For perception problems, human-level performance is close to Bayes error. You should consider the best human-level performance possible. E.g., in radiology an expert radiologist may be better than an average radiologist, and a team of experts may be better than a single expert. Use whichever gives the lowest possible error as your estimate.

  1. Difference between 0 and human-level performance is (approximately) Bayes error
  2. Difference between human-level performance and training error is avoidable bias
  3. Difference between training error and dev error is variance
  4. Difference between dev error and test error is the degree of overfitting to the dev set

You should compute all these errors and that will help you decide how to improve your algorithm.
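A worked example with illustrative numbers:

```python
# Illustrative error rates (fractions)
human_error = 0.01   # proxy for Bayes error
train_error = 0.08
dev_error   = 0.10

avoidable_bias = train_error - human_error  # ~0.07
variance       = dev_error - train_error    # ~0.02
# Avoidable bias dominates -> try a bigger network, train longer, or a better optimizer.
print(avoidable_bias, variance)
```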

Tasks where machines can outperform humans: online ads, loan approvals, product recommendations, logistics. (Structured data, not natural perception problems)

Also, in some speech recognition, image recognition, and radiology tasks, computers surpass single-human performance.

Error Analysis

When the error is not good enough, manually examine mispredictions. Look at a subset of misclassified dev-set examples and note the reason for each error. Are dogs being mislabeled as cats? Are lions/cheetahs being mislabeled as cats? Are blurry images being mislabeled as cats? Figure out the most prominent reasons and address those first. If lots of dogs are being mislabeled as cats, it makes sense to put more dog images in the training set.
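A minimal sketch of the tally: label the reason for each of ~100 misclassified dev examples and count the categories (the categories and counts here are made up):

```python
from collections import Counter

# Hypothetical hand-labeled reasons for misclassified dev-set images
reasons = ["dog", "blurry", "dog", "big cat", "dog", "blurry", "dog"]  # ... ~100 in practice

tally = Counter(reasons)
total = len(reasons)
for reason, count in tally.most_common():
    print(f"{reason}: {count}/{total} = {count/total:.0%} of errors")
# If "dog" accounts for, say, 50% of errors, fixing dogs caps the possible gain at 50%.
```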

Sometimes the data has mislabeled examples. Some mislabels in the training set are okay, because NN algos are robust to them as long as the errors are random. For the dev/test sets, first estimate how much of a boost you would get by correcting the labels, and correct them only if the boost is worthwhile. If you fix the dev set, fix the test set too. Ideally you would also examine the examples your algo got right, since some may be "right" only because the label was wrong; but for accurate algos this is impractical, because there are far more correct examples to examine.

Build first, then iterate

You understand the data and its challenges only when you iterate. Build the first system quickly and use bias/variance analysis and error analysis to prioritize next steps.

Mismatched training and dev/test set

DL algos are data hungry, so teams want to shove in as much data as they can get hold of. For example, you can scrape images from the internet, or you can purchase data. You can use data from various sources to train, but the dev/test sets should contain only examples that are representative of your use case.

When your training and dev sets come from different distributions, the gap between training error and dev error may not reflect variance. It may just be that the training set is easier. To separate the two effects, carve a training-dev set out of the training set (same distribution as training, but not trained on). Now: the gap between training error and training-dev error reflects variance, and the gap between training-dev error and dev error reflects data mismatch.
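With illustrative numbers:

```python
# Illustrative error rates when training and dev data come from different distributions
human_error     = 0.00
train_error     = 0.01
train_dev_error = 0.09   # same distribution as training, but not trained on
dev_error       = 0.10

variance      = train_dev_error - train_error  # ~0.08 -> mostly a variance problem
data_mismatch = dev_error - train_dev_error    # ~0.01 -> little data mismatch
```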

What if you have a data mismatch problem? Perform manual inspection. Maybe a lot of the dev/test examples are noisy (in a speech recognition system). In that case you can add noise to the training set, but be careful: if you have 10,000 hours of training data, you should add 10,000 hours' worth of varied noise too. If you just repeat 1 hour of noise, you will overfit to it. To a human ear all the noise may sound the same, but the machine will overfit. Similarly, for computer vision you can synthesize images with background cars, etc.
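A hedged sketch of audio synthesis with NumPy, assuming clean speech and a longer noise recording as 1-D arrays at the same sample rate:

```python
import numpy as np

def mix_noise(clean, noise, snr_db=10.0, rng=np.random.default_rng()):
    """Overlay a random crop of `noise` onto `clean` at the given SNR.

    Assumes `noise` is longer than `clean`; taking a random crop each time
    avoids reusing the same stretch of noise over and over.
    """
    start = rng.integers(0, len(noise) - len(clean))
    crop = noise[start:start + len(clean)]
    # Scale the noise so the mix has the desired signal-to-noise ratio
    scale = np.sqrt(np.mean(clean**2) / (np.mean(crop**2) * 10 ** (snr_db / 10)))
    return clean + scale * crop
```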

Transfer Learning

Say you have trained an image recognition NN and want to adapt it to radiology. What you can do is chop off the last layer, replace it with a new one suited to the new task (with randomly initialized weights), and relearn the weights. This works because a lot of low-level learning, like detecting edges and shapes, transfers across tasks.
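A hedged PyTorch-style sketch, assuming a torchvision ResNet pretrained on ImageNet and a hypothetical radiology task with 3 classes:

```python
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on a large image dataset
net = models.resnet18(weights="IMAGENET1K_V1")

# With a small radiology dataset, freeze the earlier layers and retrain only the head
for p in net.parameters():
    p.requires_grad = False

# Chop off the last layer and replace it with a new, randomly initialized one
net.fc = nn.Linear(net.fc.in_features, 3)  # 3 hypothetical radiology classes
# Now fine-tune on the radiology data; with more data you could unfreeze more layers.
```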

When does it make sense? Transfer from task A to task B when A and B have the same input type, you have a lot more data for A than for B, and low-level features learned on A are likely to help B.

Multi-task Learning

Learn multiple classifications at once. For example, you may need to detect whether an image contains (a) pedestrians, (b) cars, (c) stop signs, (d) traffic lights. In this case the final layer has four neurons, and the loss function is the sum of the losses over the different predictions. If the label for some variable is unavailable in some training examples, you skip that component of the loss for those examples. Multi-task learning makes sense when the tasks share low-level features, the amount of data for each task is similar, and you can train a network big enough to do well on all tasks.
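A hedged NumPy sketch of the masked loss, using -1 as a made-up marker for "label unavailable":

```python
import numpy as np

def multitask_loss(y_hat, y):
    """Sum of binary cross-entropy over the 4 tasks, skipping unlabeled entries (y == -1)."""
    mask = (y != -1)
    eps = 1e-12
    bce = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(bce * mask) / max(np.sum(mask), 1)

y_hat = np.array([0.9, 0.2, 0.7, 0.1])  # predicted probabilities: pedestrian/car/stop sign/light
y     = np.array([1,   0,  -1,   0])    # stop-sign label unavailable -> skipped in the loss
print(multitask_loss(y_hat, y))
```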

Multi-task learning is used much less often than transfer learning; people usually just train separate NNs.

End to end deep learning

Various stages of a traditional ML pipeline are replaced by one NN. E.g., in speech recognition, people transform the input audio clip to features, then to phonemes, then to words, and then to a transcript. With end-to-end DL you map the input directly to the output.

Sometimes you break the process into parts. For example, to identify a person in a photo you can do it in two phases: (a) first detect where the face is, (b) then identify who the person is.

In machine translation, say English to French, end-to-end learning works well because there is a lot of (English, French) training data.

Pros

Lets the data speak; less hand-designing of components and features is needed.

Cons

May need a very large amount of input-output data; excludes potentially useful hand-designed components, which inject human knowledge when data is limited.