If you have a training accuracy of 90% and you are not happy with that, what do you do? Collect more data? Try other algos? Collect more diverse data? How do you decide what to do?
Have a single real number metric to compare various algos.
You may combine precision and recall (say, by taking their harmonic mean, which gives the F1 score).
Some metrics could be satisficing, e.g. classification running time should be within a threshold; others are optimizing, e.g. accuracy, which you maximize subject to the satisficing constraints.
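A minimal sketch of both ideas: F1 as a single real-number metric, and a satisficing runtime constraint used as a filter. The candidate models and the 100 ms threshold are invented for illustration:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: one number to rank models by."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical candidates: (name, precision, recall, runtime in ms).
models = [
    ("A", 0.95, 0.90, 80),
    ("B", 0.98, 0.85, 95),
    ("C", 0.97, 0.92, 150),
]

# Satisficing: runtime must stay under 100 ms. Optimizing: maximize F1.
feasible = [m for m in models if m[3] <= 100]
best = max(feasible, key=lambda m: f1(m[1], m[2]))
print(best[0], f1(best[1], best[2]))  # -> A 0.924... (C is fast enough? no: excluded)
```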
Dev and test sets should come from the same distribution and should reflect the data you want to do well on.
When data was less abundant, 60/20/20 was a good training/dev/test split. In the data-abundant neural-net scenario, 98/1/1 is a good split. The test set just needs to be big enough to give high confidence in the overall performance of the system. Some people omit the test set altogether.
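A quick sketch of what 98/1/1 looks like in practice; the dataset size and the 10,000-example dev/test sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(1_000_000)  # hypothetical dataset of 1M examples

# 98/1/1: with this much data, 10k dev/test examples are plenty to
# distinguish between algorithms with high confidence.
n_dev = n_test = 10_000
dev_idx = indices[:n_dev]
test_idx = indices[n_dev:n_dev + n_test]
train_idx = indices[n_dev + n_test:]
print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```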
Sometimes you may wish to change the metric midway. While building a cat vs. no-cat classifier, maybe in a "better" classifier, pornographic images get classified as cat images. So you need to change the cost function to penalize this misclassification heavily.
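One way to express that change is a weighted error metric, where mistakes on pornographic images count far more. This is a sketch; the weight of 10 and the `is_porn` flags are placeholders:

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn, w_porn=10.0):
    """Misclassification error where a mistake on a pornographic image
    counts w_porn times as much as an ordinary mistake."""
    w = np.where(is_porn, w_porn, 1.0)
    mistakes = (y_true != y_pred).astype(float)
    return (w * mistakes).sum() / w.sum()

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1])              # one mistake, on a porn image
is_porn = np.array([False, True, False, False])
print(weighted_error(y_true, y_pred, is_porn))  # 10/13 ≈ 0.769
```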
For perception problems, human-level performance is close to Bayes error. You should consider the best human-level performance possible as your proxy for it. E.g. in radiology, an expert radiologist may be better than an average radiologist, and a team of experts may be better than a single expert. Use whichever setup gives the lowest possible error.
You should compute all these errors (human-level, training, dev), and the gaps between them will help you decide how to improve your algorithm: a large human-level-to-training gap means avoidable bias; a large training-to-dev gap means variance.
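A sketch of that bookkeeping; the error numbers are made up, and the suggested remedies follow the standard bias/variance playbook:

```python
def diagnose(human_err, train_err, dev_err):
    """Compare the two gaps and suggest where to focus effort."""
    avoidable_bias = train_err - human_err  # gap to the proxy for Bayes error
    variance = dev_err - train_err          # gap between train and dev
    if avoidable_bias > variance:
        focus = "bias: bigger network, train longer, better architecture"
    else:
        focus = "variance: more data, regularization, data augmentation"
    return round(avoidable_bias, 4), round(variance, 4), focus

# Team of expert radiologists at 0.5% error as the Bayes-error proxy:
print(diagnose(human_err=0.005, train_err=0.03, dev_err=0.035))
# -> (0.025, 0.005, 'bias: ...')  bias gap dominates, so attack bias first
```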
Tasks where machines can outperform humans: online ads, loan approvals, product recommendations, logistics. (Structured data, not natural perception problems)
Also, in some speech recognition, image recognition, and radiology tasks, computers surpass single-human performance.
When the error is not good enough, examine mispredictions manually. Take a subset of mispredictions and look for the reason behind each error. Is it that dogs are being mislabeled as cats? That lions/cheetahs are mislabeled as cats? That blurry images are mislabeled as cats? Figure out the most prominent reason and try to solve that. If lots of dogs are being mislabeled as cats, it makes sense to put more dog images in the training set.
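Error analysis is just a tally over the hand-examined sample. A sketch, where the categories and the ten reviewed examples are invented (an example can carry several tags):

```python
from collections import Counter

# Tags assigned while manually reviewing a sample of dev-set mispredictions.
reviewed = [
    ["dog"], ["blurry"], ["dog"], ["big cat"], ["dog", "blurry"],
    ["blurry"], ["dog"], ["big cat"], ["dog"], ["blurry"],
]

counts = Counter(tag for tags in reviewed for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{len(reviewed)} = {n / len(reviewed):.0%} of errors")
# dog: 5/10 = 50% -> fixing the dog confusion has the highest ceiling
```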
Sometimes data has mislabeled examples. Some mislabels in the training set are okay, because NN algos are robust to them as long as the errors are random. For the dev/test sets, first estimate how much boost you would get by correcting the labels, and correct them only if that boost is worthwhile. If you fix the dev set, fix the test set too. Ideally you should also re-examine the examples your algo got right only because of a wrong label; but for accurate algos that is impractical, since there are too many correct predictions to examine.
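The boost estimate is simple arithmetic. With made-up numbers: if overall dev error is 10% and 6% of the examined errors trace back to wrong labels, fixing labels buys you at most 0.6%:

```python
dev_error = 0.10           # overall dev-set error
frac_due_to_labels = 0.06  # share of examined errors caused by wrong labels

max_gain = dev_error * frac_due_to_labels
print(f"Error attributable to mislabels: {max_gain:.1%}")  # 0.6%
# Fixing labels matters mainly when the algorithms you are comparing
# differ by less than this amount.
```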
You understand the data and its challenges only when you iterate. Build the first system quickly and use bias/variance analysis to prioritize next steps.
DL algos are data hungry. Teams want to shove in as much data as they can get hold of. For example, you can get images from the internet, or you can purchase data. You can train on data from various sources, but the dev/test sets should contain only examples representative of your use case.
When your training and dev sets are from different distributions, the training/dev error difference may not reflect variance; it may just be that the training set is easier. To catch this, carve a training-dev set out of the training set (same distribution as training, but not trained on). Now: the human-level-to-training gap is avoidable bias; the training-to-training-dev gap is variance; the training-dev-to-dev gap is data mismatch; and the dev-to-test gap is the degree of overfitting to the dev set.
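A sketch extending the earlier diagnosis with the training-dev set; the numbers are illustrative:

```python
def gaps(human_err, train_err, train_dev_err, dev_err):
    """Break the total error down into the three actionable gaps."""
    return {
        "avoidable bias": round(train_err - human_err, 4),
        "variance":       round(train_dev_err - train_err, 4),
        "data mismatch":  round(dev_err - train_dev_err, 4),
    }

# Training-dev error close to training error, but dev much worse:
# the problem is data mismatch, not variance.
print(gaps(human_err=0.01, train_err=0.02, train_dev_err=0.025, dev_err=0.09))
# -> {'avoidable bias': 0.01, 'variance': 0.005, 'data mismatch': 0.065}
```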
What if you have a data mismatch problem? Perform manual inspection. Maybe a lot of the dev/test audio is noisy (in a speech recognition system). In that case you can add noise to the training set. But be careful: if you have 10K hours of training data, you should use 10K hours' worth of varied noise too. If you just loop 1 hour of noise, you will overfit to it. To a human ear all the noise sounds the same, but the machine will overfit. Similarly, for computer vision you can synthesize images with background cars etc.
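A minimal sketch of noise mixing with numpy. The key point from above is encoded in the random offset: draw a fresh noise segment per clip from a long, varied pool rather than reusing one short track. Sizes and the SNR are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(clip, noise_pool, snr_db=10.0):
    """Mix a randomly chosen noise segment into a clip at a target SNR.
    Sampling a fresh segment per clip avoids overfitting to one noise track."""
    start = rng.integers(0, len(noise_pool) - len(clip))
    noise = noise_pool[start:start + len(clip)]
    # Scale noise so that 10 * log10(P_clip / P_noise) == snr_db.
    p_clip = np.mean(clip ** 2)
    p_noise = np.mean(noise ** 2)
    noise = noise * np.sqrt(p_clip / (p_noise * 10 ** (snr_db / 10)))
    return clip + noise

clip = rng.standard_normal(16_000)               # 1 s of fake audio at 16 kHz
noise_pool = rng.standard_normal(16_000 * 60)    # stand-in for a long noise recording
noisy = add_noise(clip, noise_pool)
```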
Say you have trained an image recognition NN and want to adapt it to radiology. What you can do is chop off the last layer, replace it as per the new task (with randomly initialized weights), and relearn the weights. This works because a lot of low-level learning, like detecting edges and shapes, transfers.
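A sketch of this in PyTorch, assuming a torchvision ResNet as the pretrained network; the 3 radiology classes are made up:

```python
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on a large image dataset.
net = models.resnet18(weights="IMAGENET1K_V1")

# Chop off the last layer and replace it for the new task
# (say, 3 radiology classes), with freshly initialized weights.
net.fc = nn.Linear(net.fc.in_features, 3)

# With little radiology data, freeze the transferred layers and train
# only the new head; with more data, you can fine-tune everything.
for name, p in net.named_parameters():
    p.requires_grad = name.startswith("fc")
```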
When does it make sense? You transfer from task A to task B when: (a) tasks A and B have the same kind of input; (b) you have a lot more data for task A than for task B; (c) low-level features learned on A could be helpful for B.
Learn multiple classifications at once. For example, you may need to detect whether an image has (a) pedestrians, (b) cars, (c) stop signs, (d) traffic lights. In this case the final layer has four neurons, and the loss function is the sum of the losses over the different predictions. If for some training examples a label is unavailable for some of the variables, you skip that component of the loss for those examples (see the sketch below). It makes sense when: (a) the tasks can benefit from shared low-level features; (b) the amount of data you have for each task is roughly similar; (c) you can train a network big enough to do well on all tasks.
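A sketch of that masked loss in PyTorch; marking missing labels with -1 is an assumption of this example, not a library convention:

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels):
    """Per-task binary cross-entropy, averaged over the labels that are
    present. Missing labels (marked -1 here) are skipped entirely."""
    mask = labels >= 0  # known labels only
    losses = F.binary_cross_entropy_with_logits(
        logits, labels.clamp(min=0).float(), reduction="none")
    return (losses * mask).sum() / mask.sum()

# 2 images x 4 tasks (pedestrian, car, stop sign, traffic light);
# -1 marks a label the annotator did not provide.
logits = torch.randn(2, 4)
labels = torch.tensor([[1, 0, 1, -1],
                       [0, -1, 1, 1]])
print(multitask_loss(logits, labels))
```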
Multi-task learning is used much less often than transfer learning; people mostly just train separate NNs.
End-to-end learning: various stages of a traditional ML pipeline are replaced by one NN. E.g. in speech recognition, people transform the input audio clip into features, then phonemes, then words, and then a transcript. In DL you map input directly to output.
Sometimes you break the process into parts. Example: to identify a person in a photo, you can do it in two phases: (a) first detect where the face is; (b) then recognize who the person is.
In machine translation, say English to French, end-to-end learning works well because there is a lot of training data of (English, French) sentence pairs.