= Train, Validate, and Test =
[https://www.youtube.com/results?search_query=Train+Validate+Test+Neural+Network YouTube search...]
[https://www.google.com/search?q=Train+Validate+Test+Neural+Network ...Google search]

* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Train, Validate, and Test]]
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
** [[Evaluation - Measures#Accuracy|Accuracy]]
** [[Evaluation - Measures#Precision & Recall (Sensitivity)|Precision & Recall (Sensitivity)]]
** [[Evaluation - Measures#Specificity|Specificity]]
** [[Benchmarks]]
** [[Bias and Variances]]
** [[Explainable / Interpretable AI]]
** [[Algorithm Administration#Model Monitoring|Model Monitoring]]
** [[ML Test Score]]
* [[AI Verification and Validation]]
* [[Objective vs. Cost vs. Loss vs. Error Function]]
* [[Overfitting Challenge]]
* [https://medium.com/datadriveninvestor/data-science-essentials-why-train-validation-test-data-b7f7d472dc1f Data Science essentials: Why train-validation-test data? | Sagar Patel - Medium]
* [https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7 About Train, Validation and Test Sets in Machine Learning | Tarang Shah - Towards Data Science]
* [https://machinelearningmastery.com/difference-test-validation-datasets/ What is the Difference Between Test and Validation Datasets? | Jason Brownlee - Machine Learning Mastery]
* Training Dataset: The sample of data used to fit the model.
* Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
* Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. (A minimal split sketch follows this list.)
* [https://www.calypsoai.com CalypsoAI] Validating vision ML models via tests: brightness, contrast, saturation, pixilation, compression, blurring, random noise
* [https://landing.ai/ Landing AI] ...LandingLens platform includes a wide array of features to help teams develop and deploy reliable and repeatable inspection systems
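The split sketch referenced above: a minimal, hypothetical example of carving one dataset into the three sets, assuming scikit-learn and NumPy are installed. The arrays and the 60/20/20 ratio are placeholders, not a recommendation from the linked articles.
<syntaxhighlight lang="python">
# Minimal sketch: split one dataset into train / validation (dev) / test sets.
# Data and ratios are placeholders for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)               # placeholder features
y = np.random.randint(0, 2, size=1000)     # placeholder binary labels

# Hold out 20% as the test set, then split the remainder 75/25,
# which yields an overall 60/20/20 train/validation/test split.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
</syntaxhighlight>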
https://miro.medium.com/max/700/1*OJVhBtg5YgeW7rKXoxKQxg.png
{|<!-- T -->
| valign="top" |
{| class="wikitable" style="width: 550px;"
||
<youtube>Zi-0rlM4RDs</youtube>
<b>Train, Test, & Validation Sets explained
</b><br>In this video, we explain the concept of the different data sets used for training and testing an artificial neural network, including the training set, testing set, and validation set. We also show how to create and specify these data sets in code with [[Keras]].
|}
|<!-- M -->
| valign="top" |
{| class="wikitable" style="width: 550px;"
||
<youtube>MyBSkmUeIEs</youtube>
<b>Lecture 0603 Model selection and training/validation/test sets
</b><br>Machine Learning by [[Creatives#Andrew Ng|Andrew Ng]] [Coursera]
|}
|}<!-- B -->
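In the spirit of the first video above, here is a hedged [[Keras]] sketch (not the video's actual code) showing where each set plugs in during training and final evaluation. It assumes TensorFlow/Keras is installed and reuses the hypothetical X_train/X_val/X_test arrays from the split sketch earlier on this page.
<syntaxhighlight lang="python">
# Sketch: pass an explicit validation set to Keras during training, then score the
# untouched test set exactly once at the end. Model and hyperparameters are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The validation data is only used for monitoring/tuning; it never updates the weights.
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# The test set gives the final, unbiased performance estimate.
test_loss, test_acc = model.evaluate(X_test, y_test)
print(test_acc)
</syntaxhighlight>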
= If you have a training accuracy of 90% and you are not happy with that, what do you do? Collect more data? Try other algorithms? Collect more diverse data? How do you decide what to do? =
CONTENT FROM:
<b>Structuring Machine Learning Projects; Course 3 | [[Creatives#Andrew Ng|Andrew Ng]]'s Deep Learning Series</b>
* [https://www.coursera.org/specializations/deep-learning Orthogonalization (C3W1L02)] | [[Creatives#Andrew Ng|Andrew Ng]]
== Chain of assumptions in Machine Learning (ML) ==
# Fit the training set well on the [[Objective vs. Cost vs. Loss vs. Error Function|cost function]]. If this is not happening, try a bigger network or a different optimization algorithm. You should achieve human-level [[Evaluation - Measures|performance]] here.
# Fit the dev set well on the [[Objective vs. Cost vs. Loss vs. Error Function|cost function]]. If this is not happening, you are [[Overfitting Challenge|overfitting]] the training set. Try [[Regularization]], train on a bigger training set, or try a different neural network (NN) architecture.
# Fit the test set well on the [[Objective vs. Cost vs. Loss vs. Error Function|cost function]]. If the fit on the test set is much worse than the fit on the dev set, you have [[Overfitting Challenge|overfit]] the dev set. Get a bigger dev set, or try a different NN architecture.
# Perform well in the real world. If [[Evaluation - Measures|performance]] on the dev/test sets is good but [[Evaluation - Measures|performance]] in the real world is bad, check whether the [[Objective vs. Cost vs. Loss vs. Error Function|cost function]] is really what you care about.

[[Creatives#Andrew Ng|Andrew Ng]] does not like early stopping because it affects the fit on both the training and dev sets at once, which leads to confusion.
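As a rough illustration only, the sketch below turns the four links of the chain into an ordered diagnosis. The error values and the tolerance are invented for the example; real projects set their own targets.
<syntaxhighlight lang="python">
# Sketch: walk the chain of assumptions in order and report the first failing link.
# Thresholds and error values are illustrative, not prescriptive.
def diagnose(human_err, train_err, dev_err, test_err, real_world_ok, tol=0.02):
    if train_err - human_err > tol:
        return "Training fit is poor: try a bigger network or a different optimizer."
    if dev_err - train_err > tol:
        return "Overfitting the training set: try regularization, more data, or another architecture."
    if test_err - dev_err > tol:
        return "Overfit the dev set: get a bigger dev set or try another architecture."
    if not real_world_ok:
        return "Dev/test look fine but the real world does not: revisit the cost function and dev/test distribution."
    return "All four links of the chain hold."

print(diagnose(human_err=0.04, train_err=0.05, dev_err=0.10, test_err=0.11, real_world_ok=True))
# -> flags overfitting of the training set (dev error is 5 points above training error)
</syntaxhighlight>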
== Single real number evaluation metric ==
Have a single real-number metric to compare various algorithms.

You may combine precision and recall, say by taking their harmonic mean (the F1 score); a small sketch follows below.

Some metrics could be 'satisficing', e.g. the running time of classification should be within a threshold. Others would be optimizing.
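The sketch mentioned above: the F1 score is the harmonic mean of precision and recall and gives one number to rank classifiers by. The precision/recall values below are invented.
<syntaxhighlight lang="python">
# Sketch: combine precision and recall into a single number via their harmonic mean (F1).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Classifier A: high precision, low recall. Classifier B: more balanced.
print(f1(0.95, 0.60))   # ~0.735
print(f1(0.85, 0.80))   # ~0.824  -> B wins on the single metric
</syntaxhighlight>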
== Train/dev/test set distributions ==
Dev and test sets should come from the same distribution and should reflect the data you want to do well on.
When data was less abundant, a <b>60/20/20</b> training/dev/test split was a good choice. In a data-abundant neural network scenario, <b>98/1/1</b> is a good split. The test set should be just big enough to give high confidence in the overall [[Evaluation - Measures|performance]] of the system; some people omit the test set altogether.

Sometimes you may wish to change the metric midway. While building a cat vs. no-cat image classifier, it may turn out that in a "better" classifier, pornographic images get classified as cat images. In that case you need to change the [[Objective vs. Cost vs. Loss vs. Error Function|cost function]] to penalize this misclassification heavily, as sketched below.
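A sketch of such a metric change, assuming a hypothetical per-example flag that marks unacceptable images; the weight of 10 is arbitrary and only illustrates the idea of penalizing some mistakes much more than others.
<syntaxhighlight lang="python">
# Sketch: re-weight the evaluation metric so that unacceptable misclassifications
# (e.g. a pornographic image predicted as "cat") count far more than ordinary errors.
def weighted_error(y_true, y_pred, unacceptable, bad_weight=10.0):
    weights = [bad_weight if flag else 1.0 for flag in unacceptable]
    errors = [w * (t != p) for w, t, p in zip(weights, y_true, y_pred)]
    return sum(errors) / sum(weights)

y_true       = [1, 0, 1, 0]                  # 1 = cat, 0 = not cat
y_pred       = [1, 1, 1, 0]                  # one mistake: example 2 predicted as cat
unacceptable = [False, True, False, False]   # example 2 is an unacceptable image
print(weighted_error(y_true, y_pred, unacceptable))   # 10/13 ~ 0.77 instead of 1/4
</syntaxhighlight>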
== Human level performance ==
For perception problems, human-level performance is close to [[bayes|Bayes]] error. You should consider the best human-level performance possible. E.g., in radiology an expert radiologist could be better than an average radiologist, and a team of experts may be better than a single expert. Use the definition that gives the lowest possible error.

* The difference between 0 and human-level performance is the [[bayes|Bayes]] error.
* The difference between human-level performance and training error is the avoidable [[Bias and Variances|bias]].
* The difference between training error and dev error is the [[Bias and Variances|variance]].
* The difference between dev error and test error is [[Overfitting Challenge|overfitting]] to the dev set.

You should compute all of these gaps; they will help you decide how to improve your algorithm (see the sketch below).
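The sketch referenced above, with made-up error rates; the largest gap points at the most promising thing to work on next.
<syntaxhighlight lang="python">
# Sketch: decompose the measured errors into the gaps described above (values are made up).
human_err, train_err, dev_err, test_err = 0.01, 0.03, 0.08, 0.12

gaps = {
    "avoidable bias (human -> train)":   train_err - human_err,
    "variance (train -> dev)":           dev_err - train_err,
    "dev-set overfitting (dev -> test)": test_err - dev_err,
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2%}")

print("focus on:", max(gaps, key=gaps.get))   # -> variance (train -> dev)
</syntaxhighlight>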
Tasks where machines can outperform humans: online ads, loan approvals, product recommendations, logistics (structured data, not natural perception problems).

Also, in some [[Speech Recognition]], image recognition, and radiology tasks, computers surpass single-human performance.
== Error Analysis ==
When the training error is not good enough, manually examine mispredictions. Examine a subset of the mispredictions and determine the reason for each error. Is it that dogs are being mislabeled as cats? That lions/cheetahs are mislabeled as cats? That blurry images are mislabeled as cats? Figure out the most prominent reason and try to solve that first; if lots of dogs are being mislabeled as cats, it makes sense to put more dog images in the training set (a small tally sketch follows below).

Sometimes the data has mislabeled examples. Some mislabels in the training set are okay, because NN algorithms are robust to them as long as the errors are random. For the dev/test sets, first estimate how much of a boost you would get by correcting the labels, and correct them if that boost is worthwhile. If you fix the dev set, fix the test set too. Ideally you should also examine the examples your algorithm got right only because of a label error, but this is not easy for accurate algorithms, as there would be a large number of items to examine.
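The tally sketch referenced above: the tags and counts are hypothetical, assigned by hand while inspecting a sample of mispredicted dev-set images. Each tag's share caps how much you could gain by fixing only that problem.
<syntaxhighlight lang="python">
# Sketch: tally hand-assigned reasons for a sample of dev-set mispredictions.
from collections import Counter

tags = (["dog mislabeled as cat"] * 43
        + ["big cat (lion/cheetah) mislabeled as cat"] * 27
        + ["blurry image"] * 22
        + ["incorrect ground-truth label"] * 8)

counts = Counter(tags)
total = len(tags)
for reason, n in counts.most_common():
    print(f"{reason}: {n / total:.0%} of examined errors")
# If dev error is 10%, fixing only the dog problem can cut it by at most ~4.3 points.
</syntaxhighlight>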
== Build first, then iterate ==
You understand the data and the challenges only when you iterate. Build the first system quickly and use [[Bias and Variances|bias/variance]] analysis to prioritize the next steps.
== Mismatched training and dev/test set ==
DL algorithms are data hungry, and teams want to shove in as much data as they can get hold of. For example, you can get images from the internet, or you can purchase data. You can use data from various sources to train, but the dev/test sets should contain only examples that are representative of your use case.

When your training and dev sets come from different distributions, the difference between training error and dev error may not reflect [[Bias and Variances|variance]]; it may just be that the training set is easier. To catch this, you can carve a training-dev set out of the training set. Now:
* The difference between training error and training-dev error is the [[Bias and Variances|variance]].
* The difference between training-dev error and dev error is a measure of the mismatch between the training and dev/test data.

What if you have a data mismatch problem? Perform manual inspection. Maybe a lot of the dev/test examples are noisy (in a [[Speech Recognition]] system, for instance). In that case you can add noise to the training set, but be careful: if you have 10K hours of training data, you should add 10K hours of varied noise too. If you just repeat 1 hour of noise, you will [[Overfitting Challenge|overfit]] to it; to the human ear all the noise sounds the same, but the machine will [[Overfitting Challenge|overfit]]. Similarly, for computer vision you can synthesize images with background cars, etc.
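A sketch of the training-dev carve-out described above. The split reuses the hypothetical X_train/y_train arrays from earlier on this page, and the error numbers are invented to show how the gaps are read.
<syntaxhighlight lang="python">
# Sketch: hold out a slice of the *training* distribution as a "training-dev" set,
# then compare errors to separate variance from train-vs-dev data mismatch.
from sklearn.model_selection import train_test_split

X_train, X_train_dev, y_train, y_train_dev = train_test_split(
    X_train, y_train, test_size=0.05, random_state=0)   # carve 5% out of the training data

# Suppose the measured errors come out as:
train_err, train_dev_err, dev_err = 0.02, 0.03, 0.10

variance = train_dev_err - train_err   # same distribution, unseen data  -> 1 point
mismatch = dev_err - train_dev_err     # different distribution          -> 7 points
print(f"variance: {variance:.2%}, data mismatch: {mismatch:.2%}")
# Here the dominant problem is data mismatch, not variance.
</syntaxhighlight>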
{|<!-- T -->
| valign="top" |
{| class="wikitable" style="width: 550px;"
||
<youtube>UEtvV1D6B3s</youtube>
<b>Orthogonalization (C3W1L02)
</b><br>[[Creatives#Andrew Ng|Andrew Ng]]. Take the Deep Learning Specialization: https://bit.ly/2IohyTa Check out all our courses: https://www.deeplearning.ai
|}
|}<!-- B -->