Difference between revisions of "Benchmarks"

m (Natural Language Processing (NLP) Evaluation =)
m
 
(46 intermediate revisions by the same user not shown)
Line 30: Line 30:
 
* [[Data Quality]] ...[[AI Verification and Validation|validity]], [[Evaluation - Measures#Accuracy|accuracy]], [[Data Quality#Data Cleaning|cleaning]], [[Data Quality#Data Completeness|completeness]], [[Data Quality#Data Consistency|consistency]], [[Data Quality#Data Encoding|encoding]], [[Data Quality#Zero Padding|padding]], [[Data Quality#Data Augmentation, Data Labeling, and Auto-Tagging|augmentation, labeling, auto-tagging]], [[Data Quality#Batch Norm(alization) & Standardization| normalization, standardization]], and [[Data Quality#Imbalanced Data|imbalanced data]]
 
* [[Data Quality]] ...[[AI Verification and Validation|validity]], [[Evaluation - Measures#Accuracy|accuracy]], [[Data Quality#Data Cleaning|cleaning]], [[Data Quality#Data Completeness|completeness]], [[Data Quality#Data Consistency|consistency]], [[Data Quality#Data Encoding|encoding]], [[Data Quality#Zero Padding|padding]], [[Data Quality#Data Augmentation, Data Labeling, and Auto-Tagging|augmentation, labeling, auto-tagging]], [[Data Quality#Batch Norm(alization) & Standardization| normalization, standardization]], and [[Data Quality#Imbalanced Data|imbalanced data]]
 
* [[Natural Language Processing (NLP)#Managed Vocabularies |Managed Vocabularies]]
 
* [[Natural Language Processing (NLP)#Managed Vocabularies |Managed Vocabularies]]
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database]] ... [[Graph]] ... [[LlamaIndex]]
+
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
 
* [[Analytics]] ... [[Visualization]] ... [[Graphical Tools for Modeling AI Components|Graphical Tools]] ... [[Diagrams for Business Analysis|Diagrams]] & [[Generative AI for Business Analysis|Business Analysis]] ... [[Requirements Management|Requirements]] ... [[Loop]] ... [[Bayes]] ... [[Network Pattern]]
 
* [[Analytics]] ... [[Visualization]] ... [[Graphical Tools for Modeling AI Components|Graphical Tools]] ... [[Diagrams for Business Analysis|Diagrams]] & [[Generative AI for Business Analysis|Business Analysis]] ... [[Requirements Management|Requirements]] ... [[Loop]] ... [[Bayes]] ... [[Network Pattern]]
* [[Development]] ... [[Notebooks]] ... [[Development#AI Pair Programming Tools|AI Pair Programming]] ... [[Codeless Options, Code Generators, Drag n' Drop|Codeless, Generators, Drag n' Drop]] ... [[Algorithm Administration#AIOps/MLOps|AIOps/MLOps]] ... [[Platforms: AI/Machine Learning as a Service (AIaaS/MLaaS)|AIaaS/MLaaS]]
+
* [[Development]] ... [[Notebooks]] ... [[Development#AI Pair Programming Tools|AI Pair Programming]] ... [[Codeless Options, Code Generators, Drag n' Drop|Codeless]] ... [[Hugging Face]] ... [[Algorithm Administration#AIOps/MLOps|AIOps/MLOps]] ... [[Platforms: AI/Machine Learning as a Service (AIaaS/MLaaS)|AIaaS/MLaaS]]
 
* [[Backpropagation]] ... [[Feed Forward Neural Network (FF or FFNN)|FFNN]] ... [[Forward-Forward]] ... [[Activation Functions]] ...[[Softmax]] ... [[Loss]] ... [[Boosting]] ... [[Gradient Descent Optimization & Challenges|Gradient Descent]] ... [[Algorithm Administration#Hyperparameter|Hyperparameter]] ... [[Manifold Hypothesis]] ... [[Principal Component Analysis (PCA)|PCA]]
 
* [[Backpropagation]] ... [[Feed Forward Neural Network (FF or FFNN)|FFNN]] ... [[Forward-Forward]] ... [[Activation Functions]] ...[[Softmax]] ... [[Loss]] ... [[Boosting]] ... [[Gradient Descent Optimization & Challenges|Gradient Descent]] ... [[Algorithm Administration#Hyperparameter|Hyperparameter]] ... [[Manifold Hypothesis]] ... [[Principal Component Analysis (PCA)|PCA]]
 
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
 
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
Line 38: Line 38:
 
** [[Evaluation - Measures#Precision & Recall (Sensitivity)|Precision & Recall (Sensitivity)]]
 
** [[Evaluation - Measures#Precision & Recall (Sensitivity)|Precision & Recall (Sensitivity)]]
 
** [[Evaluation - Measures#Specificity|Specificity]]
 
** [[Evaluation - Measures#Specificity|Specificity]]
* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Optimizer]] ... [[Train, Validate, and Test]]
+
* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Train, Validate, and Test]]
 
* [https://www.aitrends.com/ai-insider/machine-learning-benchmarks-and-ai-self-driving-cars/ Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends]
 
* [https://www.aitrends.com/ai-insider/machine-learning-benchmarks-and-ai-self-driving-cars/ Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends]
 
* [https://towardsdatascience.com/benchmarking-simple-machine-learning-models-with-feature-extraction-against-modern-black-box-80af734b31cc Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science]
 
* [https://towardsdatascience.com/benchmarking-simple-machine-learning-models-with-feature-extraction-against-modern-black-box-80af734b31cc Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science]
Line 45: Line 45:
 
* [https://www.sciencedirect.com/science/article/pii/S1532046418300716 Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Chea, and Y. Liu]
 
* [https://www.sciencedirect.com/science/article/pii/S1532046418300716 Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Chea, and Y. Liu]
 
* [https://spectrum.ieee.org/ai-supercomputer#toggle-gdpr Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum]
 
* [https://spectrum.ieee.org/ai-supercomputer#toggle-gdpr Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum]
 +
* [https://towardsdatascience.com/the-olympics-of-ai-benchmarking-machine-learning-systems-c4b2051fbd2b The Olympics of AI: Benchmarking Machine Learning Systems | Matthew Stewart - Towards Data Science - Medium] ... How do benchmarks birth breakthroughs?
  
 +
 +
<hr><center><b><i>
 +
 +
You can’t improve what you don’t measure.</i></b> — Peter Drucker
 +
 +
</center><hr>
  
  
Line 56: Line 63:
 
[https://www.bing.com/news/search?q=ai+~Consciousness+test&qft=interval%3d%228%22 ...Bing News]
 
[https://www.bing.com/news/search?q=ai+~Consciousness+test&qft=interval%3d%228%22 ...Bing News]
  
* [[Singularity]] ... [[Artificial Consciousness / Sentience|Sentience]] ... [[Artificial General Intelligence (AGI)| AGI]] ... [[Inside Out - Curious Optimistic Reasoning| Curious Reasoning]] ... [[Emergence]] ... [[Moonshots]] ... [[Explainable / Interpretable AI|Explainable AI]] ... [[Algorithm Administration#Automated Learning|Automated Learning]]
+
* [[Artificial General Intelligence (AGI) to Singularity]] ... [[Inside Out - Curious Optimistic Reasoning| Curious Reasoning]] ... [[Emergence]] ... [[Moonshots]] ... [[Explainable / Interpretable AI|Explainable AI]] ... [[Algorithm Administration#Automated Learning|Automated Learning]]
 
* [[Artificial Consciousness / Sentience#Theory of Mind (ToM)|Theory of Mind (ToM)]]
 
* [[Artificial Consciousness / Sentience#Theory of Mind (ToM)|Theory of Mind (ToM)]]
 +
* [[Perspective]] ... [[Context]] ... [[In-Context Learning (ICL)]] ... [[Transfer Learning]] ... [[Out-of-Distribution (OOD) Generalization]]
 +
* [[Causation vs. Correlation]] ... [[Autocorrelation]] ...[[Convolution vs. Cross-Correlation (Autocorrelation)]]
  
 
== <span id="Turing Test"></span>Turing Test ==
 
== <span id="Turing Test"></span>Turing Test ==
Line 98: Line 107:
 
|}
 
|}
 
|}<!-- B -->
 
|}<!-- B -->
 +
 +
== <span id="Chinese Room Thought Experiment"></span>Chinese Room Thought Experiment ==
 +
* [[Creatives#John Searle|John Searle]]
 +
 +
The Chinese Room Experiment is a thought experiment proposed by John Searle in 1980 to argue against the claim that a computer can have a mind or be conscious. Searle's argument has been criticized by some philosophers and computer scientists. However, it remains a powerful argument against the claim that computers can have a mind or be conscious.
 +
 +
In the experiment, Searle imagines himself locked in a room with a set of rules for manipulating Chinese symbols. The rules are written in English, which Searle understands, but the Chinese symbols are meaningless to him. He is given Chinese characters on slips of paper, which he then processes according to the rules. He then produces Chinese characters on slips of paper in response. To an outside observer, it would appear that Searle understands Chinese and is having a conversation with them. However, Searle himself does not understand Chinese at all. He is simply following the rules blindly.
 +
 +
Searle argues that this shows that a computer, which is essentially a machine that follows rules, cannot be said to understand Chinese or to have a mind. The computer may be able to produce intelligent-sounding output, but it does not have the same kind of understanding that a human being has.
 +
 +
The Chinese Room Experiment has been widely discussed and debated by philosophers and computer scientists. Some have argued that Searle's argument is flawed, while others have agreed with his conclusion. The Chinese Room Experiment is a complex and challenging thought experiment, and there is no easy answer to the question of whether or not it succeeds in its goal. However, it is a thought-provoking experiment that has helped to shape the debate about artificial intelligence and the nature of mind. Here are some of the key points of Searle's argument:
 +
 +
* Understanding a language is not just about manipulating symbols according to rules. It also requires having a grasp of the meaning of the symbols.
 +
* A computer can manipulate symbols, but it does not have the same kind of understanding that a human being has.
 +
* The Chinese Room Experiment shows that a computer cannot be said to understand Chinese, even if it can produce intelligent-sounding output.
 +
 +
<youtube>tBE06SdgzwM</youtube>
 +
<youtube>rHKwIYsPXLg</youtube>
  
 
= <span id="Large Language Model (LLM) Evaluation"></span>Large Language Model (LLM) Evaluation =
 
= <span id="Large Language Model (LLM) Evaluation"></span>Large Language Model (LLM) Evaluation =
Line 106: Line 133:
 
[https://www.bing.com/news/search?q=Evaluat+LLM+Large+Language+Model+Harness+framework&qft=interval%3d%228%22 ...Bing News]
 
[https://www.bing.com/news/search?q=Evaluat+LLM+Large+Language+Model+Harness+framework&qft=interval%3d%228%22 ...Bing News]
  
 +
* [[Large Language Model (LLM)]]
 +
* [[Conversational AI]] ... [[ChatGPT]] | [[OpenAI]] ... [[Bing/Copilot]] | [[Microsoft]] ... [[Gemini]] | [[Google]] ... [[Claude]] | [[Anthropic]] ... [[Perplexity]] ... [[You]] ... [[phind]] ... [[Ernie]] | [[Baidu]]
 
* [[Claude]] | [[Anthropic]]
 
* [[Claude]] | [[Anthropic]]
 
* [[Large Language Model (LLM)#LLM Token / Parameter / Weight|LLM Token / Parameter / Weight]]
 
* [[Large Language Model (LLM)#LLM Token / Parameter / Weight|LLM Token / Parameter / Weight]]
* [[In-Context Learning (ICL)]] ... [[Context]]
+
* [[In-Context Learning (ICL)]] ... [[Context]] ... [[Causation vs. Correlation]] ... [[Autocorrelation]] ... [[Out-of-Distribution (OOD) Generalization]] ... [[Transfer Learning]]
 
* [https://crfm.stanford.edu/helm/latest/ Holistic Evaluation of Language Models (HELM) | Stanford] ... a living benchmark that aims to improve the transparency of language models.
 
* [https://crfm.stanford.edu/helm/latest/ Holistic Evaluation of Language Models (HELM) | Stanford] ... a living benchmark that aims to improve the transparency of language models.
 
* [https://www.mosaicml.com/blog/llm-evaluation-for-icl Blazingly Fast LLM Evaluation for In-Context Learning | Jeremy Dohmann - Mosaic]
 
* [https://www.mosaicml.com/blog/llm-evaluation-for-icl Blazingly Fast LLM Evaluation for In-Context Learning | Jeremy Dohmann - Mosaic]
 
* [https://github.com/openai/evals Evals - GitHub] ... a framework for evaluating LLMs (large language models) or systems built using LLMs as components.
 
* [https://github.com/openai/evals Evals - GitHub] ... a framework for evaluating LLMs (large language models) or systems built using LLMs as components.
 
* [https://wandb.ai/wandb_gen/llm-evaluation/reports/Evaluating-Large-Language-Models-LLMs-with-Eleuther-AI--VmlldzoyOTI0MDQ3 Evaluating Large Language Models (LLMs) with Eleuther AI | Bharat Ramanathan - Weights & Biases] ... With a flexible and tokenization-agnostic interface, the lm-eval library provides a single framework for evaluating and reporting auto-regressive language models on various [[Natural Language Processing (NLP)#Natural Language Understanding (NLU)| Natural Language Understanding (NLU)]] tasks. There are currently over 200 evaluation tasks that support the evaluation of models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.
 
* [https://wandb.ai/wandb_gen/llm-evaluation/reports/Evaluating-Large-Language-Models-LLMs-with-Eleuther-AI--VmlldzoyOTI0MDQ3 Evaluating Large Language Models (LLMs) with Eleuther AI | Bharat Ramanathan - Weights & Biases] ... With a flexible and tokenization-agnostic interface, the lm-eval library provides a single framework for evaluating and reporting auto-regressive language models on various [[Natural Language Processing (NLP)#Natural Language Understanding (NLU)| Natural Language Understanding (NLU)]] tasks. There are currently over 200 evaluation tasks that support the evaluation of models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.
 
  
 
Benchmarks for an LLM:
 
Benchmarks for an LLM:
Line 124: Line 152:
 
* <b>Code Generation on HumanEval</b>: This test evaluates the LLM's ability to write code that is functionally correct, checked by running unit tests against each generated solution. Claude 2 on Python coding test: 71.2%
 
* <b>Code Generation on HumanEval</b>: This test evaluates the LLM's ability to write code that is functionally correct, checked by running unit tests against each generated solution. Claude 2 on Python coding test: 71.2%
 
* <b>GSM8k math problem set</b>: This problem set evaluates the LLM's ability to solve grade-school math word problems that require multi-step reasoning. Claude 2: 88%
 
* <b>GSM8k math problem set</b>: This problem set evaluates the LLM's ability to solve grade-school math word problems that require multi-step reasoning. Claude 2: 88%
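
Code-generation scores such as the HumanEval figure above are usually reported as pass@k: the chance that at least one of k sampled solutions passes the unit tests. The following is a minimal sketch of that idea, not the official harness. The helper names <code>passes_tests</code> and <code>pass_at_k</code> are hypothetical, and running generated code with <code>exec</code> is only acceptable here because the example is a toy; a real evaluation would sandbox it.

```python
from math import comb


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and all test assertions hold.

    Assumption: candidate and tests are trusted toy strings. Real harnesses
    run untrusted model output in an isolated sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run assertions against them
    except Exception:
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no passing solution.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every subset of size k contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one sample per problem (k = 1), the estimate reduces to the fraction of samples that pass, which is how a single percentage score can be read.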
 +
  
 
There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:  
 
There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:  
Line 137: Line 166:
 
* text operations
 
* text operations
  
== Backtracking ==
+
== A Survey on Evaluation of Large Language Models ==
Backtracking is a general algorithmic technique that searches the space of possible combinations in order to solve a computational problem. It incrementally builds candidate solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
+
* [https://arxiv.org/pdf/2307.03109.pdf A Survey on Evaluation of Large Language Models | Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. Yu, Q. Yang, X. Xie - arXiv] ... The survey covers seven major categories of LLM trustworthiness:
 +
** <b>Reliability</b>: LLMs should be able to consistently generate accurate and truthful outputs, even when presented with new or challenging inputs.
 +
** <b>Safety</b>: LLMs should not generate outputs that are harmful or dangerous, such as outputs that promote violence or hate speech.
 +
** <b>Fairness</b>: LLMs should not discriminate against any individual or group of individuals, regardless of their race, gender, sexual orientation, or other protected characteristics.
 +
** <b>Resistance to misuse</b>: LLMs should be designed in a way that makes it difficult for them to be used for malicious purposes, such as generating fake news or propaganda.
 +
** <b>Explainability and reasoning</b>: LLMs should be able to explain the reasoning behind their outputs, so that users can understand how they work and make informed decisions about how to use them.
 +
** <b>Adherence to social norms</b>: LLMs should generate outputs that are consistent with social norms and values, for example by avoiding offensive language and not promoting harmful stereotypes.
 +
** <b>Robustness</b>: LLMs should be able to withstand attacks and manipulation, such as being fed deliberately misleading or harmful data.
 +
 
 +
 
 +
== Popular Benchmarks for Testing LLMs ==
 +
 
 +
* <b>[https://leaderboard.allenai.org/arc/submissions/public AI2 Reasoning Challenge (ARC)]</b>: designed to promote research in advanced question-answering, particularly questions that require reasoning. The ARC dataset consists of 7,787 science exam questions from grade 3 to grade 9, with a supporting knowledge base of 14.3M unstructured text passages. The benchmark evaluates the performance of LLMs in answering multiple-choice questions.
 +
* <b>[https://winogrande.allenai.org/ WinoGrande]</b>: evaluates the ability of LLMs to perform commonsense reasoning. The benchmark consists of 44,000 examples that require the model to understand the meaning of words in context and to reason about the relationships between entities.
 +
* <b>[https://leaderboard.allenai.org/arb Advanced Reasoning Benchmark (ARB)]</b>: evaluates the ability of LLMs to perform complex reasoning tasks. The benchmark consists of 1,000 examples that require the model to perform multi-step reasoning and to integrate information from multiple sources.
 +
* <b>[https://huggingface.co/datasets/holistic_evaluation_of_language_models Holistic Evaluation of Language Models (HELM)]</b>: evaluates the performance of LLMs in multiple tasks, including language modeling, question answering, and summarization. The benchmark consists of 57 datasets covering a wide range of tasks and domains.
 +
* <b>[https://github.com/google-research/big-bench BIG-bench]</b>: evaluates the performance of LLMs in a wide range of tasks, including language modeling, question answering, and summarization. The benchmark consists of more than 200 diverse tasks, many of which require the model to perform complex reasoning and to integrate information from multiple sources.
 +
* <b>[https://github.com/google-research/mmlu Massive Multitask Language Understanding (MMLU)]</b>: evaluates the knowledge and problem-solving ability of LLMs with multiple-choice questions. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and other areas, ranging from elementary to professional difficulty.
 +
* <b>[https://rajpurkar.github.io/SQuAD-explorer/ SQuAD]</b>: tests LLMs on their ability to answer questions about a given passage of text. The SQuAD dataset is a collection of questions and answers that are created by crowdworkers on a set of Wikipedia articles.
 +
* <b>[https://gluebenchmark.com/ GLUE]</b>: tests LLMs on a variety of natural language understanding tasks, including sentiment analysis, text classification, and question answering.
 +
* <b>[https://super.gluebenchmark.com/ SuperGLUE]</b>: a successor to the GLUE benchmark that includes more challenging tasks. For reference, the tasks in the original GLUE benchmark are:
 +
** CoLA: Corpus of Linguistic Acceptability
 +
** SST-2: Stanford Sentiment Treebank
 +
** MRPC: Microsoft Research Paraphrase Corpus
 +
** STS-B: Semantic Textual Similarity Benchmark
 +
** QQP: Quora Question Pairs
 +
** MNLI: MultiNLI
 +
** QNLI: Question Natural Language Inference
 +
** RTE: Recognizing Textual Entailment
 +
** WNLI: Winograd Natural Language Inference (derived from the Winograd Schema Challenge)
 +
** AX: GLUE diagnostic dataset (broad-coverage linguistic phenomena)
 +
 
  
 +
== Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks ==
 
<youtube>Big_txmH7Rc</youtube>
 
<youtube>Big_txmH7Rc</youtube>
<youtube>Wc7dcwF7QaA</youtube>
+
 
 +
== Evaluating Large Language Models on Legal Reasoning ==
 +
* [https://arxiv.org/abs/2308.11462 LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models | Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li]
 +
 
 +
<b>LegalBench:</b> The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
  
 
= <span id="Natural Language Processing (NLP) Evaluation"></span>Natural Language Processing (NLP) Evaluation =
 
= <span id="Natural Language Processing (NLP) Evaluation"></span>Natural Language Processing (NLP) Evaluation =
 +
* [[Natural Language Processing (NLP)]] ... [[Natural Language Generation (NLG)|Generation]] ... [[Natural Language Classification (NLC)|Classification]] ... [[Natural Language Processing (NLP)#Natural Language Understanding (NLU)|Understanding]] ... [[Language Translation|Translation]] ...  [[Natural Language Tools & Services|Tools & Services]]
  
 
== <span id="GLUE"></span>General Language Understanding Evaluation (GLUE) ==
 
== <span id="GLUE"></span>General Language Understanding Evaluation (GLUE) ==
Line 184: Line 250:
 
|}<!-- B -->
 
|}<!-- B -->
  
= Machine Learning Model =
+
= Machine Learning Evaluation =
<img src="https://www.researchgate.net/profile/Benoit_Gallix/publication/324457640/figure/fig1/AS:622298201595905@1525378861825/Graph-illustrating-the-impact-of-data-available-on-performance-of-traditional-machine.png" width="500" height="400">
+
 
 +
<youtube>WlXhpXv9kDU</youtube>
 +
<youtube>wpQiEHYkBys</youtube>
 +
<youtube>lgK0BlXdOCw</youtube>
 +
 
 +
 
 +
== Procgen ==
 +
* [[OpenAI]]
 +
* [https://venturebeat.com/2019/12/03/openais-procgen-benchmark-overfitting/ OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat] a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
 +
 
 +
https://venturebeat.com/wp-content/uploads/2019/12/ezgif-4-3630016ea205.gif
 +
 
 +
[[OpenAI]] previously released [https://venturebeat.com/2019/03/04/openai-launches-neural-mmo-a-massive-reinforcement-learning-simulator/ Neural MMO], a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available [https://venturebeat.com/2019/11/21/openai-safety-gym/ SafetyGym], a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
 +
 
 +
= Human Evaluation =
 +
<b>CAPTCHA</b> stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It's a security measure that helps protect users from spam and password decryption by verifying that a user is human and not a computer.
 +
 
 +
<youtube>9k-uPSEGl-c</youtube>
 +
 
 +
== I'm not a robot ==
 +
* [https://youtube.com/shorts/rme6PT7-CRI?si=JQnH2Xs6rkIpUPND  I'm not a robot ... explained]
  
 +
No CAPTCHA reCAPTCHA: Popularized by Google, this involves a simple checkbox labeled "I am not a robot." It works by analyzing user behavior, such as mouse movements, to determine if the user is human. If the test is inconclusive, a more traditional image selection CAPTCHA is presented. When you click the checkbox, reCAPTCHA monitors:
  
{|<!-- T -->
+
* Mouse movements: Human mouse movements tend to be unpredictable, while bots often exhibit linear or mechanical movements.
| valign="top" |
+
* Click timing: Humans have natural delays in their actions, while bots execute them at near-instantaneous speeds.
{| class="wikitable" style="width: 550px;"
+
 
||
+
== Improving AI while trying to outsmart it ==
<youtube>YygGzfkhtJc</youtube>
+
The so-called "bot test" (a CAPTCHA in which users identify objects in images or complete other seemingly trivial tasks) has a dual purpose. While it's meant to distinguish between humans and bots, the data collected often helps train AI systems to improve at tasks like image recognition, text understanding, or problem-solving. At the same time, the effectiveness of CAPTCHAs is constantly being challenged by advancements in artificial intelligence and machine learning. Recent research has demonstrated that advanced AI can solve image-based CAPTCHAs, such as Google's reCAPTCHA v2, with a 100% success rate using YOLO models for image segmentation and classification. This highlights the need for CAPTCHA systems to evolve in response to AI advancements.
<b>Lecture 13 – Evaluation Metrics | Stanford CS224U: Natural Language Understanding | Spring 2019
+
 
</b><br>Professor Christopher Potts  Professor of Linguistics and, by courtesy, Computer Science  Director, Stanford Center for the Study of Language and Information Consulting Assistant Professor Bill MacCartney
+
In a way, humans doing these tests are teaching the bots to get better at beating the tests themselves. It's a fascinating cycle of humans improving AI while trying to outsmart it! Irony at its finest.
Senior Engineering Manager, [[Apple]]
 
|}
 
|<!-- M -->
 
| valign="top" |
 
{| class="wikitable" style="width: 550px;"
 
||
 
<youtube>WlXhpXv9kDU</youtube>
 
<b>Kaggle Reading Group : An Open Source AutoML Benchmark | Kaggle
 
</b><br>This week we're starting a new paper: An Open Source AutoML Benchmark by Gijsbers et al from the 2019 ICML Workshop on Automated Machine Learning.
 
|}
 
|}<!-- B -->
 
  
 +
== Future Prospects and Innovations ==
 +
The future of human evaluation CAPTCHA techniques is likely to be shaped by ongoing technological advancements and the need to balance security with user experience. Some promising developments include:
 +
* Advanced AI and Machine Learning Techniques: As AI becomes more sophisticated in solving CAPTCHAs, new techniques are being developed to create bot-resistant challenges that can adapt to evolving bot strategies.
 +
* Invisible CAPTCHA: Google's reCAPTCHA v3 represents a significant innovation by eliminating visible challenges for users. Instead, it continuously monitors user behavior to assess the likelihood of a bot interaction, providing a score between 0 and 1.
 +
* Cognitive Deep-Learning CAPTCHA: A 2023 study introduced a new CAPTCHA system that combines text-based, image-based, and cognitive CAPTCHA characteristics. This system employs adversarial examples and neural style transfer to enhance security, making it more resistant to automated attacks.
 +
* Behavioral Analysis and Biometric Verification: Innovations are exploring the use of behavioral analysis to distinguish human actions from bot interactions without explicit challenges. Biometric identification is also being considered for seamless user authentication, leveraging unique user characteristics.
 +
* AI-Powered Solutions: AI algorithms are being developed to create bot-resistant challenges that can adapt to evolving bot strategies. This includes employing AI to design intelligent algorithms that better distinguish bot activity from human input.
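
A score-based check such as reCAPTCHA v3 leaves the decision to the site: scores near 1 indicate a likely human and scores near 0 a likely bot. The sketch below shows one way a site might act on such a score. The function name <code>decide</code>, the action labels, and both thresholds are hypothetical choices for illustration; they are not part of any Google API.

```python
def decide(score: float, allow_at: float = 0.7, challenge_at: float = 0.3) -> str:
    """Map a risk score in [0, 1] to an action.

    Assumptions (illustrative only): 1.0 means very likely human,
    0.0 means very likely bot, and the two thresholds are tuned per site.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= allow_at:
        return "allow"       # confident enough to skip any visible challenge
    if score >= challenge_at:
        return "challenge"   # inconclusive: fall back to a visible CAPTCHA
    return "block"           # very likely automated traffic
```

Keeping a middle "challenge" band mirrors the checkbox flow described earlier, where an inconclusive result falls back to an image-selection test instead of rejecting the user outright.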
  
{|<!-- T -->
+
= Evaluating Machine Learning (ML) Hardware, Software, and Services =
| valign="top" |
+
* [[What is Artificial Intelligence (AI)? | Artificial Intelligence (AI)]] ... [[Machine Learning (ML)]] ... [[Deep Learning]] ... [[Neural Network]] ... [[Reinforcement Learning (RL)|Reinforcement]] ... [[Learning Techniques]]
{| class="wikitable" style="width: 550px;"
 
||
 
<youtube>wpQiEHYkBys</youtube>
 
<b>Machine Learning Model Evaluation Metrics
 
</b><br>MARIA KHALUSOVA | DEVELOPER ADVOCATE AT JETBRAINS Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you’ll ultimately use. Those coming to ML from software [[development]] are often self-taught, but practice exercises and competitions generally dictate the evaluation metric. In a real-world scenario, how do you choose an appropriate metric? This talk will explore the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision.
 
|}
 
|<!-- M -->
 
| valign="top" |
 
{| class="wikitable" style="width: 550px;"
 
||
 
<youtube>lgK0BlXdOCw</youtube>
 
<b>Characterization and Benchmarking of Deep Learning
 
</b><br>In this video from the HPC User Forum in Milwaukee, Natalia Vassilieva from HP Labs presents: Characterization and Benchmarking of Deep Learning.
 
|}
 
|}<!-- B -->
 
  
= Measuring training and inference performance of ML hardware, software, and services =
 
 
== <span id="MLPerf"></span>MLPerf ==
 
== <span id="MLPerf"></span>MLPerf ==
 
* [https://mlperf.org/ MLPerf] benchmarks for measuring training and inference performance of ML hardware, software, and services.
 
* [https://mlperf.org/ MLPerf] benchmarks for measuring training and inference performance of ML hardware, software, and services.
Line 262: Line 329:
 
<youtube>sH03-InVba4</youtube>
 
<youtube>sH03-InVba4</youtube>
 
<b>Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
 
<b>Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
</b><br>Wes Vaske  This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track.  In it I discuss some benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite.  AIML-301-1:Using AI/ML for Flash Performance Scaling, Part 1 -  
+
</b><br>Wes Vaske  This is the presentation I gave at Flash [[Memory]] Summit 2019 in the AI/ML track.  In it I discuss some benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite.  AIML-301-1:Using AI/ML for Flash Performance Scaling, Part 1 -  
 
|}
 
|}
 
|}<!-- B -->
 
|}<!-- B -->
  
= Procgen =
+
= Backtracking =
* [[OpenAI]]
+
Backtracking is a general algorithmic technique that searches the space of possible combinations in order to solve a computational problem. It builds candidate solutions incrementally and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
* [https://venturebeat.com/2019/12/03/openais-procgen-benchmark-overfitting/ OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat] a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
 
  
https://venturebeat.com/wp-content/uploads/2019/12/ezgif-4-3630016ea205.gif
 
  
[[OpenAI]] previously released [https://venturebeat.com/2019/03/04/openai-launches-neural-mmo-a-massive-reinforcement-learning-simulator/ Neural MMO], a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available [https://venturebeat.com/2019/11/21/openai-safety-gym/ SafetyGym], a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
+
<youtube>Big_txmH7Rc</youtube>
 +
<youtube>Wc7dcwF7QaA</youtube>
  
  
Line 283: Line 349:
 
<youtube>Ty9n4XGe6WA</youtube>
 
<youtube>Ty9n4XGe6WA</youtube>
 
<youtube>SXSyFFrfzMM</youtube>
 
<youtube>SXSyFFrfzMM</youtube>
<youtube>EOcD2q37qgc</youtube>
 
<youtube>1LP0jGGEYSI</youtube>
 

Latest revision as of 08:57, 22 November 2024

YouTube ... Quora ...Google search ...Google News ...Bing News



You can’t improve what you don’t measure. — Peter Drucker



AI Consciousness Testing


Turing Test

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia



Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil



What is a Turing Test? A Brief History of the Turing Test and its Impact
What is a Turing Test

Is a computer as smart as a human? Only a Turing Test will tell -- plus its many spin-offs. A Turing Test is a method of determining whether a computer is capable of thinking like a human. Watch to learn what a Turing Test is and how it relates to AI technology.

Will ChatGPT Pass The Turing Test? Let's Find Out!
I have been testing ChatGPT for the past few days and it has been nothing short of spectacular. Now the moment of truth is upon us: Will it pass the Turing Test? Find out in this video. Will it exhibit intelligence that will fool humans into thinking it's not a machine? You'd be surprised!

ChatGPT says:

"The Turing Test is a measure of a machine's ability to exhibit intelligent behavior that is indistinguishable from that of a human. It was first proposed by the British mathematician and computer scientist Alan Turing in 1950. The basic idea of the test is that a human evaluator engages in a text-based conversation with both a human and a machine, without knowing which is which. If the evaluator is unable to reliably determine which is the human and which is the machine, then the machine is said to have passed the Turing Test and demonstrated human-like intelligence.

The Turing Test has become an influential concept in the field of artificial intelligence and continues to be an active area of research and development. While some AI systems have been able to fool evaluators into thinking they are human in limited cases, no machine has yet passed the Turing Test in a comprehensive and sustained manner. Nonetheless, the Turing Test remains a useful benchmark for evaluating the progress of AI and a means for stimulating discussion about the nature of human intelligence and the potential for machines to possess similar capabilities."

Chinese Room Thought Experiment

The Chinese Room Experiment is a thought experiment proposed by John Searle in 1980 to argue against the claim that a computer can have a mind or be conscious. Searle's argument has been criticized by some philosophers and computer scientists. However, it remains a powerful argument against the claim that computers can have a mind or be conscious.

In the experiment, Searle imagines himself locked in a room with a set of rules for manipulating Chinese symbols. The rules are written in English, which Searle understands, but the Chinese symbols are meaningless to him. He is given Chinese characters on slips of paper, which he then processes according to the rules. He then produces Chinese characters on slips of paper in response. To an outside observer, it would appear that Searle understands Chinese and is having a conversation with them. However, Searle himself does not understand Chinese at all. He is simply following the rules blindly.

Searle argues that this shows that a computer, which is essentially a machine that follows rules, cannot be said to understand Chinese or to have a mind. The computer may be able to produce intelligent-sounding output, but it does not have the same kind of understanding that a human being has.

The Chinese Room Experiment has been widely discussed and debated by philosophers and computer scientists. Some have argued that Searle's argument is flawed, while others have agreed with his conclusion. The Chinese Room Experiment is a complex and challenging thought experiment, and there is no easy answer to the question of whether or not it succeeds in its goal. However, it is a thought-provoking experiment that has helped to shape the debate about artificial intelligence and the nature of mind. Here are some of the key points of Searle's argument:

  • Understanding a language is not just about manipulating symbols according to rules. It also requires having a grasp of the meaning of the symbols.
  • A computer can manipulate symbols, but it does not have the same kind of understanding that a human being has.
  • The Chinese Room Experiment shows that a computer cannot be said to understand Chinese, even if it can produce intelligent-sounding output.

Large Language Model (LLM) Evaluation


Benchmarks for an LLM:

  • Ability to add attachments to prompts: Attachments, such as images or documents, let an LLM incorporate information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
  • Performance on the bar exam multiple-choice section: The bar exam is a standardized test that is required to practice law in the United States. The multiple-choice section tests knowledge of legal concepts and principles. Claude 2: Scored 76.5%
  • Performance on the GRE reading and writing exams: a standardized test that is often required for admission to graduate programs. The reading and writing sections test reading comprehension, analytical writing, and critical thinking skills. Claude 2: A score above the 90th percentile indicates that the LLM is highly proficient in these skills.
  • Performance on the GRE quantitative reasoning exam: This section tests mathematical and analytical skills. Claude 2: A score similar to the median applicant indicates that the LLM has average proficiency in these skills.
  • Input length limit: the maximum length of the input prompt that an LLM can handle. A token is a sequence of characters that represents a unit of meaning in natural language processing. Claude 2: A limit of 100K tokens per prompt means that the LLM can handle prompts of up to 100,000 tokens in length.
  • Context window limit: Maximum amount of context that an LLM can consider when generating a response to a prompt. Claude 2: A context window of up to 100K means that the LLM can consider up to 100,000 tokens of context when generating a response.
  • Code Generation on HumanEval: This test evaluates the LLM's ability to write Python functions that pass a set of unit tests, that is, functional correctness. Claude 2 on Python coding test: 71.2%
  • GSM8k math problem set: This set of grade-school math word problems evaluates the LLM's ability to carry out multi-step arithmetic reasoning. Claude 2: 88%
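
Percentages such as the HumanEval and GSM8k scores above are pass rates over a set of problems. The following is a minimal sketch of how such rates are commonly computed, not the procedure used to produce the Claude 2 numbers: `pass_at_k` follows the pass@k estimator usually reported with HumanEval, and `exact_match_accuracy` is a hypothetical helper that compares final answers as strings, which is an assumption about how a GSM8k-style grader might work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of code samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of samples a user would be allowed to try
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k samples include a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of final answers that equal the reference after trimming whitespace."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Illustrative numbers only: 10 samples per problem, 5 of them correct.
    print(pass_at_k(n=10, c=5, k=1))
    print(exact_match_accuracy(["18", "3"], ["18", "4"]))
```

A benchmark score is then the mean of the per-problem values across the whole problem set.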


There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:

  • authenticity
  • speed
  • grammar
  • readability
  • unbiasedness
  • backtracking
  • safety
  • responsibility
  • understanding the context
  • text operations

A Survey on Evaluation of Large Language Models

  • A Survey on Evaluation of Large Language Models | Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. Yu, Q. Yang, X. Xie - arXiv ... The survey covers seven major categories of LLM trustworthiness:
    • Reliability: LLMs should be able to consistently generate accurate and truthful outputs, even when presented with new or challenging inputs.
    • Safety: LLMs should not generate outputs that are harmful or dangerous, such as outputs that promote violence or hate speech.
    • Fairness: LLMs should not discriminate against any individual or group of individuals, regardless of their race, gender, sexual orientation, or other protected characteristics.
    • Resistance to misuse: LLMs should be designed in a way that makes it difficult for them to be used for malicious purposes, such as generating fake news or propaganda.
    • Explainability and reasoning: LLMs should be able to explain their reasoning behind their outputs, so that users can understand how they work and make informed decisions about how to use them.
    • Adherence to social norms: LLMs should generate outputs that are consistent with social norms and values, for example by avoiding offensive language and harmful stereotypes.
    • Robustness: LLMs should be able to withstand attacks and manipulation, such as being fed deliberately misleading or harmful data.


Popular Benchmarks for Testing LLMs

  • AI2 Reasoning Challenge (ARC): designed to promote research in advanced question answering, particularly questions that require reasoning. The ARC dataset consists of 7,787 science exam questions from grade 3 to grade 9, with a supporting knowledge base of 14.3M unstructured text passages. The benchmark evaluates the performance of LLMs in answering multiple-choice questions.
  • WinoGrande: evaluates the ability of LLMs to perform commonsense reasoning. The benchmark consists of 44,000 examples that require the model to understand the meaning of words in context and to reason about the relationships between entities.
  • Advanced Reasoning Benchmark (ARB): evaluates the ability of LLMs to perform complex reasoning tasks. The benchmark consists of 1,000 examples that require the model to perform multi-step reasoning and to integrate information from multiple sources.
  • Holistic Evaluation of Language Models (HELM): evaluates the performance of LLMs in multiple tasks, including language modeling, question answering, and summarization. The benchmark consists of 57 datasets covering a wide range of tasks and domains.
  • BIG-bench: evaluates the performance of LLMs in a wide range of tasks, including language modeling, question answering, and summarization. The benchmark consists of more than 200 diverse tasks, many of which require the model to perform complex reasoning and to integrate information from multiple sources.
  • Massive Multitask Language Understanding (MMLU): evaluates the knowledge and problem-solving ability of LLMs with multiple-choice questions. The benchmark covers 57 subjects, ranging from elementary mathematics to law and computer science.
  • SQuAD: tests LLMs on their ability to answer questions about a given passage of text. The SQuAD dataset is a collection of questions and answers that are created by crowdworkers on a set of Wikipedia articles.
  • GLUE: tests LLMs on a variety of natural language understanding tasks, including sentiment analysis, text classification, and question answering. The tasks in the benchmark are:
    • CoLA: Corpus of Linguistic Acceptability
    • SST-2: Stanford Sentiment Treebank
    • MRPC: Microsoft Research Paraphrase Corpus
    • STS-B: Semantic Textual Similarity Benchmark
    • QQP: Quora Question Pairs
    • MNLI: MultiNLI
    • QNLI: Question-answering Natural Language Inference
    • RTE: Recognizing Textual Entailment
    • WNLI: Winograd Natural Language Inference (derived from the Winograd Schema Challenge)
    • AX: the GLUE diagnostic dataset
  • SuperGLUE: an extension of the GLUE benchmark that includes more challenging tasks.
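
Question-answering benchmarks such as SQuAD are typically scored by comparing a predicted answer string with a reference answer. The sketch below illustrates the two usual measures, exact match and token-level F1, under the assumption that answers are normalized by lowercasing and by dropping punctuation and English articles. The function names are illustrative, and this is not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two answers."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("in Paris France", "Paris"))
```

Exact match rewards only fully correct answers, while F1 gives partial credit when the prediction overlaps the reference, which is why the two are usually reported together.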


Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks

Evaluating Large Language Models on Legal Reasoning

LegalBench: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

Natural Language Processing (NLP) Evaluation

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

  • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
  • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.

State of the Art in Natural Language Processing (NLP)
Jeff Heaton Algorithms such as BERT, T5, ERNIE, and others claim to be the state of the art for NLP programs. But what does this mean? How is this evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I’ll explain the details of how BERT is used to perform “Question Answering”--specifically, how it’s applied to SQuAD v1.1 (Stanford Question Answering Dataset). I’ll also walk us through the following notebook, where we’ll take a model that’s already been fine-tuned on SQuAD, and apply it to our own questions and text.

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

Machine Learning Evaluation


Procgen


OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.

Human Evaluation

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It's a security measure that helps protect users from spam and password decryption by verifying that a user is human and not a computer.

I'm not a robot

No CAPTCHA reCAPTCHA: Popularized by Google, this involves a simple checkbox labeled "I'm not a robot." It works by analyzing user behavior, such as mouse movements, to determine if the user is human. If the test is inconclusive, a more traditional image selection CAPTCHA is presented. When you click the checkbox, reCAPTCHA monitors:

  • Mouse movements: Human mouse movements tend to be unpredictable, while bots often exhibit linear or mechanical movements.
  • Click timing: Humans have natural delays in their actions, while bots execute them at near-instantaneous speeds.
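
The two signals above can be illustrated with a toy heuristic. This is only a sketch for intuition: reCAPTCHA's actual features and thresholds are not public, so every name and number below (`path_straightness`, `looks_automated`, the 0.98 and 100 ms limits) is a made-up assumption.

```python
from math import hypot

Point = tuple[float, float]


def path_straightness(points: list[Point]) -> float:
    """Straight-line distance divided by travelled distance (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )
    if travelled == 0:
        return 1.0
    direct = hypot(points[-1][0] - points[0][0], points[-1][1] - points[0][1])
    return direct / travelled


def looks_automated(
    points: list[Point],
    click_delay_ms: float,
    straightness_limit: float = 0.98,  # hypothetical threshold
    min_delay_ms: float = 100.0,       # hypothetical threshold
) -> bool:
    """Flag a session whose pointer path is near-linear and whose click came almost instantly."""
    too_straight = path_straightness(points) >= straightness_limit
    too_fast = click_delay_ms < min_delay_ms
    return too_straight and too_fast


if __name__ == "__main__":
    bot_path = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    human_path = [(0.0, 0.0), (3.0, 4.0), (0.0, 8.0), (6.0, 8.0)]
    print(looks_automated(bot_path, click_delay_ms=5.0))
    print(looks_automated(human_path, click_delay_ms=850.0))
```

A production system would combine many more signals and learn its decision boundary from data rather than use fixed cut-offs.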

Improving AI while trying to outsmart it

The so-called "bot test" (a CAPTCHA in which users identify objects in images or complete other seemingly trivial tasks) has a dual purpose. While it's meant to distinguish between humans and bots, the data collected often helps train AI systems to improve at tasks like image recognition, text understanding, or problem-solving. The effectiveness of CAPTCHAs is constantly being challenged by advancements in artificial intelligence and machine learning. Recent research has demonstrated that advanced AI can effectively solve image-based CAPTCHAs, such as Google's reCAPTCHA v2, with a 100% success rate using YOLO models for image segmentation and classification. This highlights the need for CAPTCHA systems to evolve in response to AI advancements.

In a way, humans doing these tests are teaching the bots to get better at beating the tests themselves. It's a fascinating cycle of humans improving AI while trying to outsmart it! Irony at its finest.

Future Prospects and Innovations

The future of human evaluation CAPTCHA techniques is likely to be shaped by ongoing technological advancements and the need to balance security with user experience. Some promising developments include:

  • Advanced AI and Machine Learning Techniques: As AI becomes more sophisticated in solving CAPTCHAs, new techniques are being developed to create bot-resistant challenges that can adapt to evolving bot strategies.
  • Invisible CAPTCHA: Google's reCAPTCHA v3 represents a significant innovation by eliminating visible challenges for users. Instead, it continuously monitors user behavior to assess the likelihood of a bot interaction, providing a score between 0 and 1.
  • Cognitive Deep-Learning CAPTCHA: A 2023 study introduced a new CAPTCHA system that combines text-based, image-based, and cognitive CAPTCHA characteristics. This system employs adversarial examples and neural style transfer to enhance security, making it more resistant to automated attacks.
  • Behavioral Analysis and Biometric Verification: Innovations are exploring the use of behavioral analysis to distinguish human actions from bot interactions without explicit challenges. Biometric identification is also being considered for seamless user authentication, leveraging unique user characteristics.
  • AI-Powered Solutions: AI is also being used on the defensive side, to design algorithms that better distinguish bot activity from human input.

Evaluating Machine Learning (ML) Hardware, Software, and Services

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what’s missing from the benchmark, and which ones are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track. In it I discuss some benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite. AIML-301-1:Using AI/ML for Flash Performance Scaling, Part 1 -

Backtracking

Backtracking is a general algorithmic technique that searches the space of possible combinations in order to solve a computational problem. It builds candidate solutions incrementally and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
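
As a small illustration of the technique, the sketch below applies backtracking to the N-Queens puzzle, a classic constraint satisfaction problem chosen here because it is compact; it is not taken from the original text. Queens are placed row by row, and a placement is undone as soon as no safe column remains in a later row. The function name `solve_n_queens` is illustrative.

```python
from typing import Optional


def solve_n_queens(n: int) -> Optional[list[int]]:
    """Return one solution as a list where index = row and value = column, or None."""
    columns: list[int] = []  # columns[r] is the column of the queen in row r

    def is_safe(row: int, col: int) -> bool:
        # A new queen conflicts with an earlier one on the same column or diagonal.
        for r, c in enumerate(columns):
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def place(row: int) -> bool:
        if row == n:
            return True  # every row has a queen: the candidate is complete
        for col in range(n):
            if is_safe(row, col):
                columns.append(col)   # extend the partial candidate
                if place(row + 1):
                    return True
                columns.pop()         # dead end further down: backtrack
        return False                  # no column works in this row

    return columns if place(0) else None


if __name__ == "__main__":
    print(solve_n_queens(8))
```

The same extend, check, and undo pattern carries over to Sudoku or verbal arithmetic by swapping the constraint check in `is_safe` and the set of choices tried at each step.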



American Productivity & Quality Center (APQC)

APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.