Benchmarks





You can’t improve what you don’t measure. — Peter Drucker



AI Consciousness Testing


Turing Test

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia
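Because the imitation game is essentially a blinded, text-only protocol, it can be sketched as a short chat loop. The following is a minimal, hypothetical harness, not a standard tool: model_reply stands in for whatever system is under test, and the human respondent's answers would be relayed from a hidden person in a real run.

  import random

  def model_reply(message: str) -> str:
      # Placeholder for the system under test (e.g., a call to an LLM API).
      return "That's an interesting question. What do you think?"

  def human_reply(message: str) -> str:
      # In a real run, this would be relayed to a hidden human respondent.
      return input(f"[hidden human] {message}\n> ")

  def imitation_game(rounds: int = 5) -> None:
      # Randomly assign the machine and the human to labels A and B,
      # so the evaluator cannot tell which is which.
      respondents = {"A": model_reply, "B": human_reply}
      if random.random() < 0.5:
          respondents = {"A": human_reply, "B": model_reply}

      for _ in range(rounds):
          question = input("Evaluator, ask a question: ")
          for label, reply in respondents.items():
              print(f"{label}: {reply(question)}")

      guess = input("Which respondent is the machine (A/B)? ").strip().upper()
      truth = "A" if respondents["A"] is model_reply else "B"
      print("Correct!" if guess == truth else "Fooled: the machine passed this round.")

  if __name__ == "__main__":
      imitation_game()

The point of the sketch is only the setup: text-only channel, blinded labels, and a judgment at the end; it says nothing about how convincing the machine's replies are.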



Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil



What is a Turing Test? A Brief History of the Turing Test and its Impact

Is a computer as smart as a human? Only a Turing Test will tell -- plus its many spin-offs. A Turing Test is a method of determining whether a computer is capable of thinking like a human. Watch to learn what a Turing Test is and how it relates to AI technology.

Will ChatGPT Pass The Turing Test? Let's Find Out!
I have been testing ChatGPT for the past few days and it has been nothing short of spectacular. Now the moment of truth is upon us: Will it pass the Turing Test? Find out in this video. Will it exhibit intelligence that will fool humans into thinking it's not a machine? You'd be surprised!

ChatGPT says:

"The Turing Test is a measure of a machine's ability to exhibit intelligent behavior that is indistinguishable from that of a human. It was first proposed by the British mathematician and computer scientist Alan Turing in 1950. The basic idea of the test is that a human evaluator engages in a text-based conversation with both a human and a machine, without knowing which is which. If the evaluator is unable to reliably determine which is the human and which is the machine, then the machine is said to have passed the Turing Test and demonstrated human-like intelligence.

The Turing Test has become an influential concept in the field of artificial intelligence and continues to be an active area of research and development. While some AI systems have been able to fool evaluators into thinking they are human in limited cases, no machine has yet passed the Turing Test in a comprehensive and sustained manner. Nonetheless, the Turing Test remains a useful benchmark for evaluating the progress of AI and a means for stimulating discussion about the nature of human intelligence and the potential for machines to possess similar capabilities."

Chinese Room Thought Experiment

The Chinese Room is a thought experiment proposed by John Searle in 1980 to argue against the claim that a computer can have a mind or be conscious. Although some philosophers and computer scientists have criticized the argument, it remains one of the most influential objections to that claim.

In the experiment, Searle imagines himself locked in a room with a set of rules for manipulating Chinese symbols. The rules are written in English, which Searle understands, but the Chinese symbols are meaningless to him. He is handed Chinese characters on slips of paper, processes them according to the rules, and passes back other Chinese characters in response. To an outside observer, it appears that the person in the room understands Chinese and is holding a conversation. Yet Searle himself does not understand Chinese at all; he is simply following the rules blindly.

Searle argues that this shows that a computer, which is essentially a machine that follows rules, cannot be said to understand Chinese or to have a mind. The computer may be able to produce intelligent-sounding output, but it does not have the same kind of understanding that a human being has.

The Chinese Room Experiment has been widely discussed and debated by philosophers and computer scientists. Some have argued that Searle's argument is flawed, while others have agreed with his conclusion. The Chinese Room Experiment is a complex and challenging thought experiment, and there is no easy answer to the question of whether or not it succeeds in its goal. However, it is a thought-provoking experiment that has helped to shape the debate about artificial intelligence and the nature of mind. Here are some of the key points of Searle's argument:

  • Understanding a language is not just about manipulating symbols according to rules. It also requires having a grasp of the meaning of the symbols.
  • A computer can manipulate symbols, but it does not have the same kind of understanding that a human being has.
  • The Chinese Room Experiment shows that a computer cannot be said to understand Chinese, even if it can produce intelligent-sounding output.
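To make the intuition concrete, here is a toy sketch of Searle's setup: a purely syntactic rule table maps incoming Chinese strings to outgoing Chinese strings. The specific rules are invented for illustration; the point is that the program can produce plausible replies while "understanding" nothing.

  # A toy "Chinese room": the rulebook is just a lookup table over symbol patterns.
  # The program matches shapes of symbols and emits other symbols; no meaning is involved.
  RULEBOOK = {
      "你好": "你好！很高兴认识你。",        # "Hello" -> "Hello! Nice to meet you."
      "你会说中文吗": "会，我说得很好。",    # "Do you speak Chinese?" -> "Yes, very well."
  }

  def room(slip: str) -> str:
      # Follow the rules blindly; if no rule matches, hand back a stock reply.
      return RULEBOOK.get(slip, "请再说一遍。")  # "Please say that again."

  if __name__ == "__main__":
      for slip in ["你好", "你会说中文吗"]:
          print(slip, "->", room(slip))

Whether scaling this kind of symbol manipulation up ever amounts to understanding is exactly what the debate is about.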

Large Language Model (LLM) Evaluation


Benchmarks for an LLM:

  • Ability to add attachments to prompts: attachments such as images or documents let an LLM incorporate information beyond the textual prompt, which can improve the accuracy and relevance of its responses. Claude 2: prompts can include attachments
  • Performance on the bar exam multiple-choice section: The bar exam is a standardized test that is required to practice law in the United States. The multiple-choice section tests knowledge of legal concepts and principles. Claude 2: Scored 76.5%
  • Performance on the GRE reading and writing exams: the GRE is a standardized test often required for admission to graduate programs; its reading and writing sections test reading comprehension, analytical writing, and critical thinking. Claude 2: scored above the 90th percentile, indicating high proficiency in these skills.
  • Performance on the GRE quantitative reasoning exam: this section tests mathematical and analytical skills. Claude 2: scored similarly to the median applicant, indicating average proficiency in these skills.
  • Input length limit: the maximum length of the input prompt that an LLM can handle. A token is a sequence of characters that represents a unit of meaning in natural language processing. Claude 2: A limit of 100K tokens per prompt means that the LLM can handle prompts of up to 100,000 tokens in length.
  • Context window limit: Maximum amount of context that an LLM can consider when generating a response to a prompt. Claude 2: A context window of up to 100K means that the LLM can consider up to 100,000 tokens of context when generating a response.
  • Code Generation on HumanEval: this test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency; results are usually reported as pass@k (see the sketch after this list). Claude 2 on the Python coding test: 71.2%
  • GSM8k math problem set: This problem set evaluates the LLM's ability to solve mathematical problems of varying difficulty. Claude 2: 88%
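HumanEval-style code benchmarks are usually scored as pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a minimal sketch of the standard unbiased estimator from the HumanEval paper; the per-problem (n, c) counts used at the bottom are made-up examples.

  import math

  def pass_at_k(n: int, c: int, k: int) -> float:
      """Unbiased pass@k estimator: n samples generated, c of them pass the tests."""
      if n - c < k:
          return 1.0
      # 1 - C(n-c, k) / C(n, k): probability that at least one of k drawn samples is correct.
      return 1.0 - math.comb(n - c, k) / math.comb(n, k)

  if __name__ == "__main__":
      # Hypothetical per-problem results: (samples generated, samples passing tests).
      results = [(20, 15), (20, 3), (20, 0)]
      for k in (1, 5, 10):
          score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
          print(f"pass@{k} = {score:.3f}")

The benchmark score is the average of pass@k over all problems in the suite.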


There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:

  • authenticity
  • speed
  • grammar
  • readability
  • unbiasedness
  • backtracking
  • safety
  • responsibility
  • understanding the context
  • text operations

A Survey on Evaluation of Large Language Models

  • A Survey on Evaluation of Large Language Models | Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. Yu, Q. Yang, X. Xie - arXiv ... The survey covers seven major categories of LLM trustworthiness:
    • Reliability: LLMs should be able to consistently generate accurate and truthful outputs, even when presented with new or challenging inputs.
    • Safety: LLMs should not generate outputs that are harmful or dangerous, such as outputs that promote violence or hate speech.
    • Fairness: LLMs should not discriminate against any individual or group of individuals, regardless of their race, gender, sexual orientation, or other protected characteristics.
    • Resistance to misuse: LLMs should be designed in a way that makes it difficult for them to be used for malicious purposes, such as generating fake news or propaganda.
    • Explainability and reasoning: LLMs should be able to explain their reasoning behind their outputs, so that users can understand how they work and make informed decisions about how to use them.
    • Adherence to social norms: LLMs should generate outputs that are consistent with social norms and values, such as avoiding offensive language or promoting harmful stereotypes.
    • Robustness: LLMs should be able to withstand attacks and manipulation, such as being fed deliberately misleading or harmful data.


Popular Benchmarks for Testing LLMs

  • AI2 Reasoning Challenge (ARC): designed to promote research in advanced question-answering, particularly questions that require reasoning. The ARC dataset consists of 7,787 science exam questions from grade 3 to grade 9, with a supporting knowledge base of 14.3M unstructured text passages. The benchmark evaluates the performance of LLMs in answering multiple-choice questions.
  • WinoGrande: evaluates the ability of LLMs to perform commonsense reasoning. The benchmark consists of 44,000 examples that require the model to understand the meaning of words in context and to reason about the relationships between entities.
  • Advanced Reasoning Benchmark (ARB): evaluates the ability of LLMs to perform complex reasoning tasks. The benchmark consists of 1,000 examples that require the model to perform multi-step reasoning and to integrate information from multiple sources.
  • Holistic Evaluation of Language Models (HELM): evaluates the performance of LLMs across multiple tasks, including language modeling, question answering, and summarization. The benchmark consists of 57 datasets covering a wide range of tasks and domains.
  • Big Bench: evaluates the performance of LLMs on a wide range of tasks, including language modeling, question answering, and summarization. The benchmark consists of more than 200 diverse tasks that require the model to perform complex reasoning and to integrate information from multiple sources.
  • Massive Multitask Language Understanding (MMLU): evaluates the breadth of an LLM's knowledge and reasoning with multiple-choice questions spanning 57 subjects, from STEM to the humanities and social sciences (a minimal scoring sketch follows this list).
  • SQuAD: tests LLMs on their ability to answer questions about a given passage of text. The SQuAD dataset is a collection of questions and answers that are created by crowdworkers on a set of Wikipedia articles.
  • GLUE: tests LLMs on a variety of natural language understanding tasks, including sentiment analysis, text classification, and question answering.
  • SuperGLUE: an extension of the GLUE benchmark that includes more challenging tasks such as BoolQ, CommitmentBank (CB), COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. For reference, the tasks in the original GLUE benchmark are:
    • CoLA: Corpus of Linguistic Acceptability
    • SST-2: Stanford Sentiment Treebank
    • MRPC: Microsoft Research Paraphrase Corpus
    • STS-B: Semantic Textual Similarity Benchmark
    • QQP: Quora Question Pairs
    • MNLI: MultiNLI
    • QNLI: Question Natural Language Inference
    • RTE: Recognizing Textual Entailment
    • WNLI: Winograd Natural Language Inference (derived from the Winograd Schema Challenge)
    • AX: the GLUE diagnostic set for analyzing textual entailment across linguistic phenomena
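Many of the benchmarks above (ARC, MMLU, and similar) reduce to multiple-choice accuracy: show the model the question plus the answer options, take the letter it picks, and compare with the gold answer. Here is a minimal, model-agnostic sketch; ask_model is a hypothetical stand-in for whatever LLM is being evaluated, and the two sample questions are invented.

  # Minimal multiple-choice evaluation loop (ARC/MMLU style).
  from typing import Callable

  SAMPLE_ITEMS = [
      {"question": "Which planet is known as the Red Planet?",
       "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
       "answer": "B"},
      {"question": "What gas do plants primarily absorb for photosynthesis?",
       "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Carbon dioxide", "D": "Helium"},
       "answer": "C"},
  ]

  def ask_model(prompt: str) -> str:
      # Placeholder: always answers "B". Swap in a real LLM call and parse the letter.
      return "B"

  def accuracy(items, model: Callable[[str], str]) -> float:
      correct = 0
      for item in items:
          options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
          prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
          if model(prompt).strip().upper().startswith(item["answer"]):
              correct += 1
      return correct / len(items)

  if __name__ == "__main__":
      print(f"accuracy = {accuracy(SAMPLE_ITEMS, ask_model):.2f}")

Real harnesses differ mainly in prompt formatting, few-shot examples, and how the model's choice is extracted (generated letter vs. per-option log-likelihood).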


Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks

Evaluating Large Language Models on Legal Reasoning

LegalBench: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

Natural Language Processing (NLP) Evaluation

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

  • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
  • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
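For readers who want to poke at GLUE directly, the tasks and their official metrics are exposed through the Hugging Face datasets and evaluate libraries. A minimal sketch for the SST-2 task follows; the trivial predict function is a placeholder, not a real model, and exact behavior can vary slightly across library versions.

  # Requires: pip install datasets evaluate
  from datasets import load_dataset
  import evaluate

  # Load the SST-2 sentiment task from the GLUE benchmark.
  sst2 = load_dataset("glue", "sst2", split="validation")
  metric = evaluate.load("glue", "sst2")  # reports accuracy for SST-2

  def predict(sentence: str) -> int:
      # Placeholder classifier: always predicts "positive" (label 1).
      return 1

  predictions = [predict(example["sentence"]) for example in sst2]
  references = sst2["label"]
  print(metric.compute(predictions=predictions, references=references))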

State of the Art in Natural Language Processing (NLP)
Jeff Heaton Algorithms such as BERT, T4, ERNIE, and others claim to be the state of the art for NLP programs. But what does this mean? How is this evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I’ll explain the details of how BERT is used to perform “Question Answering”--specifically, how it’s applied to SQuAD v1.1 (Stanford Question Answering Dataset). I’ll also walk us through the following notebook, where we’ll take a model that’s already been fine-tuned on SQuAD, and apply it to our own questions and text.
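For readers who want to reproduce the gist of that notebook, the Hugging Face transformers pipeline exposes SQuAD-style extractive question answering with a model already fine-tuned on SQuAD. A minimal sketch is below; the context passage is just a short example paragraph.

  # Requires: pip install transformers
  from transformers import pipeline

  # A model fine-tuned on SQuAD; the pipeline returns an answer span taken from the context.
  qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

  context = ("The Stanford Question Answering Dataset (SQuAD) is a reading comprehension "
             "dataset consisting of questions posed by crowdworkers on Wikipedia articles.")
  result = qa(question="Who wrote the questions in SQuAD?", context=context)
  print(result["answer"], f"(score: {result['score']:.2f})")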

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

Machine Learning Evaluation


Procgen


Procgen Benchmark is a suite of 16 procedurally generated, game-like environments released by OpenAI to measure how quickly reinforcement learning agents learn generalizable skills. OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which are trained to do things by trial and error). More recently, it made available Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
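Procgen environments plug into the standard Gym interface, so a random-agent rollout is only a few lines. This is a minimal sketch assuming the procgen package is installed (pip install procgen) and the classic Gym API (4-tuple step returns) that Procgen was released against; CoinRun is used as the example environment.

  # Requires: pip install procgen gym
  import gym

  # Procgen registers its environments under the "procgen:" namespace;
  # num_levels=0 means an unlimited set of procedurally generated levels.
  env = gym.make("procgen:procgen-coinrun-v0", num_levels=0, start_level=0)
  obs = env.reset()

  total_reward = 0.0
  for _ in range(1000):
      obs, reward, done, info = env.step(env.action_space.sample())  # random agent
      total_reward += reward
      if done:
          obs = env.reset()

  print("return collected by a random agent:", total_reward)
  env.close()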

Human Evaluation

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It's a security measure that helps protect users from spam and password decryption by verifying that a user is human and not a computer.

I'm not a robot

No CAPTCHA reCAPTCHA: popularized by Google, this involves a simple checkbox labeled "I am not a robot." It works by analyzing user behavior, such as mouse movements, to determine whether the user is human. If the test is inconclusive, a more traditional image-selection CAPTCHA is presented. When you click the checkbox, reCAPTCHA monitors (a toy scoring sketch follows the list):

  • Mouse movements: Human mouse movements tend to be unpredictable, while bots often exhibit linear or mechanical movements.
  • Click timing: Humans have natural delays in their actions, while bots execute them at near-instantaneous speeds.
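As a toy illustration of the behavioral signals above, the sketch below scores a recorded mouse path: near-perfectly straight, evenly timed movement looks bot-like, while noisy paths with variable timing look human. This is an invented heuristic for illustration only, not how reCAPTCHA actually scores users.

  import statistics

  def bot_likeness(points):
      """points: list of (x, y, t) samples from a mouse trace. Higher = more bot-like."""
      dxs = [b[0] - a[0] for a, b in zip(points, points[1:])]
      dys = [b[1] - a[1] for a, b in zip(points, points[1:])]
      dts = [b[2] - a[2] for a, b in zip(points, points[1:])]
      # Bots often move in straight lines (low variance in direction)
      # and at machine-regular intervals (low variance in timing).
      direction_var = statistics.pvariance([dy / dx if dx else 0.0 for dx, dy in zip(dxs, dys)])
      timing_var = statistics.pvariance(dts)
      return 1.0 / (1.0 + direction_var + timing_var)

  # A perfectly linear, evenly timed trace (bot-like) vs. a jittery, human-like one.
  bot_trace = [(i, i, i * 0.01) for i in range(20)]
  human_trace = [(i, (i * 7) % 5, i * 0.01 + (0.004 if i % 3 else 0.0)) for i in range(20)]
  print("bot:", round(bot_likeness(bot_trace), 3), "human:", round(bot_likeness(human_trace), 3))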

Improving AI while trying to outsmart it

The so-called "bot test"—like CAPTCHAs, where users identify objects in images or complete other seemingly trivial tasks—has a dual purpose. While it's meant to distinguish between humans and bots, the data collected often helps train AI systems to improve at tasks like image recognition, text understanding, or problem-solving. the effectiveness of CAPTCHAs is constantly being challenged by advancements in artificial intelligence and machine learning. Recent research has demonstrated that advanced AI can effectively solve image-based CAPTCHAs, such as Google's reCAPTCHAv2, with a 100% success rate using YOLO models for image segmentation and classification. This highlights the need for CAPTCHA systems to evolve in response to AI advancements.

In a way, humans doing these tests are teaching the bots to get better at beating the tests themselves. It's a fascinating cycle of humans improving AI while trying to outsmart it! Irony at its finest.

Future Prospects and Innovations

The future of human evaluation CAPTCHA techniques is likely to be shaped by ongoing technological advancements and the need to balance security with user experience. Some promising developments include:

  • Advanced AI and Machine Learning Techniques: As AI becomes more sophisticated in solving CAPTCHAs, new techniques are being developed to create CAPTCHA-resistant challenges that can adapt to evolving bot strategies.
  • Invisible CAPTCHA: Google's reCAPTCHA v3 represents a significant innovation by eliminating visible challenges for users. Instead, it continuously monitors user behavior to assess the likelihood of a bot interaction, providing a score between 0 and 1 (a server-side verification sketch follows this list).
  • Cognitive Deep-Learning CAPTCHA: A 2023 study introduced a new CAPTCHA system that combines text-based, image-based, and cognitive CAPTCHA characteristics. This system employs adversarial examples and neural style transfer to enhance security, making it more resistant to automated attacks.
  • Behavioral Analysis and Biometric Verification: Innovations are exploring the use of behavioral analysis to distinguish human actions from bot interactions without explicit challenges. Biometric identification is also being considered for seamless user authentication, leveraging unique user characteristics.
  • AI-Powered Solutions: AI algorithms are being developed to create CAPTCHA-resistant challenges that can adapt to evolving bot strategies. This includes employing AI to design intelligent algorithms that better distinguish bot activity from human input.
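For the invisible reCAPTCHA v3 item above, the server-side half is a single verification call: the backend posts the client token to Google's siteverify endpoint and receives a success flag plus a 0-1 score to threshold as it sees fit. A minimal sketch using requests is below; the secret key and threshold are placeholders.

  # Requires: pip install requests
  import requests

  VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"
  SECRET_KEY = "your-recaptcha-secret-key"  # placeholder: issued in the reCAPTCHA admin console

  def is_probably_human(client_token: str, threshold: float = 0.5) -> bool:
      """Verify a reCAPTCHA v3 token server-side and apply a score threshold."""
      resp = requests.post(VERIFY_URL,
                           data={"secret": SECRET_KEY, "response": client_token},
                           timeout=5)
      result = resp.json()
      # result looks like {"success": true, "score": 0.9, "action": "login", ...}
      return result.get("success", False) and result.get("score", 0.0) >= threshold

The threshold is a site-specific trade-off: a high cutoff blocks more bots but risks rejecting legitimate users with unusual browsing behavior.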

Evaluating Machine Learning (ML) Hardware, Software, and Services

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what’s missing from the benchmark, and which ones are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track. In it I discuss some benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite. AIML-301-1: Using AI/ML for Flash Performance Scaling, Part 1
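MLPerf's inference scenarios ultimately come down to measuring latency distributions and throughput under a defined load. As a rough, unofficial illustration (not the actual MLPerf LoadGen harness), the sketch below times a dummy inference function and reports p50/p99 latency and queries per second.

  import statistics
  import time

  def run_inference(sample):
      # Placeholder for a real model call; simulate ~2 ms of work.
      time.sleep(0.002)
      return sample

  def benchmark(num_queries: int = 500):
      latencies = []
      start = time.perf_counter()
      for i in range(num_queries):
          t0 = time.perf_counter()
          run_inference(i)
          latencies.append(time.perf_counter() - t0)
      wall = time.perf_counter() - start

      latencies.sort()
      p50 = latencies[len(latencies) // 2]
      p99 = latencies[int(len(latencies) * 0.99) - 1]
      print(f"p50 latency: {p50 * 1000:.2f} ms  p99 latency: {p99 * 1000:.2f} ms")
      print(f"throughput: {num_queries / wall:.1f} queries/sec")

  if __name__ == "__main__":
      benchmark()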

Backtracking

Backtracking is a general algorithmic technique that searches through every possible combination in order to solve a computational problem. It incrementally builds candidates to the solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning and AI, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles, as sketched below.
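Since the paragraph above describes backtracking abstractly, here is a compact, standard example: solving N-queens by placing one queen per row and backtracking as soon as a partial placement conflicts.

  def solve_n_queens(n: int, cols=()):
      """Yield solutions as tuples: cols[r] is the column of the queen in row r."""
      row = len(cols)
      if row == n:
          yield cols
          return
      for col in range(n):
          # Abandon this candidate (backtrack) if it attacks any already-placed queen.
          if any(col == c or abs(col - c) == row - r for r, c in enumerate(cols)):
              continue
          yield from solve_n_queens(n, cols + (col,))

  if __name__ == "__main__":
      solutions = list(solve_n_queens(6))
      print(f"6-queens has {len(solutions)} solutions; first: {solutions[0]}")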



American Productivity & Quality Center (APQC)

APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points. \