Benchmarks

AI Consciousness Testing


Turing Test

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia



Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil



What is a Turing Test? A Brief History of the Turing Test and its Impact

Is a computer as smart as a human? Only a Turing Test will tell -- plus its many spin-offs. A Turing Test is a method of determining whether a computer is capable of thinking like a human. Watch to learn what a Turing Test is and how it relates to AI technology.

Will ChatGPT Pass The Turing Test? Let's Find Out!
I have been testing ChatGPT for the past few days and it has been nothing short of spectacular. Now the moment of truth is upon us: will it pass the Turing Test? Find out in this video. Will it exhibit intelligence that fools humans into thinking it's not a machine? You'd be surprised!

ChatGPT says:

"The Turing Test is a measure of a machine's ability to exhibit intelligent behavior that is indistinguishable from that of a human. It was first proposed by the British mathematician and computer scientist Alan Turing in 1950. The basic idea of the test is that a human evaluator engages in a text-based conversation with both a human and a machine, without knowing which is which. If the evaluator is unable to reliably determine which is the human and which is the machine, then the machine is said to have passed the Turing Test and demonstrated human-like intelligence.

The Turing Test has become an influential concept in the field of artificial intelligence and continues to be an active area of research and development. While some AI systems have been able to fool evaluators into thinking they are human in limited cases, no machine has yet passed the Turing Test in a comprehensive and sustained manner. Nonetheless, the Turing Test remains a useful benchmark for evaluating the progress of AI and a means for stimulating discussion about the nature of human intelligence and the potential for machines to possess similar capabilities."

Large Language Model (LLM) Evaluation



Benchmarks for an LLM:

  • Ability to add attachments to prompts: attachments, such as images or documents, let an LLM incorporate information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
  • Performance on the bar exam multiple-choice section: The bar exam is a standardized test required to practice law in the United States; the multiple-choice section tests knowledge of legal concepts and principles. Claude 2: scored 76.5%
  • Performance on the GRE reading and writing exams: The GRE is a standardized test often required for admission to graduate programs; the reading and writing sections test reading comprehension, analytical writing, and critical thinking. Claude 2: scored above the 90th percentile, indicating high proficiency in these skills
  • Performance on the GRE quantitative reasoning exam: This section tests mathematical and analytical skills. Claude 2: scored similarly to the median applicant, indicating average proficiency in these skills
  • Input length limit: the maximum length of input prompt that an LLM can handle, measured in tokens. A token is a sequence of characters that represents a unit of meaning in natural language processing. Claude 2: accepts prompts of up to 100K (100,000) tokens
  • Context window limit: the maximum amount of context an LLM can consider when generating a response to a prompt. Claude 2: a context window of up to 100K tokens
  • Code Generation on HumanEval: This test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency; a sketch of the usual scoring method follows this list. Claude 2: 71.2% on the Python coding test
  • GSM8k math problem set: This problem set evaluates the LLM's ability to solve grade-school math problems of varying difficulty. Claude 2: 88%
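
HumanEval-style results such as the 71.2% above are commonly reported as pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal sketch of the standard unbiased estimator (the sample counts in the example are illustrative, not Claude 2's published setup):

```python
# Sketch of the unbiased pass@k estimator used for HumanEval-style code
# benchmarks: n completions are sampled per problem, c of them pass the
# unit tests, and pass@k is the chance that a random size-k subset of
# the samples contains at least one passing completion.
import math

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # too few failures: every size-k subset passes
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers: 200 samples for one problem, 142 pass the tests.
print(pass_at_k(n=200, c=142, k=1))  # 0.71, i.e. pass@1 of 71%
```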

Several factors should be considered when evaluating Large Language Models (LLMs), including:

  • authenticity
  • speed
  • grammar
  • readability
  • unbiasedness
  • backtracking
  • safety
  • responsibility
  • understanding the context
  • text operations

Backtracking

Backtracking is a general algorithmic technique that searches every possible combination in order to solve a computational problem. It incrementally builds candidates for a solution and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
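
A minimal sketch of backtracking on one such constraint-satisfaction problem, the N-queens puzzle (the function and variable names are illustrative, not from any particular library):

```python
# Backtracking sketch for N-queens: place one queen per row, extend the
# partial solution column by column, and abandon (backtrack) any branch
# that violates a constraint before it can be completed.
def solve_n_queens(n, cols=None):
    cols = cols or []          # cols[r] is the column of the queen in row r
    row = len(cols)
    if row == n:
        return cols            # every row has a queen: complete solution
    for col in range(n):
        # Constraint check: no shared column or diagonal with earlier rows.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(cols)):
            solution = solve_n_queens(n, cols + [col])
            if solution is not None:
                return solution
    return None                # dead end: caller backtracks to its last choice

print(solve_n_queens(8))  # e.g. [0, 4, 7, 5, 2, 6, 1, 3]
```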

Natural Language Processing (NLP) Evaluation

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems, covering tasks like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

  • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
  • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
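
As an illustration, the GLUE tasks are commonly loaded through the Hugging Face datasets library; a sketch assuming that library (it is not part of GLUE itself):

```python
# Sketch: loading one of the nine GLUE tasks (MRPC, a sentence-pair
# paraphrase task) with the Hugging Face `datasets` library.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # train / validation / test splits
example = mrpc["train"][0]
print(example["sentence1"])
print(example["sentence2"])
print(example["label"])               # 1 = paraphrase, 0 = not
```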

State of the Art in Natural Language Processing (NLP)
Jeff Heaton: Algorithms such as BERT, T4, ERNIE, and others claim to be the state of the art for NLP programs. But what does this mean? How is this evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I’ll explain the details of how BERT is used to perform “Question Answering”--specifically, how it’s applied to SQuAD v1.1 (Stanford Question Answering Dataset). I’ll also walk us through the following notebook, where we’ll take a model that’s already been fine-tuned on SQuAD, and apply it to our own questions and text.
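
A minimal sketch of that workflow, assuming the Hugging Face transformers library and one of its publicly available SQuAD-fine-tuned BERT checkpoints:

```python
# Sketch: applying a BERT model already fine-tuned on SQuAD to our own
# question and passage via the `transformers` question-answering pipeline.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa(
    question="Who created SQuAD?",
    context="SQuAD is a reading-comprehension dataset created by "
            "researchers at Stanford University.",
)
print(result["answer"], result["score"])  # answer is a span of the context
```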

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

Machine Learning Evaluation

Lecture 13 – Evaluation Metrics | Stanford CS224U: Natural Language Understanding | Spring 2019
Christopher Potts, Professor of Linguistics and, by courtesy, of Computer Science; Director, Stanford Center for the Study of Language and Information. Bill MacCartney, Consulting Assistant Professor; Senior Engineering Manager, Apple.

Kaggle Reading Group : An Open Source AutoML Benchmark | Kaggle
This week we're starting a new paper: An Open Source AutoML Benchmark by Gijsbers et al. from the 2019 ICML Workshop on Automated Machine Learning.


Machine Learning Model Evaluation Metrics
Maria Khalusova, Developer Advocate at JetBrains: Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you’ll ultimately use. Those coming to ML from software development are often self-taught, and practice exercises and competitions generally dictate the evaluation metric. In a real-world scenario, how do you choose an appropriate metric? This talk will explore the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision.
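
For concreteness, a sketch of a few such metrics, assuming scikit-learn (the toy labels are illustrative):

```python
# Sketch: common classification and regression metrics via scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

y_true = [0, 1, 1, 0, 1]                 # classification ground truth
y_pred = [0, 1, 0, 0, 1]                 # model predictions
print(accuracy_score(y_true, y_pred))    # 0.8 (4 of 5 correct)
print(f1_score(y_true, y_pred))          # 0.8 (harmonic mean of P and R)

y_true_r = [2.5, 0.0, 2.0]               # regression ground truth
y_pred_r = [3.0, -0.1, 2.1]              # model predictions
print(mean_squared_error(y_true_r, y_pred_r))  # 0.09
```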

Characterization and Benchmarking of Deep Learning
In this video from the HPC User Forum in Milwaukee, Natalia Vassilieva from HP Labs presents: Characterization and Benchmarking of Deep Learning.

Measuring training and inference performance of ML hardware, software, and services

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what’s missing from the benchmark, and which ones are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske: This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track. In it I discuss benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite. AIML-301-1: Using AI/ML for Flash Performance Scaling, Part 1

Procgen


OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
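
A minimal sketch of the trial-and-error loop that Gym-style environments expose, assuming the gym package with its 0.26+ API (a random policy stands in for a learning algorithm; Procgen and SafetyGym environments follow a similar interface):

```python
# Sketch: the reinforcement-learning trial-and-error loop on a generic
# Gym environment, with a random policy standing in for a learner.
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()   # random action, for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:          # episode over: the trial ends
        break
env.close()
print(total_reward)
```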


American Productivity & Quality Center (APQC)

APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.