Benchmarks

AI Consciousness Testing


Turing Test

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia



Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil



What is a Turing Test? A Brief History of the Turing Test and its Impact

Is a computer as smart as a human? Only a Turing Test will tell -- plus its many spin-offs. A Turing Test is a method of determining whether a computer is capable of thinking like a human. Watch to learn what a Turing Test is and how it relates to AI technology.

Will ChatGPT Pass The Turing Test? Let's Find Out!
I have been testing ChatGPT for the past few days and it has been nothing short of spectacular. Now the moment of truth is upon us: will it pass the Turing Test? Find out in this video. Will it exhibit intelligence that fools humans into thinking it's not a machine? You'd be surprised!

ChatGPT says:

"The Turing Test is a measure of a machine's ability to exhibit intelligent behavior that is indistinguishable from that of a human. It was first proposed by the British mathematician and computer scientist Alan Turing in 1950. The basic idea of the test is that a human evaluator engages in a text-based conversation with both a human and a machine, without knowing which is which. If the evaluator is unable to reliably determine which is the human and which is the machine, then the machine is said to have passed the Turing Test and demonstrated human-like intelligence.

The Turing Test has become an influential concept in the field of artificial intelligence and continues to be an active area of research and development. While some AI systems have been able to fool evaluators into thinking they are human in limited cases, no machine has yet passed the Turing Test in a comprehensive and sustained manner. Nonetheless, the Turing Test remains a useful benchmark for evaluating the progress of AI and a means for stimulating discussion about the nature of human intelligence and the potential for machines to possess similar capabilities."

Large Language Model (LLM) Evaluation



Benchmarks for an LLM:

  • Ability to add attachments to prompts: attachments, such as images or documents, let an LLM incorporate information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
  • Performance on the bar exam multiple-choice section: The bar exam is a standardized test required to practice law in the United States; the multiple-choice section tests knowledge of legal concepts and principles. Claude 2: scored 76.5%
  • Performance on the GRE reading and writing exams: The GRE is a standardized test often required for admission to graduate programs; the reading and writing sections test reading comprehension, analytical writing, and critical thinking. Claude 2: scored above the 90th percentile, indicating high proficiency in these skills
  • Performance on the GRE quantitative reasoning exam: This section tests mathematical and analytical skills. Claude 2: scored similarly to the median applicant, indicating average proficiency in these skills
  • Input length limit: the maximum length of input prompt that an LLM can handle, measured in tokens. A token is a sequence of characters that represents a unit of meaning in natural language processing. Claude 2: accepts prompts of up to 100K (100,000) tokens
  • Context window limit: the maximum amount of context an LLM can consider when generating a response to a prompt. Claude 2: a context window of up to 100K tokens
  • Code Generation on HumanEval: This test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency; a sketch of the usual scoring method follows this list. Claude 2: 71.2% on the Python coding test
  • GSM8k math problem set: This problem set evaluates the LLM's ability to solve grade-school math problems of varying difficulty. Claude 2: 88%
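
HumanEval-style results such as the 71.2% above are commonly reported as pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal sketch of the standard unbiased estimator (the sample counts in the example are illustrative, not Claude 2's published setup):

```python
# Sketch of the unbiased pass@k estimator used for HumanEval-style code
# benchmarks: n completions are sampled per problem, c of them pass the
# unit tests, and pass@k is the chance that a random size-k subset of
# the samples contains at least one passing completion.
import math

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # too few failures: every size-k subset passes
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers: 200 samples for one problem, 142 pass the tests.
print(pass_at_k(n=200, c=142, k=1))  # 0.71, i.e. pass@1 of 71%
```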

Several factors should be considered when evaluating Large Language Models (LLMs), including:

  • authenticity
  • speed
  • grammar
  • readability
  • unbiasedness
  • backtracking
  • safety
  • responsibility
  • understanding the context
  • text operations

Backtracking

Backtracking is a general algorithmic technique that searches every possible combination in order to solve a computational problem. It incrementally builds candidates for a solution and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
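
A minimal sketch of backtracking on one such constraint-satisfaction problem, the N-queens puzzle (the function and variable names are illustrative, not from any particular library):

```python
# Backtracking sketch for N-queens: place one queen per row, extend the
# partial solution column by column, and abandon (backtrack) any branch
# that violates a constraint before it can be completed.
def solve_n_queens(n, cols=None):
    cols = cols or []          # cols[r] is the column of the queen in row r
    row = len(cols)
    if row == n:
        return cols            # every row has a queen: complete solution
    for col in range(n):
        # Constraint check: no shared column or diagonal with earlier rows.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(cols)):
            solution = solve_n_queens(n, cols + [col])
            if solution is not None:
                return solution
    return None                # dead end: caller backtracks to its last choice

print(solve_n_queens(8))  # e.g. [0, 4, 7, 5, 2, 6, 1, 3]
```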

Natural Language Processing (NLP) Evaluation

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems, covering tasks like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

  • A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
  • A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
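
As an illustration, the GLUE tasks are commonly loaded through the Hugging Face datasets library; a sketch assuming that library (it is not part of GLUE itself):

```python
# Sketch: loading one of the nine GLUE tasks (MRPC, a sentence-pair
# paraphrase task) with the Hugging Face `datasets` library.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")   # train / validation / test splits
example = mrpc["train"][0]
print(example["sentence1"])
print(example["sentence2"])
print(example["label"])               # 1 = paraphrase, 0 = not
```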

State of the Art in Natural Language Processing (NLP)
Jeff Heaton: Algorithms such as BERT, T4, ERNIE, and others claim to be the state of the art for NLP programs. But what does this mean? How is this evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I’ll explain the details of how BERT is used to perform “Question Answering”--specifically, how it’s applied to SQuAD v1.1 (Stanford Question Answering Dataset). I’ll also walk us through the following notebook, where we’ll take a model that’s already been fine-tuned on SQuAD, and apply it to our own questions and text.
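
A minimal sketch of that workflow, assuming the Hugging Face transformers library and one of its publicly available SQuAD-fine-tuned BERT checkpoints:

```python
# Sketch: applying a BERT model already fine-tuned on SQuAD to our own
# question and passage via the `transformers` question-answering pipeline.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa(
    question="Who created SQuAD?",
    context="SQuAD is a reading-comprehension dataset created by "
            "researchers at Stanford University.",
)
print(result["answer"], result["score"])  # answer is a span of the context
```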

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

Machine Learning Evaluation

Lecture 13 – Evaluation Metrics | Stanford CS224U: Natural Language Understanding | Spring 2019
Christopher Potts, Professor of Linguistics and, by courtesy, of Computer Science; Director, Stanford Center for the Study of Language and Information. Bill MacCartney, Consulting Assistant Professor; Senior Engineering Manager, Apple.

Kaggle Reading Group : An Open Source AutoML Benchmark | Kaggle
This week we're starting a new paper: An Open Source AutoML Benchmark by Gijsbers et al. from the 2019 ICML Workshop on Automated Machine Learning.


Machine Learning Model Evaluation Metrics
Maria Khalusova, Developer Advocate at JetBrains: Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you’ll ultimately use. Those coming to ML from software development are often self-taught, and practice exercises and competitions generally dictate the evaluation metric. In a real-world scenario, how do you choose an appropriate metric? This talk will explore the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision.
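
For concreteness, a sketch of a few such metrics, assuming scikit-learn (the toy labels are illustrative):

```python
# Sketch: common classification and regression metrics via scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

y_true = [0, 1, 1, 0, 1]                 # classification ground truth
y_pred = [0, 1, 0, 0, 1]                 # model predictions
print(accuracy_score(y_true, y_pred))    # 0.8 (4 of 5 correct)
print(f1_score(y_true, y_pred))          # 0.8 (harmonic mean of P and R)

y_true_r = [2.5, 0.0, 2.0]               # regression ground truth
y_pred_r = [3.0, -0.1, 2.1]              # model predictions
print(mean_squared_error(y_true_r, y_pred_r))  # 0.09
```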

Characterization and Benchmarking of Deep Learning
In this video from the HPC User Forum in Milwaukee, Natalia Vassilieva from HP Labs presents: Characterization and Benchmarking of Deep Learning.

Measuring training and inference performance of ML hardware, software, and services

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what’s missing from the benchmark, and which ones are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske: This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track. In it I discuss benchmark results that I've collected over the past year at Micron from running the MLPerf benchmark suite. AIML-301-1: Using AI/ML for Flash Performance Scaling, Part 1

Procgen


OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
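
A minimal sketch of the trial-and-error loop that Gym-style environments expose, assuming the gym package with its 0.26+ API (a random policy stands in for a learning algorithm; Procgen and SafetyGym environments follow a similar interface):

```python
# Sketch: the reinforcement-learning trial-and-error loop on a generic
# Gym environment, with a random policy standing in for a learner.
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()   # random action, for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:          # episode over: the trial ends
        break
env.close()
print(total_reward)
```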


American Productivity & Quality Center (APQC)

APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.