Benchmarks
Revision as of 19:59, 13 July 2023
- Data Science ... Governance ... Preprocessing ... Exploration ... Interoperability ... Master Data Management (MDM) ... Bias and Variances ... Benchmarks ... Datasets
- Large Language Model (LLM) ... Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
- Risk, Compliance and Regulation ... Ethics ... Privacy ... Law ... AI Governance ... AI Verification and Validation
- Case Studies
- ALFRED ... Action Learning From Realistic Environments and Directives
- Algorithm Administration
- Data Quality ... validity, accuracy, cleaning, completeness, consistency, encoding, padding, augmentation, labeling, auto-tagging, normalization, standardization, and imbalanced data
- Managed Vocabularies
- Excel ... Documents ... Database ... Graph ... LlamaIndex
- Analytics ... Visualization ... Graphical Tools ... Diagrams & Business Analysis ... Requirements ... Loop ... Bayes ... Network Pattern
- Development ... Notebooks ... AI Pair Programming ... Codeless, Generators, Drag n' Drop ... AIOps/MLOps ... AIaaS/MLaaS
- Backpropagation ... FFNN ... Forward-Forward ... Activation Functions ... Softmax ... Loss ... Boosting ... Gradient Descent ... Hyperparameter ... Manifold Hypothesis ... PCA
- Strategy & Tactics ... Project Management ... Best Practices ... Checklists ... Project Check-in ... Evaluation ... Measures
- AI Solver ... Algorithms ... Administration ... Model Search ... Discriminative vs. Generative ... Optimizer ... Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
- Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum
AI Consciousness Testing
- Singularity ... Sentience ... AGI ... Curious Reasoning ... Emergence ... Moonshots ... Explainable AI ... Automated Learning
- Theory of Mind (ToM)
- In-Context Learning (ICL) ... Context ... Causation vs. Correlation ... Autocorrelation ... Out-of-Distribution (OOD) Generalization ... Transfer Learning
Turing Test
The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia
"Today an AI has to dumb down to pass the Turing Test" - Ray Kurzweil
Large Language Model (LLM) Evaluation
- Large Language Model (LLM)
- Generative AI ... Conversational AI ... ChatGPT | OpenAI ... Bing | Microsoft ... Bard | Google ... Claude | Anthropic ... Perplexity ... You ... Ernie | Baidu
- Claude | Anthropic
- LLM Token / Parameter / Weight
- In-Context Learning (ICL) ... Context ... Causation vs. Correlation ... Autocorrelation ... Out-of-Distribution (OOD) Generalization ... Transfer Learning
- Holistic Evaluation of Language Models (HELM) | Stanford ... a living benchmark that aims to improve the transparency of language models.
- Blazingly Fast LLM Evaluation for In-Context Learning | Jeremy Dohmann - Mosaic
- Evals - GitHub ... a framework for evaluating LLMs (large language models) or systems built using LLMs as components.
- Evaluating Large Language Models (LLMs) with Eleuther AI | Bharat Ramanathan - Weights & Biases ... With a flexible and tokenization-agnostic interface, the lm-eval library provides a single framework for evaluating and reporting auto-regressive language models on various Natural Language Understanding (NLU) tasks. There are currently over 200 evaluation tasks that support the evaluation of models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.
Benchmarks for an LLM:
- Ability to add attachments to prompts: attachments such as images or documents let an LLM incorporate additional information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
- Performance on the bar exam multiple-choice section: The bar exam is a standardized test that is required to practice law in the United States. The multiple-choice section tests knowledge of legal concepts and principles. Claude 2: Scored 76.5%
- Performance on the GRE reading and writing exams: the GRE is a standardized test that is often required for admission to graduate programs. The reading and writing sections test reading comprehension, analytical writing, and critical thinking skills. Claude 2: scored above the 90th percentile, indicating high proficiency in these skills.
- Performance on the GRE quantitative reasoning exam: this section tests mathematical and analytical skills. Claude 2: scored similarly to the median applicant, indicating average proficiency in these skills.
- Input length limit: the maximum length of the input prompt that an LLM can handle, measured in tokens (a token is a sequence of characters that represents a unit of meaning in natural language processing). Claude 2: a limit of 100K tokens per prompt means the LLM can handle prompts of up to 100,000 tokens in length.
- Context window limit: Maximum amount of context that an LLM can consider when generating a response to a prompt. Claude 2: A context window of up to 100K means that the LLM can consider up to 100,000 tokens of context when generating a response.
- Code Generation on HumanEval: This test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency. Claude 2 on Python coding test: 71.2%
- GSM8k math problem set: This problem set evaluates the LLM's ability to solve mathematical problems of varying difficulty. Claude 2: 88%
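Input and context-window limits like the ones above are counted in tokens, not characters, so it can help to estimate a prompt's token count before sending it. The sketch below is a rough pre-flight check using a crude whitespace split as a stand-in for a real tokenizer; actual vendor tokenizers (BPE and similar) will count differently, which is why the check leaves headroom:

```python
def approx_token_count(text: str) -> int:
    """Very rough token estimate: real tokenizers (BPE etc.) differ,
    but whitespace words are a serviceable lower bound for English."""
    return len(text.split())

def fits_context(prompt: str, limit: int = 100_000, margin: float = 0.8) -> bool:
    """Check a prompt against a token limit, keeping headroom (margin)
    because the whitespace estimate undercounts subword tokens."""
    return approx_token_count(prompt) <= int(limit * margin)

prompt = "Summarize the attached report in three bullet points."
print(approx_token_count(prompt))  # 8
print(fits_context(prompt))        # True
```

The `margin` parameter is an illustrative assumption, not part of any vendor API; for production use, count tokens with the model provider's own tokenizer.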
Several factors should be considered when evaluating Large Language Models (LLMs), including:
- authenticity
- speed
- grammar
- readability
- unbiasedness
- backtracking
- safety
- responsibility
- understanding the context
- text operations
Backtracking
Backtracking is a general algorithmic technique that searches every possible combination in order to solve a computational problem. It incrementally builds candidates for the solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
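As a concrete illustration (using N-queens rather than one of the puzzles named above), a backtracking solver extends a partial candidate one row at a time and abandons it the moment a constraint is violated:

```python
def solve_n_queens(n: int, cols: tuple = ()) -> list:
    """Return one placement of n queens (one per row) as column indices,
    or [] if none exists. cols[r] is the queen's column in row r."""
    row = len(cols)
    if row == n:                      # complete candidate: a full solution
        return list(cols)
    for col in range(n):
        # Constraint check: abandon (backtrack) on same column or diagonal.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(cols)):
            result = solve_n_queens(n, cols + (col,))
            if result:
                return result         # first complete solution found
    return []                         # no column works in this row: backtrack

print(solve_n_queens(4))  # [1, 3, 0, 2]
```

Note how most partial placements are discarded long before all four queens are placed; that early pruning is what distinguishes backtracking from brute-force enumeration of every full candidate.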
Natural Language Processing (NLP) Evaluation
- Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
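Leaderboard scores on GLUE tasks reduce to standard metrics; the MRPC paraphrase task, for example, is reported as accuracy and F1. A self-contained sketch of those two metrics over toy predictions (not the official GLUE scoring code):

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def f1_binary(preds, labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds  = [1, 0, 1, 1, 0, 1]   # toy model outputs (1 = paraphrase)
labels = [1, 0, 0, 1, 0, 0]   # toy gold labels
print(accuracy(preds, labels))   # 4 of 6 correct
print(f1_binary(preds, labels))
```

Accuracy and F1 can diverge sharply on imbalanced tasks, which is why GLUE reports both for MRPC rather than either one alone.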
The Stanford Question Answering Dataset (SQuAD)
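SQuAD answers are conventionally scored with exact match (EM) and token-overlap F1 after text normalization. The sketch below is a simplified version of that scoring scheme: the official evaluation script also strips articles and punctuation, while this one only lowercases and splits:

```python
from collections import Counter

def normalize(text: str) -> list:
    """Simplified normalization: lowercase and split on whitespace.
    (The official SQuAD script also removes articles and punctuation.)"""
    return text.lower().split()

def exact_match(pred: str, gold: str) -> bool:
    """EM: normalized prediction equals the normalized gold answer."""
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """F1 over the multiset of overlapping tokens (partial credit)."""
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("Alan Turing", "alan turing"))            # True
print(round(token_f1("Alan M. Turing", "Alan Turing"), 3))  # 0.8
```

EM rewards only perfect answers, while token F1 gives partial credit for overlapping spans; SQuAD leaderboards report both, averaged over all questions.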
Machine Learning Evaluation
Evaluating Machine Learning (ML) Hardware, Software, and Services
- Artificial Intelligence (AI) ... Machine Learning (ML) ... Deep Learning ... Neural Network ... Reinforcement ... Learning Techniques
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
- MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat ... a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
American Productivity & Quality Center (APQC)
- American Productivity & Quality Center (APQC) ... the world's foremost authority in benchmarking, best practices, process and performance improvement, and knowledge management.
APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.