Benchmarks
YouTube ... Quora ... Google search ... Google News ... Bing News
- Data Science ... Governance ... Preprocessing ... Exploration ... Interoperability ... Master Data Management (MDM) ... Bias and Variances ... Benchmarks ... Datasets
- Large Language Model (LLM) ... Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
- Risk, Compliance and Regulation ... Ethics ... Privacy ... Law ... AI Governance ... AI Verification and Validation
- Case Studies
- ALFRED ... Action Learning From Realistic Environments and Directives
- Algorithm Administration
- Data Quality ... validity, accuracy, cleaning, completeness, consistency, encoding, padding, augmentation, labeling, auto-tagging, normalization, standardization, and imbalanced data
- Managed Vocabularies
- Excel ... Documents ... Database; Vector & Relational ... Graph ... LlamaIndex
- Analytics ... Visualization ... Graphical Tools ... Diagrams & Business Analysis ... Requirements ... Loop ... Bayes ... Network Pattern
- Development ... Notebooks ... AI Pair Programming ... Codeless ... Hugging Face ... AIOps/MLOps ... AIaaS/MLaaS
- Backpropagation ... FFNN ... Forward-Forward ... Activation Functions ... Softmax ... Loss ... Boosting ... Gradient Descent ... Hyperparameter ... Manifold Hypothesis ... PCA
- Strategy & Tactics ... Project Management ... Best Practices ... Checklists ... Project Check-in ... Evaluation ... Measures
- AI Solver ... Algorithms ... Administration ... Model Search ... Discriminative vs. Generative ... Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
- Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum
- The Olympics of AI: Benchmarking Machine Learning Systems | Matthew Stewart - Towards Data Science - Medium ... How do benchmarks birth breakthroughs?
You can’t improve what you don’t measure. — Peter Drucker
Contents
AI Consciousness Testing
YouTube ... Quora ... Google search ... Google News ... Bing News
- Artificial General Intelligence (AGI) to Singularity ... Curious Reasoning ... Emergence ... Moonshots ... Explainable AI ... Automated Learning
- Theory of Mind (ToM)
- Perspective ... Context ... In-Context Learning (ICL) ... Transfer Learning ... Out-of-Distribution (OOD) Generalization
- Causation vs. Correlation ... Autocorrelation ... Convolution vs. Cross-Correlation (Autocorrelation)
Turing Test
The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia
Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil
Chinese Room Thought Experiment
The Chinese Room is a thought experiment proposed by John Searle in 1980 to argue that a computer running a program cannot thereby be said to have a mind or to be conscious. Searle's argument has been criticized by many philosophers and computer scientists, yet it remains one of the most influential objections to the claim that computers can genuinely understand or think.
In the experiment, Searle imagines himself locked in a room with a set of rules for manipulating Chinese symbols. The rules are written in English, which Searle understands, but the Chinese symbols are meaningless to him. He receives Chinese characters on slips of paper, processes them according to the rules, and passes back other Chinese characters in response. To an outside observer it appears that whoever is in the room understands Chinese and is holding a conversation; Searle himself, however, does not understand Chinese at all. He is simply following the rules blindly.
Searle argues that this shows that a computer, which is essentially a machine that follows rules, cannot be said to understand Chinese or to have a mind. The computer may be able to produce intelligent-sounding output, but it does not have the same kind of understanding that a human being has.
The Chinese Room Experiment has been widely discussed and debated by philosophers and computer scientists. Some have argued that Searle's argument is flawed, while others have agreed with his conclusion. The Chinese Room Experiment is a complex and challenging thought experiment, and there is no easy answer to the question of whether or not it succeeds in its goal. However, it is a thought-provoking experiment that has helped to shape the debate about artificial intelligence and the nature of mind. Here are some of the key points of Searle's argument:
- Understanding a language is not just about manipulating symbols according to rules. It also requires having a grasp of the meaning of the symbols.
- A computer can manipulate symbols, but it does not have the same kind of understanding that a human being has.
- The Chinese Room Experiment shows that a computer cannot be said to understand Chinese, even if it can produce intelligent-sounding output.
Large Language Model (LLM) Evaluation
YouTube ... Quora ... Google search ... Google News ... Bing News
- Large Language Model (LLM)
- Conversational AI ... ChatGPT | OpenAI ... Bing/Copilot | Microsoft ... Gemini | Google ... Claude | Anthropic ... Perplexity ... You ... phind ... Ernie | Baidu
- Claude | Anthropic
- LLM Token / Parameter / Weight
- In-Context Learning (ICL) ... Context ... Causation vs. Correlation ... Autocorrelation ... Out-of-Distribution (OOD) Generalization ... Transfer Learning
- Holistic Evaluation of Language Models (HELM) | Stanford ... a living benchmark that aims to improve the transparency of language models.
- Blazingly Fast LLM Evaluation for In-Context Learning | Jeremy Dohmann - Mosaic
- Evals - GitHub ... a framework for evaluating LLMs (large language models) or systems built using LLMs as components.
- Evaluating Large Language Models (LLMs) with Eleuther AI | Bharat Ramanathan - Weights & Biases ... With a flexible and tokenization-agnostic interface, the lm-eval library provides a single framework for evaluating and reporting auto-regressive language models on various Natural Language Understanding (NLU) tasks. There are currently over 200 evaluation tasks that support the evaluation of models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.
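A hedged sketch of driving the lm-eval harness from Python is below. The entry point and the model identifier ("hf" vs. the older "hf-causal") vary between harness releases, so treat this as a starting point to check against the installed version rather than a definitive recipe.

```python
# Sketch only: evaluate a Hugging Face checkpoint with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Argument names may differ
# slightly depending on the harness version installed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                      # Hugging Face causal-LM backend ("hf-causal" in older releases)
    model_args="pretrained=gpt2",    # any Hugging Face checkpoint
    tasks=["hellaswag", "lambada_openai"],
    num_fewshot=0,
    limit=100,                       # evaluate a small subset as a quick smoke test
)
print(results["results"])            # per-task metrics such as accuracy
```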
Benchmarks for an LLM:
- Ability to add attachments to prompts: attachments such as images or documents let an LLM incorporate information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
- Performance on the bar exam multiple-choice section: The bar exam is a standardized test that is required to practice law in the United States. The multiple-choice section tests knowledge of legal concepts and principles. Claude 2: Scored 76.5%
- Performance on the GRE reading and writing exams: the GRE is a standardized test that is often required for admission to graduate programs; the reading and writing sections test reading comprehension, analytical writing, and critical thinking skills. Claude 2: scored above the 90th percentile, indicating high proficiency in these skills.
- Performance on the GRE quantitative reasoning exam: this section tests mathematical and analytical skills. Claude 2: scored similarly to the median applicant, indicating average proficiency in these skills.
- Input length limit: the maximum length of the input prompt that an LLM can handle. A token is a sequence of characters that represents a unit of meaning in natural language processing. Claude 2: A limit of 100K tokens per prompt means that the LLM can handle prompts of up to 100,000 tokens in length.
- Context window limit: Maximum amount of context that an LLM can consider when generating a response to a prompt. Claude 2: A context window of up to 100K means that the LLM can consider up to 100,000 tokens of context when generating a response.
- Code Generation on HumanEval: This test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency. Claude 2 on Python coding test: 71.2%
- GSM8k math problem set: This problem set evaluates the LLM's ability to solve mathematical problems of varying difficulty. Claude 2: 88%
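Scores like the GSM8k figure above usually reduce to exact-match accuracy over a fixed problem set: prompt the model, extract its final answer, and compare it with the reference. The sketch below is a minimal, model-agnostic version; `ask_model` and the two sample problems are placeholders, not part of any official benchmark harness.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a free-form model response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_accuracy(problems, ask_model) -> float:
    """GSM8k-style scoring over (question, reference_answer) pairs."""
    correct = 0
    for question, answer in problems:
        response = ask_model(question)                 # placeholder LLM call
        correct += int(extract_final_number(response) == answer)
    return correct / len(problems)

# Toy stand-ins for real benchmark items and a real model.
sample_problems = [
    ("If a pencil costs 3 dollars and you buy 4, how much do you pay?", "12"),
    ("What is 7 + 5?", "12"),
]
fake_model = lambda q: "Let's think step by step. The answer is 12."
print(exact_match_accuracy(sample_problems, fake_model))   # 1.0
```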
There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:
- authenticity
- speed
- grammar
- readability
- unbiasedness
- backtracking
- safety
- responsibility
- understanding the context
- text operations
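One lightweight way to operationalize a multi-factor review like the list above is a weighted rubric: score each factor on a small scale and aggregate. The factor weights below are purely illustrative, not recommended values.

```python
# Illustrative only: combine per-factor scores (0-5 scale) into one weighted score.
FACTOR_WEIGHTS = {
    "authenticity": 0.15, "speed": 0.05, "grammar": 0.10, "readability": 0.10,
    "unbiasedness": 0.15, "backtracking": 0.05, "safety": 0.20,
    "responsibility": 0.05, "context_understanding": 0.10, "text_operations": 0.05,
}

def weighted_score(scores):
    """Weighted average over the factors actually scored."""
    total_weight = sum(FACTOR_WEIGHTS[f] for f in scores)
    return sum(FACTOR_WEIGHTS[f] * s for f, s in scores.items()) / total_weight

print(weighted_score({"safety": 5, "authenticity": 4, "readability": 3}))  # ≈ 4.22
```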
A Survey on Evaluation of Large Language Models
- A Survey on Evaluation of Large Language Models | Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. Yu, Q. Yang, X. Xie - arXiv ... The survey covers seven major categories of LLM trustworthiness:
- Reliability: LLMs should be able to consistently generate accurate and truthful outputs, even when presented with new or challenging inputs.
- Safety: LLMs should not generate outputs that are harmful or dangerous, such as outputs that promote violence or hate speech.
- Fairness: LLMs should not discriminate against any individual or group of individuals, regardless of their race, gender, sexual orientation, or other protected characteristics.
- Resistance to misuse: LLMs should be designed in a way that makes it difficult for them to be used for malicious purposes, such as generating fake news or propaganda.
- Explainability and reasoning: LLMs should be able to explain their reasoning behind their outputs, so that users can understand how they work and make informed decisions about how to use them.
- Adherence to social norms: LLMs should generate outputs that are consistent with social norms and values, for example avoiding offensive language and not promoting harmful stereotypes.
- Robustness: LLMs should be able to withstand attacks and manipulation, such as being fed deliberately misleading or harmful data.
Popular Benchmarks for Testing LLMs
- AI2 Reasoning Challenge (ARC): designed to promote research in advanced question-answering, particularly questions that require reasoning. The ARC dataset consists of 7,787 science exam questions from grade 3 to grade 9, with a supporting knowledge base of 14.3M unstructured text passages. The benchmark evaluates the performance of LLMs in answering multiple-choice questions.
- WinoGrande: evaluate the ability of LLMs to perform commonsense reasoning. The benchmark consists of 44,000 examples that require the model to understand the meaning of words in context and to reason about the relationships between entities.
- Advanced Reasoning Benchmark (ARB): evaluate the ability of LLMs to perform complex reasoning tasks. The benchmark consists of 1,000 examples that require the model to perform multi-step reasoning and to integrate information from multiple sources.
- Holistic Evaluation of Language Models (HELM): evaluate the performance of LLMs in multiple tasks, including language modeling, question answering, and summarization. The benchmark consists of 57 datasets covering a wide range of tasks and domains.
- Big Bench (BIG-bench): evaluates the performance of LLMs on a collaboratively assembled collection of more than 200 diverse tasks, many of which require the model to perform complex reasoning and to integrate information from multiple sources.
- Massive Multitask Language Understanding (MMLU): evaluates LLMs on multiple-choice questions drawn from 57 subjects spanning STEM, the humanities, and the social sciences, testing both breadth of knowledge and problem-solving ability (a minimal multiple-choice scoring sketch follows this list).
- SQuAD: tests LLMs on their ability to answer questions about a given passage of text. The SQuAD dataset is a collection of questions and answers that are created by crowdworkers on a set of Wikipedia articles.
- GLUE: tests LLMs on a variety of natural language understanding tasks, including sentiment analysis, text classification, and question answering.
- SuperGLUE: an extension of the GLUE benchmark with more challenging tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC). For reference, the original GLUE benchmark comprises the following tasks:
- CoLA: Corpus of Linguistic Acceptability
- SST-2: Stanford Sentiment Treebank
- MRPC: Microsoft Research Paraphrase Corpus
- STS-B: Semantic Textual Similarity Benchmark
- QQP: Quora Question Pairs
- MNLI: MultiNLI
- QNLI: Question Natural Language Inference
- RTE: Recognizing Textual Entailment
- WNLI: Winograd Natural Language Inference (derived from the Winograd Schema Challenge)
- AX: the GLUE diagnostic dataset for textual entailment
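Multiple-choice benchmarks such as ARC, MMLU, and many GLUE/SuperGLUE tasks are commonly scored by having the model rank the candidate answers, often by the (length-normalized) log-likelihood it assigns to each choice, and taking the top-ranked one. The sketch below assumes a hypothetical `log_likelihood(prompt, continuation)` function standing in for whatever model is under test; it is one common scoring recipe, not the only one.

```python
def pick_choice(question, choices, log_likelihood):
    """Index of the choice the model ranks highest.

    log_likelihood(prompt, continuation) is a placeholder for the model under
    test; dividing by the choice length keeps long answers from being penalized.
    """
    prompt = f"Question: {question}\nAnswer:"
    scores = [
        log_likelihood(prompt, " " + choice) / max(len(choice.split()), 1)
        for choice in choices
    ]
    return max(range(len(choices)), key=lambda i: scores[i])

def multiple_choice_accuracy(items, log_likelihood):
    """items: list of (question, choices, correct_index) triples."""
    hits = sum(pick_choice(q, c, log_likelihood) == gold for q, c, gold in items)
    return hits / len(items)
```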
Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks
Evaluating Large Language Models on Legal Reasoning
LegalBench: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
Natural Language Processing (NLP) Evaluation
- Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
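As a concrete illustration, the Hugging Face datasets and evaluate libraries expose the GLUE tasks together with their official metrics. The snippet below scores dummy predictions on MRPC (one of the nine tasks) and is only a sketch of the workflow; library APIs can shift between versions.

```python
from datasets import load_dataset   # pip install datasets evaluate
import evaluate

# MRPC: paraphrase detection, one of the nine GLUE tasks.
mrpc = load_dataset("glue", "mrpc", split="validation")
metric = evaluate.load("glue", "mrpc")

# Dummy predictions (everything labeled "paraphrase"); a real evaluation
# would substitute model outputs here.
predictions = [1] * len(mrpc)
print(metric.compute(predictions=predictions, references=mrpc["label"]))
# -> {'accuracy': ..., 'f1': ...}
```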
The Stanford Question Answering Dataset (SQuAD)
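SQuAD systems are scored with exact match (EM) and token-level F1 against the reference answers. A minimal sketch using the evaluate library's squad metric follows; the IDs and answers are made up, and the dictionary format follows that metric's documentation.

```python
import evaluate  # pip install evaluate

squad_metric = evaluate.load("squad")

# Made-up example in the metric's expected input format.
predictions = [{"id": "q1", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "q1",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```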
Machine Learning Evaluation
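For classical supervised models, evaluation typically starts with held-out test metrics such as accuracy, F1, and a confusion matrix. The sketch below uses scikit-learn on made-up labels purely to illustrate the shape of that workflow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up ground-truth labels and model predictions for a binary task.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```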
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat ... a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
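A hedged sketch of spinning up one of the 16 Procgen environments through the Gym interface, following the usage shown in the openai/procgen README (option names and the reset/step return values differ across Gym and Procgen versions):

```python
import gym  # pip install gym procgen

# CoinRun, the environment Procgen grew out of. num_levels=0 requests an
# unlimited stream of procedurally generated levels, which is what makes the
# benchmark a test of generalization rather than memorization.
env = gym.make("procgen:procgen-coinrun-v0",
               num_levels=0, start_level=0, distribution_mode="easy")

obs = env.reset()
total_reward = 0.0
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())  # random policy
    total_reward += reward
    if done:
        obs = env.reset()
print("reward collected by a random policy:", total_reward)
```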
Evaluating Machine Learning (ML) Hardware, Software, and Services
- Artificial Intelligence (AI) ... Machine Learning (ML) ... Deep Learning ... Neural Network ... Reinforcement ... Learning Techniques
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
- MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch
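MLPerf ships its own harnesses and submission rules, but the core idea behind the inference suite is measuring latency and throughput under a fixed workload. The toy, framework-agnostic timing loop below illustrates that idea only; the `predict` callable is a stand-in for a real model, and this is not the MLPerf methodology itself.

```python
import statistics
import time

def benchmark(predict, batch, warmup=10, iters=100):
    """Crude latency/throughput measurement for a fixed batch workload."""
    for _ in range(warmup):                 # warm-up runs are not timed
        predict(batch)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        predict(batch)
        latencies.append(time.perf_counter() - start)
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]   # ~95th percentile
    return {"p50_s": p50, "p95_s": p95, "items_per_s": len(batch) / p50}

# Stand-in "model": square every element of the batch.
print(benchmark(lambda xs: [x * x for x in xs], batch=list(range(1024))))
```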
Backtracking
Backtracking is a general algorithmic technique that systematically searches the space of possible combinations to solve a computational problem. It incrementally builds candidates to the solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot possibly be completed to a valid solution. In machine learning and AI more broadly, backtracking can be used to solve constraint-satisfaction problems such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
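Since the paragraph above names Sudoku as a typical constraint-satisfaction target, here is a compact, illustrative backtracking solver: tentatively place a digit, recurse, and undo the placement ("backtrack") as soon as the partial grid can no longer be completed.

```python
def solve_sudoku(grid):
    """Solve a 9x9 Sudoku in place; 0 marks an empty cell. Returns True if solved."""
    def valid(r, c, d):
        if d in grid[r] or any(grid[i][c] == d for i in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))

    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if valid(r, c, d):
                        grid[r][c] = d        # tentatively extend the candidate
                        if solve_sudoku(grid):
                            return True
                        grid[r][c] = 0        # backtrack: undo and try the next digit
                return False                  # no digit fits: abandon this branch
    return True                               # no empty cells remain
```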
American Productivity & Quality Center (APQC)
- American Productivity & Quality Center (APQC) ... the world's foremost authority in benchmarking, best practices, process and performance improvement, and knowledge management.
APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.