Benchmarks
YouTube ... Quora ... Google search ... Google News ... Bing News
- Data Science ... Governance ... Preprocessing ... Exploration ... Interoperability ... Master Data Management (MDM) ... Bias and Variances ... Benchmarks ... Datasets
- Large Language Model (LLM) ... Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
- Risk, Compliance and Regulation ... Ethics ... Privacy ... Law ... AI Governance ... AI Verification and Validation
- Case Studies
- ALFRED ... Action Learning From Realistic Environments and Directives
- Algorithm Administration
- Data Quality ... validity, accuracy, cleaning, completeness, consistency, encoding, padding, augmentation, labeling, auto-tagging, normalization, standardization, and imbalanced data
- Managed Vocabularies
- Excel ... Documents ... Database; Vector & Relational ... Graph ... LlamaIndex
- Analytics ... Visualization ... Graphical Tools ... Diagrams & Business Analysis ... Requirements ... Loop ... Bayes ... Network Pattern
- Development ... Notebooks ... AI Pair Programming ... Codeless ... Hugging Face ... AIOps/MLOps ... AIaaS/MLaaS
- Backpropagation ... FFNN ... Forward-Forward ... Activation Functions ... Softmax ... Loss ... Boosting ... Gradient Descent ... Hyperparameter ... Manifold Hypothesis ... PCA
- Strategy & Tactics ... Project Management ... Best Practices ... Checklists ... Project Check-in ... Evaluation ... Measures
- AI Solver ... Algorithms ... Administration ... Model Search ... Discriminative vs. Generative ... Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
- Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum
- The Olympics of AI: Benchmarking Machine Learning Systems | Matthew Stewart - Towards Data Science - Medium ... How do benchmarks birth breakthroughs?
You can’t improve what you don’t measure. — Peter Drucker
AI Consciousness Testing
YouTube ... Quora ... Google search ... Google News ... Bing News
- Artificial General Intelligence (AGI) to Singularity ... Curious Reasoning ... Emergence ... Moonshots ... Explainable AI ... Automated Learning
- Theory of Mind (ToM)
- Perspective ... Context ... In-Context Learning (ICL) ... Transfer Learning ... Out-of-Distribution (OOD) Generalization
- Causation vs. Correlation ... Autocorrelation ... Convolution vs. Cross-Correlation (Autocorrelation)
Turing Test
The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel, such as a computer keyboard and screen, so the result would not depend on the machine's ability to render words as speech. If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test. The test results would not depend on the machine's ability to give correct answers to questions, only on how closely its answers resembled those a human would give. - Turing Test | Wikipedia
Today an AI has to dumb down to pass the Turing Test - Ray Kurzweil
Chinese Room Thought Experiment
The Chinese Room Experiment is a thought experiment proposed by John Searle in 1980 to argue against the claim that a computer can have a mind or be conscious. Searle's argument has been criticized by some philosophers and computer scientists. However, it remains a powerful argument against the claim that computers can have a mind or be conscious.
In the experiment, Searle imagines himself locked in a room with a set of rules for manipulating Chinese symbols. The rules are written in English, which Searle understands, but the Chinese symbols are meaningless to him. He is given Chinese characters on slips of paper, which he then processes according to the rules. He then produces Chinese characters on slips of paper in response. To an outside observer, it would appear that Searle understands Chinese and is having a conversation with them. However, Searle himself does not understand Chinese at all. He is simply following the rules blindly.
Searle argues that this shows that a computer, which is essentially a machine that follows rules, cannot be said to understand Chinese or to have a mind. The computer may be able to produce intelligent-sounding output, but it does not have the same kind of understanding that a human being has.
The Chinese Room Experiment has been widely discussed and debated by philosophers and computer scientists. Some have argued that Searle's argument is flawed, while others have agreed with his conclusion. The Chinese Room Experiment is a complex and challenging thought experiment, and there is no easy answer to the question of whether or not it succeeds in its goal. However, it is a thought-provoking experiment that has helped to shape the debate about artificial intelligence and the nature of mind. Here are some of the key points of Searle's argument:
- Understanding a language is not just about manipulating symbols according to rules. It also requires having a grasp of the meaning of the symbols.
- A computer can manipulate symbols, but it does not have the same kind of understanding that a human being has.
- The Chinese Room Experiment shows that a computer cannot be said to understand Chinese, even if it can produce intelligent-sounding output.
Large Language Model (LLM) Evaluation
YouTube ... Quora ... Google search ... Google News ... Bing News
- Large Language Model (LLM)
- Conversational AI ... ChatGPT | OpenAI ... Bing/Copilot | Microsoft ... Gemini | Google ... Claude | Anthropic ... Perplexity ... You ... phind ... Ernie | Baidu
- Claude | Anthropic
- LLM Token / Parameter / Weight
- In-Context Learning (ICL) ... Context ... Causation vs. Correlation ... Autocorrelation ... Out-of-Distribution (OOD) Generalization ... Transfer Learning
- Holistic Evaluation of Language Models (HELM) | Stanford ... a living benchmark that aims to improve the transparency of language models.
- Blazingly Fast LLM Evaluation for In-Context Learning | Jeremy Dohmann - Mosaic
- Evals - GitHub ... a framework for evaluating LLMs (large language models) or systems built using LLMs as components.
- Evaluating Large Language Models (LLMs) with Eleuther AI | Bharat Ramanathan - Weights & Biases ... With a flexible and tokenization-agnostic interface, the lm-eval library provides a single framework for evaluating and reporting auto-regressive language models on various Natural Language Understanding (NLU) tasks. There are currently over 200 evaluation tasks that support the evaluation of models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.
Benchmarks for an LLM:
- Ability to add attachments to prompts: attachments such as images or documents allow an LLM to incorporate additional information beyond the textual prompt, which can improve its ability to generate accurate and relevant responses. Claude 2: prompts can include attachments
- Performance on the bar exam multiple-choice section: The bar exam is a standardized test that is required to practice law in the United States. The multiple-choice section tests knowledge of legal concepts and principles. Claude 2: Scored 76.5%
- Performance on the GRE reading and writing sections: the GRE is a standardized test that is often required for admission to graduate programs. The reading and writing sections test reading comprehension, analytical writing, and critical thinking skills. Claude 2: A score above the 90th percentile indicates that the LLM is highly proficient in these skills.
- Performance on the GRE quantitative reasoning section: This section tests mathematical and analytical skills. Claude 2: A score similar to the median applicant indicates that the LLM has average proficiency in these skills.
- Input length limit: the maximum length of the input prompt that an LLM can handle, measured in tokens. A token is a short chunk of text, often a word or word fragment, produced by the model's tokenizer. Claude 2: A limit of 100K tokens per prompt means that the LLM can handle prompts of up to 100,000 tokens in length.
- Context window limit: Maximum amount of context that an LLM can consider when generating a response to a prompt. Claude 2: A context window of up to 100K means that the LLM can consider up to 100,000 tokens of context when generating a response.
- Code Generation on HumanEval: This test evaluates the LLM's ability to write code that meets certain criteria, such as correctness and efficiency. Claude 2 on Python coding test: 71.2%
- GSM8k math problem set: This problem set evaluates the LLM's ability to solve grade-school math word problems of varying difficulty. Claude 2: 88% (a minimal scoring sketch follows this list)
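Most of the scores above reduce to the same loop: send each benchmark item to the model, extract an answer, and compare it to a reference. The sketch below shows GSM8k-style exact-match accuracy as a minimal illustration; `ask_model`, `extract_final_answer`, and the toy problem set are hypothetical placeholders rather than part of any real benchmark harness.

```python
# Minimal sketch: exact-match accuracy scoring in the style of GSM8k.
# `ask_model` is a hypothetical stand-in for a call to the LLM under test.

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API/client call."""
    return "42"  # placeholder answer so the sketch runs end to end

def extract_final_answer(text: str) -> str:
    """Take the last whitespace-separated token as the numeric answer."""
    tokens = text.strip().split()
    return tokens[-1].rstrip(".") if tokens else ""

def exact_match_accuracy(problems: list[dict]) -> float:
    """problems: [{'question': ..., 'answer': ...}, ...]"""
    correct = 0
    for p in problems:
        prediction = extract_final_answer(ask_model(p["question"]))
        correct += prediction == str(p["answer"])
    return correct / len(problems)

if __name__ == "__main__":
    toy_set = [
        {"question": "What is 6 * 7?", "answer": 42},
        {"question": "What is 10 + 5?", "answer": 15},
    ]
    print(f"accuracy = {exact_match_accuracy(toy_set):.2%}")
```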
There are several factors that should be considered while evaluating Large Language Models (LLMs). These include:
- authenticity
- speed
- grammar
- readability
- unbiasedness
- backtracking
- safety
- responsibility
- understanding the context
- text operations
A Survey on Evaluation of Large Language Models
- A Survey on Evaluation of Large Language Models | Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. Yu, Q. Yang, X. Xie - arXiv ... The survey covers seven major categories of LLM trustworthiness:
- Reliability: LLMs should be able to consistently generate accurate and truthful outputs, even when presented with new or challenging inputs.
- Safety: LLMs should not generate outputs that are harmful or dangerous, such as outputs that promote violence or hate speech.
- Fairness: LLMs should not discriminate against any individual or group of individuals, regardless of their race, gender, sexual orientation, or other protected characteristics.
- Resistance to misuse: LLMs should be designed in a way that makes it difficult for them to be used for malicious purposes, such as generating fake news or propaganda.
- Explainability and reasoning: LLMs should be able to explain their reasoning behind their outputs, so that users can understand how they work and make informed decisions about how to use them.
- Adherence to social norms: LLMs should generate outputs that are consistent with social norms and values, such as avoiding offensive language or promoting harmful stereotypes.
- Robustness: LLMs should be able to withstand attacks and manipulation, such as being fed deliberately misleading or harmful data.
Popular Benchmarks for Testing LLMs
- AI2 Reasoning Challenge (ARC): designed to promote research in advanced question-answering, particularly questions that require reasoning. The ARC dataset consists of 7,787 science exam questions from grade 3 to grade 9, with a supporting knowledge base of 14.3M unstructured text passages. The benchmark evaluates the performance of LLMs in answering multiple-choice questions (see the scoring sketch after this list).
- WinoGrande: evaluates the ability of LLMs to perform commonsense reasoning. The benchmark consists of 44,000 examples that require the model to understand the meaning of words in context and to reason about the relationships between entities.
- Advanced Reasoning Benchmark (ARB): evaluates the ability of LLMs to perform complex reasoning tasks. The benchmark consists of 1,000 examples that require the model to perform multi-step reasoning and to integrate information from multiple sources.
- Holistic Evaluation of Language Models (HELM): evaluates the performance of LLMs across a broad set of scenarios, including language modeling, question answering, and summarization, scored on multiple metrics such as accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
- Big Bench (BIG-bench): evaluates the performance of LLMs on a wide range of tasks, including language modeling, question answering, and summarization. The benchmark consists of more than 200 diverse, community-contributed tasks that require the model to perform complex reasoning and to integrate information from multiple sources.
- Massive Multitask Language Understanding (MMLU): evaluates the breadth of an LLM's knowledge and problem-solving ability using multiple-choice questions across 57 subjects spanning STEM, the humanities, the social sciences, and more.
- SQuAD: tests LLMs on their ability to answer questions about a given passage of text. The SQuAD dataset is a collection of questions and answers that are created by crowdworkers on a set of Wikipedia articles.
- GLUE: tests LLMs on a variety of natural language understanding tasks, including sentiment analysis, text classification, and question answering. The tasks in the GLUE benchmark are:
- CoLA: Corpus of Linguistic Acceptability
- SST-2: Stanford Sentiment Treebank
- MRPC: Microsoft Research Paraphrase Corpus
- STS-B: Semantic Textual Similarity Benchmark
- QQP: Quora Question Pairs
- MNLI: Multi-Genre Natural Language Inference (MultiNLI)
- QNLI: Question Natural Language Inference
- RTE: Recognizing Textual Entailment
- WNLI: Winograd Natural Language Inference (based on the Winograd Schema Challenge)
- AX: the GLUE diagnostic set for analyzing model behavior on textual entailment
- SuperGLUE: an extension of the GLUE benchmark with more challenging tasks, including BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC.
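Many of the benchmarks above (ARC, MMLU, and the GLUE family) are multiple-choice, and evaluation harnesses commonly score them by comparing the model's likelihood for each candidate answer. The sketch below is a minimal illustration of that idea; `loglikelihood` is a hypothetical stand-in for a real model call, and the toy item is not drawn from any benchmark.

```python
# Sketch of log-likelihood-based multiple-choice scoring (ARC/MMLU style).
# `loglikelihood` is a hypothetical placeholder for the model's scoring call.
import random

def loglikelihood(context: str, continuation: str) -> float:
    """Hypothetical model call: approximate log P(continuation | context)."""
    rng = random.Random(hash((context, continuation)))
    return -rng.uniform(1.0, 10.0)  # placeholder score

def score_item(question: str, choices: list[str]) -> int:
    """Pick the choice the model considers most probable given the question."""
    prompt = f"Question: {question}\nAnswer:"
    scores = [loglikelihood(prompt, " " + choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def accuracy(items: list[dict]) -> float:
    hits = sum(score_item(i["question"], i["choices"]) == i["answer_index"]
               for i in items)
    return hits / len(items)

if __name__ == "__main__":
    toy_items = [{"question": "Which planet is known as the Red Planet?",
                  "choices": ["Venus", "Mars", "Jupiter", "Mercury"],
                  "answer_index": 1}]
    print(f"accuracy = {accuracy(toy_items):.2%}")
```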
Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks
Evaluating Large Language Models on Legal Reasoning
LegalBench: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
Natural Language Processing (NLP) Evaluation
- Natural Language Processing (NLP) ... Generation ... Classification ... Understanding ... Translation ... Tools & Services
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set. (A short loading-and-scoring sketch follows this list.)
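As a concrete illustration, the sketch below loads one GLUE task (SST-2) and scores a trivial constant predictor. It assumes the Hugging Face `datasets` and `evaluate` packages and network access to download the data; a real evaluation would replace the placeholder predictions with model outputs.

```python
# Sketch: loading one GLUE task (SST-2) and scoring placeholder predictions.
# Requires the `datasets` and `evaluate` packages plus network access.

from datasets import load_dataset
import evaluate

def main() -> None:
    sst2 = load_dataset("glue", "sst2", split="validation")
    metric = evaluate.load("glue", "sst2")

    # Placeholder "model": always predicts the positive class.
    # A real run would call the model once per sentence in sst2["sentence"].
    predictions = [1 for _ in sst2]
    results = metric.compute(predictions=predictions,
                             references=sst2["label"])
    print(results)  # e.g. {'accuracy': ...}

if __name__ == "__main__":
    main()
```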
The Stanford Question Answering Dataset (SQuAD)
Machine Learning Evaluation
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat ... a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available SafetyGym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
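A minimal sketch of how Procgen's procedurally generated levels can be used to probe generalization, assuming the `procgen` package and the classic `gym` API it was released against (environment IDs, keyword arguments, and the step signature may differ across versions): train on a fixed pool of levels, then evaluate on levels outside that pool.

```python
# Sketch: separate train and test level pools in Procgen (assumes
# `pip install procgen` and the classic gym API; details vary by version).
import gym

# Train on a fixed set of 200 levels...
train_env = gym.make("procgen:procgen-coinrun-v0",
                     num_levels=200, start_level=0,
                     distribution_mode="easy")
# ...then evaluate on levels outside the training range (num_levels=0
# means an unlimited set of levels).
test_env = gym.make("procgen:procgen-coinrun-v0",
                    num_levels=0, start_level=200,
                    distribution_mode="easy")

obs = train_env.reset()
episode_reward = 0.0
for _ in range(1000):
    obs, reward, done, info = train_env.step(train_env.action_space.sample())
    episode_reward += reward
    if done:
        break
print("random-policy episode reward:", episode_reward)
```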
Human Evaluation
CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It's a security measure that helps protect users from spam and password decryption by verifying that a user is human and not a computer.
I'm not a robot
No CAPTCHA reCAPTCHA: Popularized by Google, this involves a simple checkbox labeled "I am not a robot." It works by analyzing user behavior, such as mouse movements, to determine if the user is human. If the test is inconclusive, a more traditional image-selection CAPTCHA is presented. When you click the checkbox, reCAPTCHA monitors (a toy illustration follows this list):
- Mouse movements: Human mouse movements tend to be unpredictable, while bots often exhibit linear or mechanical movements.
- Click timing: Humans have natural delays in their actions, while bots execute them at near-instantaneous speeds.
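Purely as an illustration of the idea (this is not how reCAPTCHA is actually implemented), the toy heuristic below flags a suspiciously straight cursor path; the threshold and the sample paths are made up.

```python
# Toy heuristic (illustrative only): perfectly straight, evenly spaced
# cursor paths look more bot-like than jittery human ones.
import math

def straightness(points: list[tuple[float, float]]) -> float:
    """Ratio of straight-line distance to total path length (1.0 = a line)."""
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    direct = math.dist(points[0], points[-1])
    return direct / path if path else 1.0

def looks_automated(points, threshold: float = 0.98) -> bool:
    """Flag paths that are almost perfectly straight (threshold is made up)."""
    return straightness(points) >= threshold

if __name__ == "__main__":
    bot_path = [(x, 2 * x) for x in range(20)]                       # perfect line
    human_path = [(x, 2 * x + (0.8 if x % 3 else -0.9)) for x in range(20)]
    print(looks_automated(bot_path), looks_automated(human_path))    # True False
```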
Improving AI while trying to outsmart it
The so-called "bot test" (CAPTCHAs where users identify objects in images or complete other seemingly trivial tasks) has a dual purpose. While it is meant to distinguish humans from bots, the data collected often helps train AI systems to improve at tasks like image recognition, text understanding, or problem-solving. At the same time, the effectiveness of CAPTCHAs is constantly being challenged by advances in artificial intelligence and machine learning. Recent research has demonstrated that advanced AI can effectively solve image-based CAPTCHAs, such as Google's reCAPTCHA v2, with a 100% success rate using YOLO models for image segmentation and classification. This highlights the need for CAPTCHA systems to evolve in response to AI advancements.
In a way, humans doing these tests are teaching the bots to get better at beating the tests themselves. It's a fascinating cycle of humans improving AI while trying to outsmart it! Irony at its finest.
Future Prospects and Innovations
The future of human evaluation CAPTCHA techniques is likely to be shaped by ongoing technological advancements and the need to balance security with user experience. Some promising developments include:
- Advanced AI and Machine Learning Techniques: As AI becomes more sophisticated in solving CAPTCHAs, new techniques are being developed to create CAPTCHA-resistant challenges that can adapt to evolving bot strategies.
- Invisible CAPTCHA: Google's reCAPTCHA v3 represents a significant innovation by eliminating visible challenges for users. Instead, it continuously monitors user behavior to assess the likelihood of a bot interaction, providing a score between 0 and 1 (see the verification sketch after this list).
- Cognitive Deep-Learning CAPTCHA: A 2023 study introduced a new CAPTCHA system that combines text-based, image-based, and cognitive CAPTCHA characteristics. This system employs adversarial examples and neural style transfer to enhance security, making it more resistant to automated attacks.
- Behavioral Analysis and Biometric Verification: Innovations are exploring the use of behavioral analysis to distinguish human actions from bot interactions without explicit challenges. Biometric identification is also being considered for seamless user authentication, leveraging unique user characteristics.
- AI-Powered Solutions: AI algorithms are being developed to create CAPTCHA-resistant challenges that can adapt to evolving bot strategies. This includes employing AI to design intelligent algorithms that better distinguish bot activity from human input.
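For the reCAPTCHA v3 flow noted above, the server-side check is a single call to Google's documented siteverify endpoint, which returns the 0-to-1 score. The sketch below assumes the `requests` package; the secret key and client token are placeholders you must supply, and the acceptance threshold is an application-level choice.

```python
# Sketch: server-side check of a reCAPTCHA v3 token via Google's documented
# siteverify API. Secret key and client token are placeholders.
import requests

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def recaptcha_v3_score(secret_key: str, client_token: str) -> float:
    """Return the 0.0-1.0 'humanness' score for a submitted token."""
    resp = requests.post(VERIFY_URL,
                         data={"secret": secret_key, "response": client_token},
                         timeout=10)
    payload = resp.json()
    if not payload.get("success"):
        return 0.0  # invalid or expired token
    return float(payload.get("score", 0.0))

# Typical policy (threshold is an application choice):
# if recaptcha_v3_score(SECRET_KEY, token) < 0.5:
#     reject_request()
```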
Evaluating Machine Learning (ML) Hardware, Software, and Services
- Artificial Intelligence (AI) ... Machine Learning (ML) ... Deep Learning ... Neural Network ... Reinforcement ... Learning Techniques
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
- MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch
Backtracking
Backtracking is a general algorithmic technique for systematically searching the space of possible solutions to a computational problem. It incrementally builds candidate solutions and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot be completed to a valid solution. In machine learning and AI, backtracking can be used to solve constraint satisfaction problems, such as crosswords, verbal arithmetic, Sudoku, and many other puzzles.
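As a minimal illustration of the technique (using the classic N-Queens puzzle rather than a machine learning task), the sketch below extends a partial placement one row at a time and abandons it as soon as a conflict appears.

```python
# Minimal backtracking sketch: the N-Queens puzzle. Partial placements are
# extended one row at a time and abandoned ("backtracked") as soon as a
# queen conflicts with an earlier one.

def solve_n_queens(n, cols=None):
    """Return one solution as a list where index = row and value = column."""
    cols = cols or []
    row = len(cols)
    if row == n:
        return cols
    for col in range(n):
        # Safe if no earlier queen shares this column or a diagonal.
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(cols)):
            result = solve_n_queens(n, cols + [col])
            if result is not None:
                return result   # first complete solution found
    return None                 # dead end: caller backtracks

if __name__ == "__main__":
    print(solve_n_queens(8))    # e.g. [0, 4, 7, 5, 2, 6, 1, 3]
```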
American Productivity & Quality Center (APQC)
- American Productivity & Quality Center (APQC) ... the world's foremost authority in benchmarking, best practices, process and performance improvement, and knowledge management.
APQC provides the information, data, and insights organizations need to work smarter, faster, and with greater confidence. A non-profit organization, we provide independent, unbiased, and validated research and data to our more than 1,000 organizational members in 45 industries worldwide. Our members have exclusive access to the world’s largest set of benchmark data, with more than 4,000,000 data points.