- Evaluation Measures - Classification Performance - Accuracy, Precision & Recall (Sensitivity), and Specificity
- Case Studies
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of: a benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty; a diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language; and a public leaderboard for tracking performance on the benchmark, with a dashboard for visualizing the performance of models on the diagnostic set.
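A leaderboard over several tasks needs one number per system. The sketch below shows one way such a number could be formed, assuming an unweighted macro-average: metrics are averaged within each task first, then across tasks. The function name `overall_score`, the task names, and all scores are placeholders invented for illustration, not real leaderboard results.

```python
from statistics import mean


def overall_score(task_scores):
    """Average the metrics within each task, then average across tasks."""
    per_task = {
        task: mean(metrics.values()) for task, metrics in task_scores.items()
    }
    return mean(per_task.values()), per_task


# Placeholder numbers for illustration only (assumption, not real results).
example_scores = {
    "task_a": {"accuracy": 80.0},
    "task_b": {"accuracy": 90.0, "f1": 70.0},
    "task_c": {"matthews_corr": 50.0},
}

score, per_task = overall_score(example_scores)
print(per_task)
print(score)
```

Averaging within a task first keeps a task that reports two metrics from counting twice in the final score.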
The Stanford Question Answering Dataset (SQuAD)
ReAding Comprehension Dataset From Examinations (RACE)
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat: Procgen Benchmark is a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds on OpenAI’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
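The idea of building separate training and test levels by procedural generation can be sketched with a toy generator. This is not Procgen's or CoinRun's actual API: `generate_level`, `make_level_sets`, `generalization_gap`, and the tile alphabet are hypothetical names chosen for illustration. The only assumption is that a level is fully determined by its seed, so disjoint seed ranges give held-out test levels.

```python
import random


def generate_level(seed, width=8):
    """Toy generator (hypothetical): the same seed always yields the same layout."""
    rng = random.Random(seed)
    return tuple(rng.choice(".#o") for _ in range(width))


def make_level_sets(num_train, num_test, start_seed=0):
    """Draw training and test levels from disjoint seed ranges."""
    train_seeds = list(range(start_seed, start_seed + num_train))
    test_seeds = list(
        range(start_seed + num_train, start_seed + num_train + num_test)
    )
    train_levels = [generate_level(s) for s in train_seeds]
    test_levels = [generate_level(s) for s in test_seeds]
    return train_seeds, test_seeds, train_levels, test_levels


def generalization_gap(train_return, test_return):
    """A large positive gap suggests the agent overfit the training levels."""
    return train_return - test_return


train_seeds, test_seeds, train_levels, test_levels = make_level_sets(200, 50)
print(len(train_levels), len(test_levels))
```

An agent would be trained only on `train_levels` and scored on both sets; comparing the two average returns gives a rough signal of how well its skills transfer to unseen levels.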
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for algorithms for reinforcement learning (which involves training machines to do things based on trial and error). More recently, it made available Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.