Benchmarks
Revision as of 17:40, 27 September 2020
- Case Studies
- AI Governance / Algorithm Administration
- Visualization
- Hyperparameters
- Evaluation
- Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems on tasks such as picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set
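GLUE's headline leaderboard number is an aggregate over the nine per-task metrics. A minimal sketch of that aggregation as a macro-average; the task names are the real GLUE tasks, but the scores below are made-up illustrative values, and the official leaderboard additionally averages two metrics within some tasks:

```python
# Sketch of a GLUE-style aggregate score: the leaderboard's headline
# number is (to a first approximation) a macro-average over per-task
# metrics. Task names are real GLUE tasks; the scores are illustrative.

def glue_aggregate(task_scores):
    """Macro-average the per-task scores (each already scaled to 0-100)."""
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "CoLA": 60.5,   # Matthews correlation x 100 (illustrative value)
    "SST-2": 94.9,  # accuracy
    "MRPC": 89.3,   # F1
    "STS-B": 87.6,  # Pearson/Spearman correlation
    "QQP": 72.1,    # F1
    "MNLI": 86.7,   # accuracy
    "QNLI": 92.7,   # accuracy
    "RTE": 70.1,    # accuracy
    "WNLI": 65.1,   # accuracy
}

print(round(glue_aggregate(scores), 2))  # -> 79.89
```

Because the aggregate is an unweighted average, small tasks like WNLI move the headline score as much as large ones like MNLI, which is why per-task scores are also reported.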
The Stanford Question Answering Dataset (SQuAD)
The Stanford Question Answering Dataset (SQuAD) is a reading-comprehension benchmark of 100,000+ questions posed by crowdworkers on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. Systems are scored by exact match and token-level F1 against the human-provided answers.
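SQuAD systems are conventionally scored by exact match and token-level F1 after light answer normalization. A self-contained sketch of those two metrics; the normalization here is simplified relative to the official evaluation script:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation (simplified SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1-or-0 credit: normalized prediction equals normalized answer."""
    return normalize(pred) == normalize(gold)

def token_f1(pred, gold):
    """Partial credit: F1 over the overlapping bag of tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))          # True
print(round(token_f1("in Paris, France", "Paris"), 2))          # 0.5
```

F1 gives partial credit when a predicted span overlaps the gold span, which is why SQuAD leaderboards report it alongside the stricter exact-match score.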
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
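MLPerf's inference benchmarks boil down to repeatedly timing a workload and summarizing the latency distribution and throughput. A minimal, self-contained sketch of that measurement pattern (the `benchmark` helper and the stand-in workload are illustrative, not MLPerf's actual harness):

```python
import statistics
import time

def benchmark(fn, warmup=10, iters=100):
    """Time a callable the way inference benchmarks do: warm up first,
    then collect per-call latencies and summarize them."""
    for _ in range(warmup):          # warm caches / JITs before measuring
        fn()
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies) * 1e3,
        "p99_ms": latencies[int(0.99 * (iters - 1))] * 1e3,  # tail latency
        "throughput_qps": iters / sum(latencies),
    }

# Stand-in "model": summing a range (illustrative workload, not a real model).
result = benchmark(lambda: sum(range(10_000)))
print(sorted(result))
```

Reporting a tail percentile (p99) alongside the mean matters because serving systems are judged by their worst-case latency, not their average.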
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat
Procgen is a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop OpenAI’s earlier CoinRun toolset, which used procedural generation to construct disjoint sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that drops agents into the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which train machines by trial and error). More recently, it released Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which they avoid mistakes while learning.
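The core idea Procgen formalizes is a train/test split over procedurally generated levels keyed by seed: an agent that merely memorizes its training levels scores well on seen seeds and collapses on unseen ones. A self-contained toy illustrating that generalization gap; this does not use the actual procgen library, and the "level" and "agent" here are deliberately trivial:

```python
import random

def make_level(seed):
    """Procedurally 'generate' a toy level: a hidden target position
    derived deterministically from the seed."""
    return random.Random(seed).randint(0, 9)

train_seeds = range(0, 100)    # levels seen during training
test_seeds = range(100, 200)   # held-out levels, never seen

# A "memorizing agent": a lookup table built only from training levels.
memory = {seed: make_level(seed) for seed in train_seeds}

def memorizer(seed):
    return memory.get(seed, 0)  # fixed fallback guess on unseen levels

def score(agent, seeds):
    """Fraction of levels the agent solves."""
    return sum(agent(s) == make_level(s) for s in seeds) / len(seeds)

train_score = score(memorizer, train_seeds)  # 1.0 by construction
test_score = score(memorizer, test_seeds)    # near chance (~0.1)
print(train_score, test_score)
```

The gap between the two scores is exactly the overfitting Procgen is designed to expose; evaluating only on training levels would report a perfect agent.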