Benchmarks
Revision as of 20:22, 28 January 2023
[https://www.youtube.com/results?search_query=~Benchmark+Benchmarking+machine+learning+Model YouTube search...]
[https://www.google.com/search?q=~Benchmark+Benchmarking+machine+learning+Model ...Google search]
* [[Case Studies]]
* [[AI Governance / Algorithm Administration]]
* [[Visualization]]
* [[Hyperparameters]]
* [[Evaluation]]
*** [[Evaluation - Measures#Specificity|Specificity]]
* [[Train, Validate, and Test]]
* [https://www.aitrends.com/ai-insider/machine-learning-benchmarks-and-ai-self-driving-cars/ Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends]
* [https://towardsdatascience.com/benchmarking-simple-machine-learning-models-with-feature-extraction-against-modern-black-box-80af734b31cc Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science]
* [https://dawn.cs.stanford.edu//benchmark/index.html DAWNBench | Stanford] - an End-to-End Deep Learning Benchmark and Competition
* [https://www.datasciencecentral.com/group/resources/forum/topics/benchmarking-20-machine-learning-models-accuracy-and-speed Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central]
* [https://www.sciencedirect.com/science/article/pii/S1532046418300716 Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu]
* [https://spectrum.ieee.org/ai-supercomputer#toggle-gdpr Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum]

<img src="https://www.researchgate.net/profile/Benoit_Gallix/publication/324457640/figure/fig1/AS:622298201595905@1525378861825/Graph-illustrating-the-impact-of-data-available-on-performance-of-traditional-machine.png" width="500" height="400">
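Comparisons like the "accuracy and speed" article above reduce each model to two numbers: how often it is right, and how long it takes. A minimal, library-free sketch of that measurement (the threshold "model" and the tiny dataset here are invented for illustration):

```python
import time

def accuracy(y_true, y_pred):
    # fraction of labels the model got right
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def benchmark(model_fn, inputs, labels):
    # score every input once while timing the pass: returns (accuracy, seconds)
    start = time.perf_counter()
    preds = [model_fn(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return accuracy(labels, preds), elapsed

# toy threshold "model" standing in for a trained classifier
threshold_model = lambda x: 1 if x > 0.5 else 0
features = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]
acc, seconds = benchmark(threshold_model, features, labels)
```

Real benchmarking suites repeat the timed pass many times and report distributions rather than a single measurement, but the shape of the loop is the same.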
== <span id="GLUE"></span>General Language Understanding Evaluation (GLUE) ==
* [https://gluebenchmark.com/ General Language Understanding Evaluation (GLUE)]
* [[Natural Language Processing (NLP)]]
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems: tasks such as picking out the names of people and organizations in a sentence, or figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:
* A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets, selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
* A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
* A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set
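Each GLUE task is scored with its own metric; the CoLA task, for example, uses the Matthews correlation coefficient, which stays informative even when the label classes are unbalanced. A plain-Python sketch of that metric for binary labels (toy inputs, not the official GLUE evaluation code):

```python
import math

def matthews_corrcoef(y_true, y_pred):
    # Matthews correlation coefficient for binary labels:
    # +1 = perfect, 0 = no better than chance, -1 = perfectly wrong
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # degenerate case (e.g. the model predicts a single class): define as 0
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```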
== <span id="SQuAD"></span>The Stanford Question Answering Dataset (SQuAD) ==
* [https://rajpurkar.github.io/SQuAD-explorer/ The Stanford Question Answering Dataset (SQuAD)]
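SQuAD scores a system by comparing its predicted answer span against the reference answer using exact match and a token-overlap F1. A simplified sketch of the F1 part (the official evaluation script also lowercases and strips punctuation and articles before comparing, which is omitted here):

```python
def squad_f1(prediction, ground_truth):
    # token-overlap F1 between a predicted answer and the reference answer
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # count overlapping tokens, consuming each reference token at most once
    common = 0
    gt_remaining = list(gt_tokens)
    for tok in pred_tokens:
        if tok in gt_remaining:
            gt_remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "the Broncos" against the reference "Denver Broncos" shares one of two tokens on each side, giving precision = recall = 0.5 and F1 = 0.5.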
== <span id="MLPerf"></span>MLPerf ==
* [https://mlperf.org/ MLPerf] benchmarks for measuring training and inference performance of ML hardware, software, and services.
* [https://techcrunch.com/2020/12/03/mlcommons-debuts-first-public-database-for-ai-researchers-with-86000-hours-of-speech/ MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch]
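MLPerf's inference benchmarks report latency statistics over many repeated queries rather than a single timing. A stdlib-only sketch in that spirit (a simplified illustration, not the official MLPerf LoadGen harness; the arithmetic "model" is a stand-in for a forward pass):

```python
import time
import statistics

def measure_latency(fn, runs=50):
    # time repeated calls and report (median, p95) latency in milliseconds
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * len(samples)) - 1]

# toy workload: a fixed amount of arithmetic standing in for model inference
median_ms, p95_ms = measure_latency(lambda: sum(i * i for i in range(10_000)))
```

Reporting a tail percentile alongside the median matters because a system can have a fast typical case yet unacceptable worst-case latency.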
== Procgen ==
* [[OpenAI]]
* [https://venturebeat.com/2019/12/03/openais-procgen-benchmark-overfitting/ OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat] - a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.

https://venturebeat.com/wp-content/uploads/2019/12/ezgif-4-3630016ea205.gif
[[OpenAI]] previously released [https://venturebeat.com/2019/03/04/openai-launches-neural-mmo-a-massive-reinforcement-learning-simulator/ Neural MMO], a “massively multiagent” virtual training ground that drops agents into the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which learn through trial and error). More recently, it made available [https://venturebeat.com/2019/11/21/openai-safety-gym/ SafetyGym], a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which they avoid mistakes while learning.
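The core idea behind CoinRun and Procgen, evaluating an agent on procedurally generated levels it never trained on, can be sketched with seeded generation (a hypothetical toy, not the real Procgen API: the "level" here is just a deterministic tile string derived from a seed):

```python
import random

def make_level(seed):
    # hypothetical stand-in for a procedurally generated level: an 8-tile
    # layout drawn deterministically from the seed, as Procgen-style
    # generators do on a much larger scale
    rng = random.Random(seed)
    return "".join(rng.choice("._#") for _ in range(8))

# disjoint seed ranges yield separate train and test level sets, so an
# agent that scores well on the test levels must have learned
# generalizable skills rather than memorizing its training levels
train_levels = [make_level(s) for s in range(0, 200)]
test_levels = [make_level(s) for s in range(200, 250)]
```

Seeding makes every level reproducible, which is what lets a benchmark publish a fixed held-out test distribution while still generating levels on the fly.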