Difference between revisions of "Benchmarks"

 
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
}}
[https://www.youtube.com/results?search_query=~Benchmark+Benchmarking+machine+learning+Model YouTube search...]
[https://www.google.com/search?q=~Benchmark+Benchmarking+machine+learning+Model ...Google search]
  
 
* [[Case Studies]]
 
*** [[Evaluation - Measures#Specificity|Specificity]]
* [[Train, Validate, and Test]]
* [https://www.aitrends.com/ai-insider/machine-learning-benchmarks-and-ai-self-driving-cars/ Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends]
* [https://towardsdatascience.com/benchmarking-simple-machine-learning-models-with-feature-extraction-against-modern-black-box-80af734b31cc Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science]
* [https://dawn.cs.stanford.edu//benchmark/index.html DAWNBench | Stanford] - an End-to-End Deep Learning Benchmark and Competition
* [https://www.datasciencecentral.com/group/resources/forum/topics/benchmarking-20-machine-learning-models-accuracy-and-speed Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central]
* [https://www.sciencedirect.com/science/article/pii/S1532046418300716 Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu]
* [https://spectrum.ieee.org/ai-supercomputer#toggle-gdpr Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum]
  
  
<img src="https://www.researchgate.net/profile/Benoit_Gallix/publication/324457640/figure/fig1/AS:622298201595905@1525378861825/Graph-illustrating-the-impact-of-data-available-on-performance-of-traditional-machine.png" width="500" height="400">
  
 
== <span id="GLUE"></span>General Language Understanding Evaluation (GLUE) ==
* [https://gluebenchmark.com/ General Language Understanding Evaluation (GLUE)]
* [[Natural Language Processing (NLP)]]
  
 
== <span id="SQuAD"></span>The Stanford Question Answering Dataset (SQuAD) ==
* [https://rajpurkar.github.io/SQuAD-explorer/ The Stanford Question Answering Dataset (SQuAD)]

{|<!-- T -->
  
 
== <span id="MLPerf"></span>MLPerf ==
* [https://mlperf.org/ MLPerf] benchmarks for measuring training and inference performance of ML hardware, software, and services.
* [https://techcrunch.com/2020/12/03/mlcommons-debuts-first-public-database-for-ai-researchers-with-86000-hours-of-speech/ MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch]

{|<!-- T -->
 
== Procgen ==
* [[OpenAI]]
* [https://venturebeat.com/2019/12/03/openais-procgen-benchmark-overfitting/ OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat] - a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop the startup’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.

https://venturebeat.com/wp-content/uploads/2019/12/ezgif-4-3630016ea205.gif

[[OpenAI]] previously released [https://venturebeat.com/2019/03/04/openai-launches-neural-mmo-a-massive-reinforcement-learning-simulator/ Neural MMO], a “massively multiagent” virtual training ground that places agents in the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which are trained by trial and error). More recently, it made available [https://venturebeat.com/2019/11/21/openai-safety-gym/ Safety Gym], a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which they avoid mistakes while learning.
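A Procgen-style evaluation boils down to comparing an agent's score on the levels it trained on against its score on unseen, procedurally generated levels. A minimal sketch of that comparison (the helper name is hypothetical, not part of the Procgen API):

```python
def generalization_gap(train_returns, test_returns):
    """Mean episode return on seen (training) levels minus mean return on
    unseen (test) levels. A large positive gap indicates the agent has
    overfit to its training levels rather than learned generalizable skills."""
    mean_train = sum(train_returns) / len(train_returns)
    mean_test = sum(test_returns) / len(test_returns)
    return mean_train - mean_test
```

An agent averaging a return of 9 on training levels but only 5 on held-out levels has a gap of 4; procedural generation is what makes such held-out levels cheap to produce in quantity.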

Revision as of 20:22, 28 January 2023




Lecture 13 – Evaluation Metrics | Stanford CS224U: Natural Language Understanding | Spring 2019
Professor Christopher Potts, Professor of Linguistics and, by courtesy, of Computer Science; Director, Stanford Center for the Study of Language and Information. With Consulting Assistant Professor Bill MacCartney, Senior Engineering Manager at Apple.

Kaggle Reading Group : An Open Source AutoML Benchmark | Kaggle
This week we're starting a new paper: An Open Source AutoML Benchmark by Gijsbers et al., from the 2019 ICML Workshop on Automated Machine Learning.

Machine Learning Model Evaluation Metrics
Maria Khalusova | Developer Advocate at JetBrains. Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you'll ultimately use. Those coming to ML from software development are often self-taught, and in practice exercises and competitions the evaluation metric is generally dictated for you. In a real-world scenario, how do you choose an appropriate metric? This talk explores the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision.
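To make the subject concrete, the standard binary classification metrics (including the Specificity measure linked above) can all be derived from the four confusion-matrix counts. A generic sketch, not code from the talk:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Derive the common evaluation measures from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # a.k.a. sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}
```

For y_true = [1, 1, 0, 0, 1] and y_pred = [1, 0, 0, 0, 1] this gives accuracy 0.8, precision 1.0, recall 2/3, specificity 1.0, and F1 0.8.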

Characterization and Benchmarking of Deep Learning
In this video from the HPC User Forum in Milwaukee, Natalia Vassilieva from HP Labs presents: Characterization and Benchmarking of Deep Learning.

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems, on tasks like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of: (1) a benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty; (2) a diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language; and (3) a public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
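The single headline GLUE score on the leaderboard is an unweighted average over the per-task scores, with a task's paired metrics (e.g. F1 and accuracy) averaged first. A minimal sketch of that aggregation (the function name is hypothetical, not the official scorer):

```python
def glue_overall_score(task_scores):
    """task_scores maps task name -> a single score, or a tuple of paired
    metric scores, each on a 0-100 scale. Returns the unweighted
    macro-average across tasks."""
    per_task = []
    for score in task_scores.values():
        if isinstance(score, tuple):
            score = sum(score) / len(score)  # average a task's paired metrics
        per_task.append(score)
    return sum(per_task) / len(per_task)
```

For example, a system scoring 60.0 on CoLA, (88.0, 84.0) on MRPC, and 94.0 on SST-2 would average (60 + 86 + 94) / 3 = 80.0 over those three tasks.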

State of the Art in Natural Language Processing (NLP)
Jeff Heaton. Algorithms such as BERT, T4, ERNIE, and others claim to be the state of the art for NLP. But what does this mean, and how is it evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I’ll explain how BERT is used to perform question answering, specifically how it’s applied to SQuAD v1.1 (the Stanford Question Answering Dataset). I’ll also walk through the following notebook, where we’ll take a model that’s already been fine-tuned on SQuAD and apply it to our own questions and text.
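SQuAD v1.1 itself scores predictions with two metrics: exact match (EM) and token-level F1 against the gold answer span. A simplified sketch of both; the official evaluation script additionally strips articles and punctuation before comparing:

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit where EM gives none: predicting "Denver Broncos" against the gold answer "the Denver Broncos" scores EM 0.0 but F1 0.8.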

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what’s missing from the benchmark, and which ones are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske. This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track (AIML-301-1: Using AI/ML for Flash Performance Scaling, Part 1). In it I discuss benchmark results I've collected over the past year at Micron from running the MLPerf benchmark suite.
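At their core, inference benchmarks such as MLPerf report latency and throughput for a system under test. A toy timing harness in the same spirit (hypothetical, not part of MLPerf, which also fixes accuracy targets, scenarios, and run rules):

```python
import time

def benchmark(fn, *args, warmup=3, iterations=20):
    """Time repeated calls to fn(*args), returning mean latency (seconds
    per call) and throughput (calls per second)."""
    for _ in range(warmup):  # let caches and lazy initialization settle
        fn(*args)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(*args)
    elapsed = time.perf_counter() - start
    return {"mean_latency_s": elapsed / iterations,
            "throughput_per_s": iterations / elapsed}
```

Calling benchmark(model_predict, batch) on a hypothetical inference function would report its mean seconds per call and calls per second; real suites additionally pin the input distribution and reject runs that miss the accuracy target.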
