Benchmarks
Revision as of 17:40, 27 September 2020
- Case Studies
- AI Governance / Algorithm Administration
- Visualization
- Hyperparameters
- Evaluation
- Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems on tasks such as picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:

- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set
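GLUE's headline leaderboard number is an aggregate over the nine per-task metrics. A minimal sketch of that aggregation as a macro-average; the task names are the real GLUE tasks, but the scores below are made-up illustrative values, and the official leaderboard additionally averages two metrics within some tasks:

```python
# Sketch of a GLUE-style aggregate score: the leaderboard's headline
# number is (to a first approximation) a macro-average over per-task
# metrics. Task names are real GLUE tasks; the scores are illustrative.

def glue_aggregate(task_scores):
    """Macro-average the per-task scores (each already scaled to 0-100)."""
    return sum(task_scores.values()) / len(task_scores)

scores = {
    "CoLA": 60.5,   # Matthews correlation x 100 (illustrative value)
    "SST-2": 94.9,  # accuracy
    "MRPC": 89.3,   # F1
    "STS-B": 87.6,  # Pearson/Spearman correlation
    "QQP": 72.1,    # F1
    "MNLI": 86.7,   # accuracy
    "QNLI": 92.7,   # accuracy
    "RTE": 70.1,    # accuracy
    "WNLI": 65.1,   # accuracy
}

print(round(glue_aggregate(scores), 2))  # -> 79.89
```

Because the aggregate is an unweighted average, small tasks like WNLI move the headline score as much as large ones like MNLI, which is why per-task scores are also reported.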
The Stanford Question Answering Dataset (SQuAD)
The Stanford Question Answering Dataset (SQuAD) is a reading-comprehension benchmark of 100,000+ questions posed by crowdworkers on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. Systems are scored by exact match and token-level F1 against the human-provided answers.
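SQuAD systems are conventionally scored by exact match and token-level F1 after light answer normalization. A self-contained sketch of those two metrics; the normalization here is simplified relative to the official evaluation script:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation (simplified SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1-or-0 credit: normalized prediction equals normalized answer."""
    return normalize(pred) == normalize(gold)

def token_f1(pred, gold):
    """Partial credit: F1 over the overlapping bag of tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))          # True
print(round(token_f1("in Paris, France", "Paris"), 2))          # 0.5
```

F1 gives partial credit when a predicted span overlaps the gold span, which is why SQuAD leaderboards report it alongside the stricter exact-match score.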
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
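MLPerf's inference benchmarks boil down to repeatedly timing a workload and summarizing the latency distribution and throughput. A minimal, self-contained sketch of that measurement pattern (the `benchmark` helper and the stand-in workload are illustrative, not MLPerf's actual harness):

```python
import statistics
import time

def benchmark(fn, warmup=10, iters=100):
    """Time a callable the way inference benchmarks do: warm up first,
    then collect per-call latencies and summarize them."""
    for _ in range(warmup):          # warm caches / JITs before measuring
        fn()
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies) * 1e3,
        "p99_ms": latencies[int(0.99 * (iters - 1))] * 1e3,  # tail latency
        "throughput_qps": iters / sum(latencies),
    }

# Stand-in "model": summing a range (illustrative workload, not a real model).
result = benchmark(lambda: sum(range(10_000)))
print(sorted(result))
```

Reporting a tail percentile (p99) alongside the mean matters because serving systems are judged by their worst-case latency, not their average.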
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat
Procgen is a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds atop OpenAI’s earlier CoinRun toolset, which used procedural generation to construct disjoint sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that drops agents into the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which train machines by trial and error). More recently, it released Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which they avoid mistakes while learning.
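The core idea Procgen formalizes is a train/test split over procedurally generated levels keyed by seed: an agent that merely memorizes its training levels scores well on seen seeds and collapses on unseen ones. A self-contained toy illustrating that generalization gap; this does not use the actual procgen library, and the "level" and "agent" here are deliberately trivial:

```python
import random

def make_level(seed):
    """Procedurally 'generate' a toy level: a hidden target position
    derived deterministically from the seed."""
    return random.Random(seed).randint(0, 9)

train_seeds = range(0, 100)    # levels seen during training
test_seeds = range(100, 200)   # held-out levels, never seen

# A "memorizing agent": a lookup table built only from training levels.
memory = {seed: make_level(seed) for seed in train_seeds}

def memorizer(seed):
    return memory.get(seed, 0)  # fixed fallback guess on unseen levels

def score(agent, seeds):
    """Fraction of levels the agent solves."""
    return sum(agent(s) == make_level(s) for s in seeds) / len(seeds)

train_score = score(memorizer, train_seeds)  # 1.0 by construction
test_score = score(memorizer, test_seeds)    # near chance (~0.1)
print(train_score, test_score)
```

The gap between the two scores is exactly the overfitting Procgen is designed to expose; evaluating only on training levels would report a perfect agent.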