Benchmarks





Lecture 13 – Evaluation Metrics | Stanford CS224U: Natural Language Understanding | Spring 2019
Professor Christopher Potts, Professor of Linguistics and, by courtesy, of Computer Science, and Director of the Stanford Center for the Study of Language and Information; with Bill MacCartney, Consulting Assistant Professor and Senior Engineering Manager at Apple.

Kaggle Reading Group : An Open Source AutoML Benchmark | Kaggle
This week we're starting a new paper: An Open Source AutoML Benchmark by Gijsbers et al. from the 2019 ICML Workshop on Automated Machine Learning.

Machine Learning Model Evaluation Metrics
Maria Khalusova, Developer Advocate at JetBrains. Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you'll ultimately use. Many practitioners coming to ML from software development are self-taught, and in practice exercises and competitions the evaluation metric is usually dictated for you. In a real-world scenario, how do you choose an appropriate metric? This talk explores the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision.
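The metrics discussed in this talk are implemented in common libraries; below is a minimal sketch, assuming scikit-learn (not something the talk itself prescribes), showing how a few classification and regression metrics behave on toy predictions.

  # Minimal sketch, assuming scikit-learn; toy labels/predictions for illustration only.
  from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                               mean_absolute_error, mean_squared_error)

  # Classification: true labels, hard predictions, and predicted probabilities
  y_true = [0, 0, 1, 1, 1, 0]
  y_pred = [0, 1, 1, 1, 0, 0]
  y_prob = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]

  print("accuracy:", accuracy_score(y_true, y_pred))  # fraction of correct predictions
  print("F1:", f1_score(y_true, y_pred))              # balances precision and recall
  print("ROC AUC:", roc_auc_score(y_true, y_prob))    # threshold-free ranking quality

  # Regression: different metrics penalize errors differently
  y_true_r = [3.0, 5.0, 2.5, 7.0]
  y_pred_r = [2.8, 5.4, 3.0, 6.1]
  print("MAE:", mean_absolute_error(y_true_r, y_pred_r))  # average absolute error
  print("MSE:", mean_squared_error(y_true_r, y_pred_r))   # squared, so large errors dominate

Accuracy can look fine on imbalanced data while F1 or ROC AUC reveal a problem, which is one reason the choice of metric matters.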

Characterization and Benchmarking of Deep Learning
In this video from the HPC User Forum in Milwaukee, Natalia Vassilieva from HP Labs presents: Characterization and Benchmarking of Deep Learning.

General Language Understanding Evaluation (GLUE)

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems, with tasks like picking out the names of people and organizations in a sentence and figuring out what a pronoun like "it" refers to when there are multiple potential antecedents. GLUE consists of:

- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty;
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language; and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
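The individual GLUE tasks can be inspected directly; the sketch below assumes the Hugging Face datasets library, which mirrors the benchmark data but is not part of GLUE itself.

  # Minimal sketch, assuming the Hugging Face `datasets` package (pip install datasets).
  from datasets import load_dataset

  # SST-2: single-sentence sentiment classification, one of the nine GLUE tasks
  sst2 = load_dataset("glue", "sst2")
  print(sst2["train"][0])        # a sentence, its label, and an index

  # MRPC: sentence-pair paraphrase detection
  mrpc = load_dataset("glue", "mrpc")
  print(mrpc["validation"][0])   # two sentences plus a paraphrase label

Each task ships with train/validation/test splits; the test labels are withheld, so official scores come from submitting predictions to the public leaderboard.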

State of the Art in Natural Language Processing (NLP)
Jeff Heaton. Algorithms such as BERT, T5, ERNIE, and others claim to be the state of the art for NLP. But what does this mean, and how is it evaluated? In this video I look at GLUE and other NLP benchmarks.

The Stanford Question Answering Dataset (SQuAD)

Applying BERT to Question Answering (SQuAD v1.1)
In this video I'll explain the details of how BERT is used to perform question answering, specifically how it's applied to SQuAD v1.1 (the Stanford Question Answering Dataset). I'll also walk through the accompanying notebook, where we take a model that has already been fine-tuned on SQuAD and apply it to our own questions and text.
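For a quick taste of the same idea, the hedged sketch below loads a SQuAD-fine-tuned checkpoint through the Hugging Face transformers pipeline; the checkpoint name is an assumption chosen for illustration, not the model used in the video.

  # Minimal sketch, assuming the Hugging Face `transformers` package and the
  # distilbert-base-cased-distilled-squad checkpoint (illustrative choice only).
  from transformers import pipeline

  qa = pipeline("question-answering",
                model="distilbert-base-cased-distilled-squad")

  context = ("The Stanford Question Answering Dataset (SQuAD) is a reading "
             "comprehension dataset consisting of questions posed by crowdworkers "
             "on a set of Wikipedia articles.")
  result = qa(question="Who wrote the questions in SQuAD?", context=context)
  print(result["answer"], result["score"])  # extracted answer span and its confidence

The model extracts a span from the supplied context rather than generating free text, which is exactly the SQuAD v1.1 task formulation.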

Question and Answering System for the SQuAD Dataset
CS224N default final project presentation

MLPerf

MLPerf: A Benchmark Suite for Machine Learning - Gu-Yeon Wei (Harvard University)
O'Reilly

MLPerf: A Benchmark Suite for Machine Learning - David Patterson (UC Berkeley)
O'Reilly

MLPerf Benchmarks
Geoff Tate, CEO of Flex Logix, talks about the new MLPerf benchmark, what's missing from it, and which of the benchmarks are relevant to edge inferencing.

Exploring the Impact of System Storage on AI & ML Workloads via MLPerf Benchmark Suite
Wes Vaske. This is the presentation I gave at Flash Memory Summit 2019 in the AI/ML track (session AIML-301-1: Using AI/ML for Flash Performance Scaling, Part 1). In it I discuss benchmark results I've collected over the past year at Micron from running the MLPerf benchmark suite.
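MLPerf's official harness (LoadGen) drives the benchmarks and produces the reported numbers; purely as an illustration of the kinds of metrics involved, the sketch below computes throughput and tail latency from per-query timings around a placeholder infer function. It is not the MLPerf tooling.

  # Rough illustration of throughput and tail-latency metrics; NOT the MLPerf
  # LoadGen harness. `infer` is a placeholder for the model call being measured.
  import time

  def infer(sample):
      time.sleep(0.002)  # stand-in for real inference work

  latencies = []
  start = time.perf_counter()
  for sample in range(500):
      t0 = time.perf_counter()
      infer(sample)
      latencies.append(time.perf_counter() - t0)
  total = time.perf_counter() - start

  latencies.sort()
  print(f"throughput: {len(latencies) / total:.1f} samples/s")
  print(f"p90 latency: {latencies[int(0.90 * len(latencies))] * 1e3:.2f} ms")
  print(f"p99 latency: {latencies[int(0.99 * len(latencies))] * 1e3:.2f} ms")

Storage effects like the ones discussed in the talk show up when the input pipeline, rather than the accelerator, becomes the bottleneck for this kind of throughput number.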

Procgen


OpenAI previously released Neural MMO, a "massively multiagent" virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which learn by trial and error). More recently, it made available Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the "safety" of algorithms and the extent to which those algorithms avoid mistakes while learning. Procgen Benchmark continues this line: it is a set of 16 procedurally generated, game-like environments designed to measure how quickly a reinforcement learning agent learns generalizable skills.
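The Procgen environments expose the standard Gym interface; the sketch below assumes the procgen and (classic, 4-tuple step) gym packages are installed, and uses a random policy just to show the interaction loop.

  # Minimal sketch, assuming `pip install procgen gym` and the classic Gym API.
  import gym

  # CoinRun is one of the 16 procedurally generated Procgen environments
  env = gym.make("procgen:procgen-coinrun-v0")
  obs = env.reset()

  for _ in range(100):
      action = env.action_space.sample()          # random policy, for illustration
      obs, reward, done, info = env.step(action)
      if done:
          obs = env.reset()                       # starts a new procedurally generated level
  env.close()

Because each reset produces a freshly generated level, the benchmark measures generalization rather than memorization of a fixed layout.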