Difference between revisions of "Benchmarks"
Revision as of 14:52, 17 March 2023
- Case Studies
- AI Governance / Algorithm Administration
- Visualization
- Hyperparameters
- Evaluation ... Prompts for assessing AI projects
- Train, Validate, and Test
- Machine Learning Benchmarks and AI Self-Driving Cars | Lance Eliot - AItrends
- Benchmarking simple models with feature extraction against modern black-box methods | Martin Dittgen - Towards Data Science
- DAWNBench | Stanford - an End-to-End Deep Learning Benchmark and Competition
- Benchmarking 20 Machine Learning Models Accuracy and Speed | Marc Borowczak - Data Science Central
- Benchmarking deep learning models on large healthcare datasets | S. Purushotham, C. Meng, Z. Che, and Y. Liu
- Supercomputers Flex Their AI Muscles: New benchmarks reveal science-task speedups | Samuel K. Moore - IEEE Spectrum
General Language Understanding Evaluation (GLUE)
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems... like picking out the names of people and organizations in a sentence and figuring out what a pronoun like “it” refers to when there are multiple potential antecedents. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set
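A leaderboard over several tasks needs one summary number per model. The sketch below shows one common way to get it, assuming an unweighted macro-average: metrics are averaged within each task first, then across tasks. The function name `benchmark_score`, the task names, and the scores are all hypothetical and are not real leaderboard values; the source does not state how the GLUE leaderboard aggregates results.

```python
from statistics import mean


def benchmark_score(task_scores: dict[str, list[float]]) -> float:
    """Macro-average: mean of metrics within each task, then mean across tasks.

    Assumption: every task counts equally, regardless of dataset size.
    """
    if not task_scores:
        raise ValueError("need at least one task")
    per_task = [mean(metrics) for metrics in task_scores.values()]
    return mean(per_task)


# Hypothetical per-task scores on a 0-100 scale (illustrative only).
scores = {
    "task_a": [80.0],        # single metric
    "task_b": [90.0, 70.0],  # two metrics, averaged to 80.0 first
    "task_c": [50.0],
}

overall = benchmark_score(scores)
print(f"overall score: {overall:.1f}")
```

Averaging within a task first keeps a task that reports two metrics from counting twice in the overall score.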
The Stanford Question Answering Dataset (SQuAD)
MLPerf
- MLPerf benchmarks for measuring training and inference performance of ML hardware, software, and services.
- MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch
Procgen
- OpenAI
- OpenAI’s Procgen Benchmark prevents AI model overfitting | Kyle Wiggers - VentureBeat ... Procgen Benchmark is a set of 16 procedurally generated environments that measure how quickly a model learns generalizable skills. It builds on OpenAI’s CoinRun toolset, which used procedural generation to construct sets of training and test levels.
OpenAI previously released Neural MMO, a “massively multiagent” virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (reinforcement learning involves training machines to do things based on trial and error). More recently, it made available Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the “safety” of algorithms and the extent to which those algorithms avoid mistakes while learning.
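The idea of building separate training and test levels by procedural generation, as described for Procgen and CoinRun above, can be sketched in a few lines. This is a minimal illustration and not the Procgen API: `make_level`, `split_level_seeds`, and `generalization_gap` are hypothetical helpers, and the grid "level" is a stand-in for a real game environment. The assumption is that each level is fully determined by an integer seed, so disjoint seed ranges give disjoint level sets.

```python
import random


def make_level(seed: int, size: int = 8) -> list[list[str]]:
    """Hypothetical generator: a size x size grid of walls '#' and floor '.'.

    The layout depends only on the seed, so the same seed always
    reproduces the same level.
    """
    rng = random.Random(seed)
    return [
        ["#" if rng.random() < 0.3 else "." for _ in range(size)]
        for _ in range(size)
    ]


def split_level_seeds(num_train: int, num_test: int, start: int = 0):
    """Return disjoint seed lists for training and held-out test levels."""
    train = list(range(start, start + num_train))
    test = list(range(start + num_train, start + num_train + num_test))
    return train, test


def generalization_gap(train_returns, test_returns) -> float:
    """Mean return on training levels minus mean return on unseen levels."""
    train_mean = sum(train_returns) / len(train_returns)
    test_mean = sum(test_returns) / len(test_returns)
    return train_mean - test_mean


train_seeds, test_seeds = split_level_seeds(num_train=200, num_test=50)
train_levels = [make_level(s) for s in train_seeds]
test_levels = [make_level(s) for s in test_seeds]
print(len(train_levels), "training levels,", len(test_levels), "test levels")
```

A large positive gap between returns on the training levels and on the held-out levels would suggest the agent has memorized specific levels instead of learning skills that transfer.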