Algorithm Administration

YouTube search... Quora search... ...Google search


Tools

  • Google AutoML automatically builds and deploys state-of-the-art machine learning models
  • SageMaker | Amazon
  • MLOps | Microsoft ...model management, deployment, and monitoring with Azure
  • Ludwig - a Python toolbox from Uber that allows you to train and test deep learning models
  • TPOT - a Python library that automatically creates and optimizes full machine learning pipelines using genetic programming. Not suited for NLP; strings need to be encoded as numerics.
  • H2O Driverless AI for automated visualization, feature engineering, model training, hyperparameter optimization, and explainability.
  • alteryx: Feature Labs, Featuretools
  • MLBox Fast reading and distributed data preprocessing/cleaning/formatting. Highly robust feature selection and leak detection. Accurate hyper-parameter optimization in high-dimensional space. State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…). Prediction with model interpretation. Primarily Linux.
  • auto-sklearn - automated algorithm selection and hyperparameter tuning; a Bayesian hyperparameter optimization layer on top of scikit-learn. It leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction. Not suited for large datasets.
  • Auto Keras is an open-source Python package for neural architecture search.
  • ATM (Auto Tune Models) - a multi-tenant, multi-data system for automated machine learning (model selection and tuning). ATM is an open source software library under the Human Data Interaction project (HDI) at MIT.
  • Auto-WEKA is a Bayesian hyperparameter optimization layer on top of Weka. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
  • TransmogrifAI - an AutoML library for building modular, reusable, strongly typed machine learning workflows. A Scala/SparkML library created by Salesforce for automated data cleansing, feature engineering, model selection, and hyperparameter optimization
  • RECIPE - a framework based on grammar-based genetic programming that builds customized scikit-learn classification pipelines.
  • AutoMLC Automated Multi-Label Classification. GA-Auto-MLC and Auto-MEKAGGP are freely-available methods that perform automated multi-label classification on the MEKA software.
  • Databricks MLflow ...an open source framework to manage the complete Machine Learning lifecycle, including experimentation, reproducibility, and deployment; available as Managed MLflow, an integrated service with the Databricks Unified Analytics Platform
  • SAS Viya automates the process of data cleansing, data transformations, feature engineering, algorithm matching, model training and ongoing governance.
  • Comet ML ...self-hosted and cloud-based meta machine learning platform allowing data scientists and teams to track, compare, explain and optimize experiments and models
  • Domino Model Monitor (DMM) | Domino ...monitor the performance of all models across your entire organization
  • Weights and Biases ...experiment tracking, model optimization, and dataset versioning
  • SigOpt ...optimization platform and API designed to unlock the potential of modeling pipelines. This fully agnostic software solution accelerates, amplifies, and scales the model development process
  • DVC ...Open-source Version Control System for Machine Learning Projects
  • ModelOp Center | ModelOp
  • Moogsoft and Red Hat Ansible Tower
  • DSS | Dataiku
  • Model Manager | SAS
  • Machine Learning Operations (MLOps) | DataRobot ...build highly accurate predictive models with full transparency
  • Metaflow, Netflix and AWS open source Python library

Master Data Management (MDM)

YouTube search... Quora search... ...Google search

Feature Store / Data Lineage / Data Catalog

How is AI changing the game for Master Data Management?
Tony Brownlee talks about the ability to inspect and find data quality issues as one of several ways cognitive computing technology is influencing master data management.

Introducing Roxie. Data Management Meets Artificial Intelligence.
Introducing Roxie, Rubrik's Intelligent Personal Assistant. A hackathon project by Manjunath Chinni. Created in 10 hours with the power of Rubrik APIs.

DAS Webinar: Master Data Management – Aligning Data, Process, and Governance
Getting MDM “right” requires a strategic mix of Data Architecture, business process, and Data Governance.

IBM MDM Feature Spotlight: Machine learning-assisted Data Stewardship
This three-minute overview shows the benefits of using machine learning models trained by a client's own data stewards to facilitate faster resolution of pending clerical tasks in IBM Master Data Management Standard Edition.

Better Machine Learning Outcomes rely on Modern Data Management
Tarun Batra, CEO, LumenData, talks about how the movement towards artificial intelligence and machine learning relies on a Modern Data Management platform that is able to correlate large amounts of data, and provide a reliable data foundation for machine learning algorithms to deliver better business outcomes. In this video, Tarun discusses: Key industry trends driving Modern Data Management, Data management best practices, Creating joint value for customers "There is a lot of movement towards artificial intelligence and machine learning as being the next big domain that organizations are focusing on. With data volumes continuing to increase, and the velocity of change of data, decisions have to be made in an automated, data-driven fashion for organizations to remain competitive. Machine learning can predict and recommend actions, but a reliable data foundation through MDM that continuously manages and ensures data quality is essential for machine learning algorithms to create accurate, meaningful insight." - Tarun Batra

How to manage Artificial Intelligence Data Collection [Enterprise AI Governance Data Management ]
Mind Data AI. AI researcher Brian Ka Chan's AI ML DL introduction series. Collecting data is an important step to the success of an Artificial Intelligence program in the 4th Industrial Revolution. In the current advancement of Artificial Intelligence technologies, machine learning has always been associated with AI, and in many cases Machine Learning is considered the equivalent of Artificial Intelligence. Machine learning is actually a subset of Artificial Intelligence; this discipline of machine learning relies on data to perform AI training, supervised or unsupervised. On average, 80% of the time that my team spent in AI or Data Science projects is about preparing data. Preparing data includes, but is not limited to: identifying the data required, identifying the availability of data and its location, profiling the data, sourcing the data, integrating the data, cleansing the data, and preparing the data for learning.

What is Data Governance?
Understand what problems a Data Governance program is intended to solve and why the Business Users must own it. Also learn some sample roles that each group might need to play.

Top 10 Mistakes in Data Management
Come learn about the mistakes we most often see organizations make in managing their data. Also learn more about Intricity's Data Management Health Check, which you can download here: http://www.intricity.com/intricity101/ www.intricity.com


Versioning

YouTube search... ...Google search

How to manage model and data versions
Raj Ramesh Managing data versions and model versions is critical in deploying machine learning models: if you want to re-create the models or go back to fix them, you will need both the data that went into training the model and the model hyperparameters themselves. In this video I explain that concept. Here's what I can do to help you: I speak on the topics of architecture and AI, help you integrate AI into your organization, educate your team on what AI can or cannot do, and make things simple enough that you can take action from your new knowledge. I work with your organization to understand the nuances and challenges that you face, and together we can understand, frame, analyze, and address challenges in a systematic way so that you see improvement in your overall business that is aligned with your strategy, and, most importantly, you and your organization can incrementally change to transform and thrive in the future. If any of this sounds like something you might need, please reach out to me at dr.raj.ramesh@topsigma.com, and we'll get back in touch within a day. Thanks for watching my videos and for subscribing. www.topsigma.com www.linkedin.com/in/rajramesh

Version Control for Data Science Explained in 5 Minutes (No Code!)
In this code-free, five-minute explainer for complete beginners, we'll teach you about Data Version Control (DVC), a tool for adapting Git version control to machine learning projects.

- Why data science and machine learning badly need tools for versioning
- Why Git version control alone will fall short
- How DVC helps you use Git with big datasets and models
- Cool features in DVC, like metrics, pipelines, and plots

Check out the DVC open source project on GitHub: http://github.com/iterative/dvc
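
DVC also exposes a small Python API for pulling a specific version of a tracked file out of a repository. Below is a minimal sketch of that idea; the repository URL, file path, and tag are hypothetical placeholders, not taken from the videos above.

```python
import pandas as pd
import dvc.api

# Read the dataset exactly as it existed at Git tag "v1.0" of the project,
# regardless of what the current workspace contains.
with dvc.api.open(
    "data/train.csv",                           # hypothetical DVC-tracked path
    repo="https://github.com/example/project",  # hypothetical Git repo
    rev="v1.0",                                 # any Git revision: tag, branch, or commit
) as f:
    train = pd.read_csv(f)

print(train.shape)
```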

How to easily set up and version your Machine Learning pipelines, using Data Version Control (DVC) and Machine Learning Versioning (MLV)-tools | PyData Amsterdam 2019
Stephanie Bracaloni, Sarah Diot-Girard Have you ever heard about Machine Learning versioning solutions? Have you ever tried one of them? And what about automation? Come with us and learn how to easily build versionable pipelines! This tutorial explains through small exercises how to set up a project using DVC and MLV-tools. www.pydata.org

Alessia Marcolini: Version Control for Data Science | PyData Berlin 2019
Track: PyData. Are you versioning your Machine Learning project as you would in a traditional software project? How are you keeping track of changes in your datasets? Recorded at the PyConDE & PyData Berlin 2019 conference. http://pycon.de

Introduction to Pachyderm
Joey Zwicker A high-level introduction to the core concepts and features of Pachyderm as well as a quick demo. Learn more at: pachyderm.io github.com/pachyderm/pachyderm docs.pachyderm.io

E05 Pioneering version control for data science with Pachyderm co-founder and CEO Joe Doliner
5 years ago, Joe Doliner and his co-founder Joey Zwicker decided to focus on the hard problems in data science, rather than building just another dashboard on top of the existing mess. It's been a long road, but it's really paid off. Last year, after an adventurous journey, they closed a $10m Series A led by Benchmark. In this episode, Erasmus Elsner is joined by Joe Doliner to explore what Pachyderm does and how it scaled from just an idea into a fast-growing tech company. Listen to the podcast version http://apple.co/2W2g0nV

Model Versioning - ModelDB

  • ModelDB: An open-source system for Machine Learning model versioning, metadata, and experiment management

Hyperparameter

YouTube search... ...Google search

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. Different model training algorithms require different hyperparameters; some simple algorithms (such as ordinary least squares regression) require none. Given these hyperparameters, the training algorithm learns the parameters from the data. Hyperparameter (machine learning) | Wikipedia

Machine learning algorithms train on data to find the best set of weights for each independent variable that affects the predicted value or class. The algorithms themselves have variables, called hyperparameters. They're called hyperparameters, as opposed to parameters, because they control the operation of the algorithm rather than the weights being determined. The most important hyperparameter is often the learning rate, which determines the step size used when finding the next set of weights to try when optimizing. If the learning rate is too high, the gradient descent may quickly converge on a plateau or suboptimal point. If the learning rate is too low, the gradient descent may stall and never completely converge. Many other common hyperparameters depend on the algorithms used. Most algorithms have stopping parameters, such as the maximum number of epochs, the maximum time to run, or the minimum improvement from epoch to epoch. Specific algorithms have hyperparameters that control the shape of their search. For example, a Random Forest (or Random Decision Forest) Classifier has hyperparameters for minimum samples per leaf, max depth, minimum samples at a split, minimum weight fraction for a leaf, and about 8 more. Machine learning algorithms explained | Martin Heller - InfoWorld
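
A small illustration of the distinction, sketched with scikit-learn's RandomForestClassifier on synthetic data (the values chosen are arbitrary): hyperparameters are fixed before training, while the model's parameters are learned during fit().

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hyperparameters: chosen before learning begins; they control the algorithm.
clf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=6,           # maximum tree depth
    min_samples_leaf=5,    # minimum samples per leaf
    min_samples_split=10,  # minimum samples at a split
    random_state=0,
)

# Parameters: the split thresholds and leaf values of each tree are learned
# from the training data when fit() runs.
clf.fit(X, y)
print(clf.estimators_[0].tree_.node_count)  # structure learned from the data
```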



Hyperparameter Tuning

Hyperparameters are the variables that govern the training process. Your model parameters are optimized (you could say "tuned") by the training process: you run data through the operations of the model, compare the resulting prediction with the actual value for each data instance, evaluate the accuracy, and adjust until you find the best combination to handle the problem.

These algorithms automatically adjust (learn) their internal parameters based on data. However, there is a subset of parameters that are not learned and have to be configured by an expert. Such parameters are often referred to as "hyperparameters", and they have a big impact. For example, the tree depth in a decision tree model and the number of layers in an artificial neural network are typical hyperparameters. The performance of a model can drastically depend on the choice of its hyperparameters. Machine learning algorithms and the art of hyperparameter selection - A review of four optimization strategies | Mischa Lisovyi and Rosaria Silipo - TNW

There are four commonly used optimization strategies for hyperparameters:

  1. Bayesian optimization
  2. Grid search
  3. Random search
  4. Hill climbing

Bayesian optimization tends to be the most efficient. You would think that tuning as many hyperparameters as possible would give you the best answer. However, unless you are running on your own personal hardware, that could be very expensive. There are diminishing returns, in any case. With experience, you’ll discover which hyperparameters matter the most for your data and choice of algorithms. Machine learning algorithms explained | Martin Heller - InfoWorld
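
As a rough sketch of two of the four strategies, scikit-learn ships grid search and random search out of the box (Bayesian optimization typically needs an extra library such as scikit-optimize); the model, parameter ranges, and budget below are illustrative only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(random_state=0)

# Grid search: enumerates every combination in the grid (3 x 3 = 9 candidates).
grid = GridSearchCV(
    model,
    param_grid={"max_depth": [4, 8, 16], "min_samples_leaf": [1, 5, 10]},
    cv=3,
    scoring="accuracy",
).fit(X, y)

# Random search: samples a fixed budget of combinations from the distributions,
# which usually finds comparable values at a fraction of the cost.
rand = RandomizedSearchCV(
    model,
    param_distributions={"max_depth": randint(2, 20), "min_samples_leaf": randint(1, 20)},
    n_iter=10,
    cv=3,
    scoring="accuracy",
    random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```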

Hyperparameter Optimization libraries:

Tuning (a minimal sketch combining several of these knobs follows the list):

  • Optimizer type
  • Learning rate (fixed or not)
  • Epochs
  • Regularization rate (or not)
  • Type of Regularization - L1, L2, ElasticNet
  • Search type for local minima
    • Gradient descent
    • Simulated Annealing
    • Evolutionary
  • Decay rate (or not)
  • Momentum (fixed or not)
  • Nesterov Accelerated Gradient momentum (or not)
  • Batch size
  • Fitness measurement type
  • Stop criteria
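
A minimal Keras sketch putting several of these knobs together: optimizer type, learning rate with decay, Nesterov momentum, L2 regularization, batch size, epochs, a fitness metric, and an early-stopping criterion. The architecture, data, and values are illustrative assumptions, not a recommendation.

```python
import numpy as np
import tensorflow as tf

# Toy data so the sketch runs end to end.
x_train = np.random.rand(256, 20).astype("float32")
y_train = (np.random.rand(256) > 0.5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 regularization rate
    ),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Optimizer type, learning rate (with a decay schedule), momentum, Nesterov.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9),
    momentum=0.9,
    nesterov=True,
)

model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=["accuracy"])  # fitness measurement type

# Epochs, batch size, and a stop criterion (early stopping on validation loss).
model.fit(x_train, y_train,
          epochs=50, batch_size=32, validation_split=0.2,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)],
          verbose=0)
```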


Automated Learning

YouTube search... ...Google search

Several production machine-learning platforms now offer automatic hyperparameter tuning. Essentially, you tell the system what hyperparameters you want to vary, and possibly what metric you want to optimize, and the system sweeps those hyperparameters across as many runs as you allow. (Google Cloud hyperparameter tuning extracts the appropriate metric from the TensorFlow model, so you don’t have to specify it.)

An emerging class of data science toolkit that is finally making machine learning accessible to business subject matter experts. We anticipate that these innovations will mark a new era in data-driven decision support, where business analysts will be able to access and deploy machine learning on their own to analyze hundreds and thousands of dimensions simultaneously. Business analysts at highly competitive organizations will shift from using visualization tools as their only means of analysis, to using them in concert with AML. Data visualization tools will also be used more frequently to communicate model results, and to build task-oriented user interfaces that enable stakeholders to make both operational and strategic decisions based on output of scoring engines. They will also continue to be a more effective means for analysts to perform inverse analysis when one is seeking to identify where relationships in the data do not exist. 'Five Essential Capabilities: Automated Machine Learning' | Gregory Bonnette

H2O Driverless AI automatically performs feature engineering and hyperparameter tuning, and claims to perform as well as Kaggle masters. AmazonML SageMaker supports hyperparameter optimization. Microsoft Azure Machine Learning AutoML automatically sweeps through features, algorithms, and hyperparameters for basic machine learning algorithms; a separate Azure Machine Learning hyperparameter tuning facility allows you to sweep specific hyperparameters for an existing experiment. Google Cloud AutoML implements automatic deep transfer learning (meaning that it starts from an existing Deep Neural Network (DNN) trained on other data) and neural architecture search (meaning that it finds the right combination of extra network layers) for language pair translation, natural language classification, and image classification. Review: Google Cloud AutoML is truly automated machine learning | Martin Heller
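
As one concrete flavor of this, below is a hedged sketch of what a SageMaker automatic-model-tuning job might look like with the SageMaker Python SDK. The role ARN, S3 paths, and parameter ranges are placeholders rather than a working account setup, and the final fit() call is left commented out because it would launch real (billable) training jobs.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

session = sagemaker.Session()

# A built-in XGBoost estimator; image version, instance type, role, and
# S3 output path below are placeholder assumptions.
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",                  # placeholder
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)

# Declare which hyperparameters to vary and which metric to optimize; the
# service sweeps them across up to max_jobs training runs.
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)

# tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/val"})
```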

Hyperparameter Tuning with Amazon SageMaker's Automatic Model Tuning - AWS Online Tech Talks
Learn how to use Automatic Model Tuning with Amazon SageMaker to get the best machine learning model for your dataset. Training machine learning models requires choosing seemingly arbitrary hyperparameters like learning rate and regularization to control the learning algorithm. Traditionally, finding the best values for the hyperparameters requires manual trial-and-error experimentation. Amazon SageMaker makes it easy to get the best possible outcomes for your machine learning models by providing an option to create hyperparameter tuning jobs. These jobs automatically search over ranges of hyperparameters to find the best values. Using sophisticated Bayesian optimization, a meta-model is built to accurately predict the quality of your trained model from the hyperparameters. Learning Objectives:

- Understand what hyperparameters are and what they do for training machine learning models
- Learn how to use Automatic Model Tuning with Amazon SageMaker for creating hyperparameter tuning of your training jobs
- Strategies for choosing and iterating on tuning ranges of a hyperparameter tuning job with Amazon SageMaker 

Automatic Hyperparameter Optimization in Keras for the MediaEval 2018 Medico Multimedia Task
Rune Johan Borgli, Pål Halvorsen, Michael Riegler, Håkon Kvale Stensland, Automatic Hyperparameter Optimization in Keras for the MediaEval 2018 Medico Multimedia Task. Proc. of MediaEval 2018, 29-31 October 2018, Sophia Antipolis, France. Abstract: This paper details the approach to the MediaEval 2018 Medico Multimedia Task made by the Rune team. The decided-upon approach uses a work-in-progress hyperparameter optimization system called Saga. Saga is a system for finding the best hyperparameters in Keras, a popular machine learning framework, using Bayesian optimization and transfer learning. In addition to optimizing the Keras classifier configuration, we try manipulating the dataset by adding extra images to a class lacking in images and splitting a commonly misclassified class into two classes. Presented by Rune Johan Borgli


AutoML

YouTube search... ...Google search


AutoML is a cloud software suite of machine learning tools from Google. It's based on Google's state-of-the-art research in image recognition called Neural Architecture Search (NAS). NAS is an algorithm that, given your specific dataset, searches for the most optimal neural network to perform a certain task on that dataset. AutoML is then a suite of machine learning tools that allows one to easily train high-performance deep networks without requiring the user to have any knowledge of deep learning or AI; all you need is labelled data! Google uses NAS to find the best network for your specific dataset and task. AutoKeras: The Killer of Google's AutoML | George Seif - KDnuggets
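
The AutoKeras library referenced above exposes the same labelled-data-in, searched-network-out workflow as open source; a minimal sketch (trial count and epochs kept deliberately tiny) might look like this.

```python
import autokeras as ak
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Search over candidate image-classification architectures; max_trials bounds
# how many networks the search is allowed to evaluate.
clf = ak.ImageClassifier(max_trials=3, overwrite=True)
clf.fit(x_train, y_train, epochs=5)

print(clf.evaluate(x_test, y_test))
best_model = clf.export_model()  # a regular Keras model you can save or deploy
```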



Automatic Machine Learning (AML)

Self-Learning

DARTS: Differentiable Architecture Search

YouTube search... ...Google search

AIOps / MLOps

YouTube search... ...Google search

Machine learning capabilities give IT operations teams contextual, actionable insights to make better decisions on the job. More importantly, AIOps is an approach that transforms how systems are automated, detecting important signals from vast amounts of data and relieving the operator from the headaches of managing according to tired, outdated runbooks or policies. In the AIOps future, the environment is continually improving. The administrator can get out of the impossible business of refactoring rules and policies that are immediately outdated in today’s modern IT environment. Now that we have AI and machine learning technologies embedded into IT operations systems, the game changes drastically. AI and machine learning-enhanced automation will bridge the gap between DevOps and IT Ops teams: helping the latter solve issues faster and more accurately to keep pace with business goals and user needs. How AIOps Helps IT Operators on the Job | Ciaran Byrne - Toolbox


MLOps #28 ML Observability // Aparna Dhinakaran - Chief Product Officer at Arize AI
MLOps.community As more and more machine learning models are deployed into production, it is imperative we have better observability tools to monitor, troubleshoot, and explain their decisions. In this talk, Aparna Dhinakaran, Co-Founder and CPO of Arize AI (a Berkeley-based startup focused on ML Observability), will discuss the state of the commonly seen ML production workflow and its challenges. She will focus on the lack of model observability, its impacts, and how Arize AI can help. This talk highlights common challenges seen in models deployed in production, including model drift, data quality issues, distribution changes, outliers, and bias. The talk will also cover best practices to address these challenges and where observability and explainability can help identify model issues before they impact the business. Aparna will be sharing a demo of how the Arize AI platform can help companies validate their models' performance, provide real-time performance monitoring and alerts, and automate troubleshooting of slices of model performance with explainability. The talk will cover best practices in ML Observability and how companies can build more transparency and trust around their models. Aparna Dhinakaran is Chief Product Officer at Arize AI, a startup focused on ML Observability. She was previously an ML engineer at Uber, Apple, and Tubemogul (acquired by Adobe). During her time at Uber, she built a number of core ML infrastructure platforms, including Michelangelo. She has a bachelor's degree from Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group. She is on a leave of absence from the Computer Vision PhD program at Cornell University.

Building an MLOps Toolchain: The Fundamentals
Artificial intelligence and machine learning are the latest “must-have” technologies in helping organizations realize better business outcomes. However, most organizations don’t have a structured process for rolling out AI-infused applications. Data scientists create AI models in isolation from IT, which then needs to insert those models into applications—and ensure their security—to deliver any business value. In this ebook/webinar, we examine the best way to set up an MLOps process to ensure successful delivery of AI-infused applications.


Continuous Machine Learning (CML)

MLOps Tutorial #1: Intro to Continuous Integration for ML
DVCorg Learn how to use one of the most powerful ideas from the DevOps revolution, continuous integration, in your data science and machine learning projects. This hands-on tutorial shows you how to create an automatic model training & testing setup using GitHub Actions and Continuous Machine Learning (CML), two free and open-source tools in the Git ecosystem. Designed for total beginners! We'll be using:

- GitHub Actions: http://github.com/features/actions
- CML: http://github.com/iterative/cml

Resources:

- Code: http://github.com/andronovhopf/wine
- GitLab support: http://github.com/iterative/cml/wiki
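
To give a feel for the shape of such a setup, here is a hedged sketch of the kind of train.py a CML/GitHub Actions job might run: it trains a model, then writes a metrics file and a plot that the CML step can turn into a pull-request comment. The dataset, file names, and report format are assumptions, not taken from the tutorial repo.

```python
import json
import matplotlib
matplotlib.use("Agg")  # no display available in CI
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = model.score(X_te, y_te)

# Metrics file the CI job can compare against the main branch.
with open("metrics.json", "w") as f:
    json.dump({"accuracy": acc}, f)

# Plot and markdown report the CML step can post on the pull request.
ConfusionMatrixDisplay.from_estimator(model, X_te, y_te)
plt.savefig("confusion_matrix.png")

with open("report.md", "w") as f:
    f.write(f"## Model report\n\nAccuracy: {acc:.3f}\n\n![](./confusion_matrix.png)\n")
```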

MLOps Tutorial #3: Track ML models with Git & GitHub Actions
DVCorg In this tutorial, we'll compare ML models across two different Git branches of a project- and we'll do it in a continuous integration system (GitHub Actions) for automation superpowers! We'll cover:

- Why comparing model metrics takes more than a git diff
- How pipelines, a method for making model training more reproducible, help you standardize model comparisons across Git branches
- How to display a table comparing model performance to the main branch in a GitHub Pull Request

Helpful links:

- Dataset: Data on farmers’ adoption of climate change mitigation measures, individual characteristics, risk attitudes and social influences in a region of Switzerland http://www.sciencedirect.com/science/article/pii/S2352340920303048
- Code: http://github.com/elleobrien/farmer
- DVC pipelines & metrics documentation: http://dvc.org/doc/start/data-pipelines#data-pipelines
- CML project repo: http://github.com/iterative/cml
- DVC Discord channel: http://discord.gg/bzA6uY7

Model Monitoring

YouTube search... ...Google search

Monitoring production systems is essential to keeping them running well. For ML systems, monitoring becomes even more important, because their performance depends not just on factors that we have some control over, like infrastructure and our own software, but also on data, which we have much less control over. Therefore, in addition to monitoring standard metrics like latency, traffic, errors and saturation, we also need to monitor model prediction performance. An obvious challenge with monitoring model performance is that we usually don’t have a verified label to compare our model’s predictions to, since the model works on new data. In some cases we might have some indirect way of assessing the model’s effectiveness, for example by measuring click rate for a recommendation model. In other cases, we might have to rely on comparisons between time periods, for example by calculating a percentage of positive classifications hourly and alerting if it deviates by more than a few percent from the average for that time. Just like when validating the model, it’s also important to monitor metrics across slices, and not just globally, to be able to detect problems affecting specific segments. ML Ops: Machine Learning as an Engineering Discipline | Cristiano Breuel - Towards Data Science
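
As a small sketch of the time-period comparison described above, assuming a prediction log with a timestamp column and a binary prediction column (both names are assumptions), one could track the hourly positive-classification rate and flag hours that drift more than a few points from the trailing daily average:

```python
import pandas as pd

def positive_rate_alerts(preds: pd.DataFrame, threshold: float = 0.05) -> pd.Series:
    """Flag hours whose positive-classification rate deviates from the trailing 24h average.

    `preds` is assumed to have a datetime 'timestamp' column and a 0/1 'prediction' column.
    """
    hourly = (
        preds.set_index("timestamp")
             .resample("1h")["prediction"]
             .mean()  # fraction of positive classifications per hour
    )
    baseline = hourly.rolling(window=24, min_periods=12).mean().shift(1)
    deviation = (hourly - baseline).abs()
    alerts = deviation[deviation > threshold]
    for ts, dev in alerts.items():
        print(f"ALERT {ts}: positive rate off by {dev:.1%} vs trailing 24h average")
    return alerts
```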


Josh Wills: Visibility and Monitoring for Machine Learning Models
Josh started our Meetup with a short talk on deploying machine learning models into production. He's worked as the Director of Data Science at Cloudera, he wrote the Java version of Google's AB testing framework, and he recently held the position of Director of Data Engineering at Slack. In his opinion the most important question is: "How often do you want to deploy this?" You should never deploy a machine learning model once. If the problem is not important enough to keep working on it and deploy new models, then it's not important enough to pay the cost of putting it into production in the first place. Watch his talk to get his thoughts on testing machine learning models in production.

Concept Drift: Monitoring Model Quality in Streaming Machine Learning Applications
Most machine learning algorithms are designed to work on stationary data. Yet, real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Here, we review the monitoring methods and evaluate them for applicability in modern fast data and streaming applications.

Monitoring models in production - Jannes Klaas
PyData Amsterdam 2018 A Data Scientist's work is not done once machine learning models are in production. In this talk, Jannes will explain ways of monitoring Keras neural network models in production, how to track model decay, and how to set up alerting using Flask, Docker, and a range of self-built tools. www.pydata.org

Machine Learning Models in Production
Data Scientists and Machine Learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated bunch of assets that require different types of runtimes, resources, and sometimes even specialized compute to operate efficiently. But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? There need to be different techniques for batch, offline inference and for instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either. Enterprises also require additional auditing and authorizations built in, approval processes, and still need to support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model, so enterprises require both metering and allocation of compute resources for SLAs. In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the Private Cloud, optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering development practices to Data Science, organizing the dev-test-production cycle for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies, and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos. Speaker: Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM

Scoring Deployed Models

ML Model Deployment and Scoring on the Edge with Automatic ML & DF / Flink2Kafka
Recorded on June 18, 2020. Machine Learning Model Deployment and Scoring on the Edge with Automatic Machine Learning and Data Flow. Deploying Machine Learning models to the edge can present significant ML/IoT challenges centered around the need for low-latency and accurate scoring in minimal-resource environments. H2O.ai's Driverless AI AutoML and Cloudera Data Flow work nicely together to solve this challenge. Driverless AI automates the building of accurate Machine Learning models, which are deployed as light-footprint, low-latency Java or C++ artifacts, also known as MOJOs (Model Object, Optimized). Cloudera Data Flow leverages Apache NiFi, which offers an innovative data flow framework to host MOJOs and make predictions on data moving at the edge. Speakers: James Medel (H2O.ai - Technical Community Maker), Greg Keys (H2O.ai - Solution Engineer). Kafka 2 Flink - An Apache Love Story: This project was heavily inspired by two existing efforts, Data In Motion's FLaNK Stack and Data Artisans' blog on stateful streaming applications. The goal of this project is to provide insight into connecting Apache Flink applications to Apache Kafka. Speaker: Ian R Brooks, PhD (Cloudera - Senior Solutions Engineer & Data)

Shawn Scully: Production and Beyond: Deploying and Managing Machine Learning Models
PyData NYC 2015 Machine learning has become the key component in building intelligence-infused applications. However, as companies increase the number of such deployments, the number of machine learning models that need to be created, maintained, monitored, tracked, and improved grows at a tremendous pace. This growth has led to a huge (and well-documented) accumulation of technical debt. Developing a machine learning application is an iterative process that involves building multiple models over a dataset. The dataset itself evolves over time as new features and new data points are collected. Furthermore, once deployed, the models require updates over time. Changes in models and datasets become difficult to track over time, and one can quickly lose track of which version of the model used which data and why it was subsequently replaced. In this talk, we outline some of the key challenges in large-scale deployments of many interacting machine learning models. We then describe a methodology for management, monitoring, and optimization of such models in production, which helps mitigate the technical debt. In particular, we demonstrate how to:

- Track models and versions, and visualize their quality over time
- Track the provenance of models and datasets, and quantify how changes in data impact the models being served
- Optimize model ensembles in real time, based on changing data, and provide alerts when such ensembles no longer provide the desired accuracy

