Difference between revisions of "Data Quality"
|title=PRIMO.ai
|titlemode=append
|keywords=ChatGPT, artificial, intelligence, machine, learning, GPT-4, GPT-5, NLP, NLG, NLC, NLU, models, data, singularity, moonshot, Sentience, AGI, Emergence, Explainable, TensorFlow, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Hugging Face, OpenAI, Meta, LLM, metaverse, assistants, agents, digital twin, IoT, Transhumanism, Immersive Reality, Generative AI, Conversational AI, Perplexity, Bing, You, Bard, Ernie, prompt Engineering, LangChain, Video/Image, Vision, End-to-End Speech, Synthesize Speech, Speech Recognition, Stanford, MIT
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-4GCWLBVJ7T"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'G-4GCWLBVJ7T');
</script>
}}
[https://www.youtube.com/results?search_query=ai+Data+Quality YouTube]
[https://www.quora.com/search?q=ai%20Data%20Quality ...Quora]
[https://www.google.com/search?q=ai+Data+Quality ...Google search]
[https://news.google.com/search?q=ai+Data+Quality ...Google News]
[https://www.bing.com/news/search?q=ai+Data+Quality&qft=interval%3d%228%22 ...Bing News]
* [[Data Quality]] ...[[AI Verification and Validation|validity]], [[Evaluation - Measures#Accuracy|accuracy]], [[Data Quality#Data Cleaning|cleaning]], [[Data Quality#Data Completeness|completeness]], [[Data Quality#Data Consistency|consistency]], [[Data Quality#Data Encoding|encoding]], [[Data Quality#Zero Padding|padding]], [[Data Quality#Data Augmentation, Data Labeling, and Auto-Tagging|augmentation, labeling, auto-tagging]], [[Data Quality#Batch Norm(alization) & Standardization|normalization, standardization]], and [[Data Quality#Imbalanced Data|imbalanced data]]
* [[Data Science]] ... [[Data Governance|Governance]] ... [[Data Preprocessing|Preprocessing]] ... [[Feature Exploration/Learning|Exploration]] ... [[Data Interoperability|Interoperability]] ... [[Algorithm Administration#Master Data Management (MDM)|Master Data Management (MDM)]] ... [[Bias and Variances]] ... [[Benchmarks]] ... [[Datasets]]
* [[Risk, Compliance and Regulation]] ... [[Ethics]] ... [[Privacy]] ... [[Law]] ... [[AI Governance]] ... [[AI Verification and Validation]]
* [[Natural Language Processing (NLP)#Managed Vocabularies|Managed Vocabularies]]
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
* [[Analytics]] ... [[Visualization]] ... [[Graphical Tools for Modeling AI Components|Graphical Tools]] ... [[Diagrams for Business Analysis|Diagrams]] & [[Generative AI for Business Analysis|Business Analysis]] ... [[Requirements Management|Requirements]] ... [[Loop]] ... [[Bayes]] ... [[Network Pattern]]
* [[Development]] ... [[Notebooks]] ... [[Development#AI Pair Programming Tools|AI Pair Programming]] ... [[Codeless Options, Code Generators, Drag n' Drop|Codeless]] ... [[Hugging Face]] ... [[Algorithm Administration#AIOps/MLOps|AIOps/MLOps]] ... [[Platforms: AI/Machine Learning as a Service (AIaaS/MLaaS)|AIaaS/MLaaS]]
* [[Backpropagation]] ... [[Feed Forward Neural Network (FF or FFNN)|FFNN]] ... [[Forward-Forward]] ... [[Activation Functions]] ... [[Softmax]] ... [[Loss]] ... [[Boosting]] ... [[Gradient Descent Optimization & Challenges|Gradient Descent]] ... [[Algorithm Administration#Hyperparameter|Hyperparameter]] ... [[Manifold Hypothesis]] ... [[Principal Component Analysis (PCA)|PCA]]
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Train, Validate, and Test]]
* [[Artificial General Intelligence (AGI) to Singularity]] ... [[Inside Out - Curious Optimistic Reasoning|Curious Reasoning]] ... [[Emergence]] ... [[Moonshots]] ... [[Explainable / Interpretable AI|Explainable AI]] ... [[Algorithm Administration#Automated Learning|Automated Learning]]
* [https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007 The AI Hierarchy of Needs | Monica Rogati - Hackernoon]
* [https://greatexpectations.io/ Great Expectations] ...helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
<img src="https://hackernoon.com/hn-images/1*7IMev5xslc9FLxr9hHhpFw.png" width="800">
<youtube>aUGCxTgvFf0</youtube>
<b>Testing and Documenting Your Data Doesn't Have to Suck | Superconductive
</b><br>Data teams everywhere struggle with pipeline debt: untested, undocumented assumptions that drain productivity, erode trust in data, and kill team morale. Unfortunately, rolling your own data validation tooling usually takes weeks or months. In addition, most teams suffer from “documentation rot,” where data documentation is hard to maintain, and therefore chronically outdated, incomplete, and only semi-trusted. Great Expectations - https://bit.ly/2OtmY1W, the leading open source project for fighting pipeline debt, can solve these problems for you. We're excited to share new features and under-the-hood architecture with the data community. ABOUT THE SPEAKER:
Abe Gong is a core contributor to the Great Expectations open source library, and CEO and Co-founder at Superconductive. Prior to Superconductive, Abe was Chief Data Officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe has been leading teams using data and technology to solve problems in health care, consumer wellness, and public policy for over a decade. Abe earned his PhD at the University of Michigan in Public Policy, Political Science, and Complex Systems. He speaks and writes regularly on data, healthcare, and data [[ethics]].
|}
|<!-- M -->
<youtube>t7vHpA39TXM</youtube>
<b>An Approach to Data Quality for Netflix Personalization Systems
</b><br>Personalization is one of the key pillars of Netflix, as it enables each member to experience the vast collection of content tailored to their interests. Our personalization system is powered by several machine learning models. These models are only as good as the data that is fed to them. They are trained on hundreds of terabytes of data every day, which makes it a non-trivial challenge to track and maintain data quality. To ensure high data quality, we require three things: automated monitoring of data; visualization to observe changes in the metrics over time; and mechanisms to control data-related regressions, wherein a data regression is defined as data loss or distributional shifts over a given period of time. In this talk, we will describe the infrastructure and methods that we used to achieve the above: ‘swimlanes’ that help us define data boundaries for the different environments used to develop, evaluate, and deploy ML models; pipelines that aggregate data metrics from various sources within each swimlane; time series and dashboard visualization tools across an atypically large period of time; and automated audits that periodically monitor these metrics to detect data regressions. We will explain how we run aggregation jobs to optimize metric computations, SQL queries to quickly define/test individual metrics, and other ETL jobs to power the visualization/audit tools using Spark. About: [[Databricks]] provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering, and business. Connect with us: Website: https://databricks.com [[Meta|Facebook]]: https://www.facebook.com/databricksinc
|}
|<!-- M -->
= <span id="Sourcing Data"></span>Sourcing Data =
[https://www.youtube.com/results?search_query=Data+Sourcing+machine+learning+ML YouTube search...]
[https://www.google.com/search?q=Data+Sourcing+machine+learning+ML ...Google search]
* [https://eclass.teicrete.gr/modules/document/file.php/DLH105/Research%20Methods%20for%20Business%20Students%2C%205th%20Edition.pdf Research Methods for Business Students | M. Saunders, P. Lewis, and A. Thornhill]
* [https://www.solvexia.com/blog/10-data-sourcing-best-practices-for-reporting 10 Data Sourcing Best Practices for Reporting | SolveXia]
* [https://synthio.com/b2b-blog/dos-nots-data-sourcing/ Do’s and Do Not’s of Data Sourcing | Synthio]
<img src="https://15writers.com/wp-content/uploads/2019/05/research-onion.jpg" width="800">
{|<!-- T -->
= <span id="Data Cleaning"></span>Data Cleaning =
[https://www.youtube.com/results?search_query=Data+Cleaning+machine+learning+ML YouTube search...]
[https://www.google.com/search?q=Data+Cleaning+machine+learning+ML ...Google search]
* [https://www.kaggle.com/rtatman/data-cleaning-challenge-json-txt-and-xls/ Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman]
* [https://towardsdatascience.com/machine-learning-for-data-cleaning-and-unification-b3213bbd18e Machine learning for data cleaning and unification | Abizer Jafferjee - Towards Data Science]
* [https://www.infoworld.com/article/3394399/machine-learning-algorithms-explained.html Machine learning algorithms explained | Martin Heller - InfoWorld]
* [https://www.trifacta.com/ From Messy Files To Automated Analytics | Trifacta]
* [https://www.paxata.com/ The Data Prep for AI Toolkit: Smarter ML Models Through Faster, More Accurate Data Prep | Paxata]
* [https://www.alteryx.com/e-book/age-badass-analyst The Age of The Badass Analyst | Alteryx]
When it comes to utilizing ML data, most of the time is spent on cleaning data sets or creating a dataset that is free of errors. Setting up a quality plan, filling missing values, removing rows, and reducing data size are some of the best practices used for data cleaning in machine learning. [https://www.einfochips.com/blog/data-cleaning-in-machine-learning-best-practices-and-methods/ Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas]

Overall, incorrect data is either removed, corrected, or imputed... [https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4 The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science]

# Irrelevant data - data points that are not actually needed and don’t fit the [[context]] of the problem we’re trying to solve.
# Duplicates - data points that are repeated in your dataset.
# Type conversion - make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on. Categorical values can be converted into and from numbers if needed.
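The three steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the field names and values are invented, and real projects would typically use pandas for the same operations.

```python
# Hypothetical toy dataset; in practice this would be read from a file.
rows = [
    {"age": "34", "income": "72000", "favorite_color": "blue"},
    {"age": "34", "income": "72000", "favorite_color": "blue"},   # duplicate row
    {"age": "29", "income": None,    "favorite_color": "green"},  # missing value
]

# 1. Irrelevant data: drop fields that don't fit the problem context.
rows = [{k: v for k, v in r.items() if k != "favorite_color"} for r in rows]

# 2. Duplicates: keep only the first occurrence of each identical row.
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Type conversion, plus imputing the missing value with a simple
#    fallback (a median or model-based estimate would be used in practice).
cleaned = []
for r in deduped:
    cleaned.append({
        "age": int(r["age"]),
        "income": float(r["income"]) if r["income"] is not None else 0.0,
    })

print(cleaned)
```

The duplicate row is dropped, the string fields become numbers, and the missing income is imputed, leaving two clean records.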
<youtube>NsD6Wn4KSFY</youtube>
<b>Machine Learning Tutorial 11 - Cleaning Bad Data
</b><br>Best Machine Learning book: https://amzn.to/2MilWH0 (Fundamentals of Machine Learning for Predictive Data Analytics). Machine Learning and [[Predictive Analytics]]. #MachineLearning One of the processes in machine learning is data cleaning. This is the process of eliminating bad data and performing the needed transformations to make our data suitable for a machine learning algorithm. This online course covers big data analytics stages using machine learning and [[Predictive Analytics]]. Big data and [[Predictive Analytics]] is one of the most popular applications of machine learning and is foundational to getting deeper insights from data. Starting off, this course will cover machine learning algorithms, supervised learning, data planning, data cleaning, data visualization, models, and more. This self-paced series is perfect if you are pursuing an online computer science degree, online data science degree, online artificial intelligence degree, or if you just want to get more machine learning experience. Enjoy!
|}
|<!-- M -->
<youtube>JkYE7ghu1UE</youtube>
<b>Machine Learning Tutorial 12 - Cleaning Missing Values (NULL)
</b><br>Best Machine Learning book: https://amzn.to/2MilWH0 (Fundamentals of Machine Learning for Predictive Data Analytics). Machine Learning and [[Predictive Analytics]]. #MachineLearning One of the processes in machine learning is data cleaning. This video deals specifically with missing values. This online course covers big data analytics stages using machine learning and [[Predictive Analytics]]. Big data and [[Predictive Analytics]] is one of the most popular applications of machine learning and is foundational to getting deeper insights from data. Starting off, this course will cover machine learning algorithms, supervised learning, data planning, data cleaning, data visualization, models, and more. This self-paced series is perfect if you are pursuing an online computer science degree, online data science degree, online artificial intelligence degree, or if you just want to get more machine learning experience. Enjoy!
|}
|}<!-- B -->
<youtube>hEmSa1bJZpk</youtube>
<b>Missing data in various features: Data cleaning and understanding | Applied AI Course
</b><br>In this video, let's see how to deal with missing values. Applied AI Course (AAIC Technologies Pvt. Ltd.) is an ed-tech company based in Hyderabad offering online training in machine learning and artificial intelligence. Applied AI Course, through its unparalleled curriculum, aims to bridge the gap between industry requirements and the skill set of aspiring candidates by churning out highly skilled machine learning professionals who are well prepared to tackle real-world business problems. For more information please visit https://www.appliedaicourse.com/
|}
|<!-- M -->
= <span id="Data Encoding"></span>Data Encoding =
[https://www.youtube.com/results?search_query=Data+Encoding+machine+learning+ML YouTube search...]
[https://www.google.com/search?q=Data+Encoding+machine+learning+ML ...Google search]
* [[...predict categories]] with classification
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.
# One is label encoding, which means that each text label value is replaced with a number.
# The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered. [https://www.infoworld.com/article/3394399/machine-learning-algorithms-explained.html Machine learning algorithms explained | Martin Heller - InfoWorld]
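A minimal pure-Python sketch of the two encodings, using made-up labels (real projects would typically reach for scikit-learn's LabelEncoder/OneHotEncoder or pandas.get_dummies):

```python
# Toy categorical feature with invented values.
labels = ["red", "green", "blue", "green"]

# Label encoding: each distinct value becomes an integer. Risky for
# non-ordinal data, because 0 < 1 < 2 implies an order that isn't real.
categories = sorted(set(labels))            # ['blue', 'green', 'red']
to_index = {c: i for i, c in enumerate(categories)}
label_encoded = [to_index[v] for v in labels]

# One-hot encoding: each distinct value becomes its own binary column,
# so no spurious ordering is introduced.
one_hot = [[1 if v == c else 0 for c in categories] for v in labels]

print(label_encoded)  # [2, 1, 0, 1]
print(one_hot)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Note how label encoding maps "blue" to 0 and "red" to 2, inventing an ordering; the one-hot columns carry no such implication.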
{|<!-- T -->
= <span id="Data Augmentation, Data Labeling, and Auto-Tagging"></span>Data Augmentation, Data Labeling, and Auto-Tagging =
[https://www.youtube.com/results?search_query=Data+Augmentation YouTube search...]
[https://www.google.com/search?q=Data+Augmentation+deep+machine+learning+ML ...Google search]
* [https://medium.com/nanonets/nanonets-how-to-use-deep-learning-when-you-have-limited-data-f68c0b512cab Data Augmentation | How to use Deep Learning when you have Limited Data | Bharath Raj]
* [https://www.kaggle.com/c/passenger-screening-algorithm-challenge/discussion/45805 Passenger Screening - How Data Augmentation helped to win]
* Tools: [https://labelbox.com/ Labelbox], [https://scale.com/ Scale AI], [https://appen.com/ Appen], [[Amazon]] [[SageMaker]], [[Google]] AI, [[Microsoft]] Azure Machine Learning
* Data Augmentation as a best practice for addressing the [[Overfitting Challenge]]
* [https://scale.com Scale] ...training and validation data for AI applications. After sending us your data via API call, our platform, through a combination of human work and review, smart tools, statistical confidence checks, and machine learning checks, returns scalable, accurate ground truth data.
* [https://snorkel.ai/ Snorkel AI]
* [https://info.cloudfactory.com/ CloudFactory]
* [https://labelbox.com/ Labelbox]
* [https://scale.com/ Scale AI]
Data augmentation is the process of taking the data you currently have and modifying it in a realistic but randomized way, to increase the variety of data seen during training. As an example for images, slightly rotating, zooming, and/or translating the image will result in the same content but with a different framing. This is representative of the real-world scenario, and so will improve training. It's worth double-checking that the output of the data augmentation is still realistic. To determine what types of augmentation to use, and how much of it, do some trial and error. Try each augmentation type on a sample set, with a variety of settings (e.g. 1% translation, 5% translation, 10% translation) and see what performs best on the sample set. Once you know the best setting for each augmentation type, try adding them all at the same time. [https://wiki.fast.ai/index.php?title=Over-fitting Over-fitting | Deep Learning Course Wiki]

Note: In [[Keras]], we can perform transformations using ImageDataGenerator.

https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/Screen-Shot-2018-04-04-at-12.14.45-AM.png
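To make the idea concrete, here is a hand-rolled sketch of randomized flip-and-translate augmentation on a toy 2-D "image" (a nested list of invented pixel values); the `augment` helper is purely illustrative, since in [[Keras]] the same transformations come built in via ImageDataGenerator.

```python
import random

def augment(image, max_shift=1, seed=None):
    """Return a randomly flipped and horizontally shifted copy of a 2-D image."""
    rng = random.Random(seed)
    out = [row[:] for row in image]
    if rng.random() < 0.5:                      # random horizontal flip
        out = [row[::-1] for row in out]
    shift = rng.randint(-max_shift, max_shift)  # small random translation
    if shift:
        pad = [0] * abs(shift)                  # zero-fill the vacated pixels
        out = [pad + row[:-shift] if shift > 0 else row[-shift:] + pad
               for row in out]
    return out

image = [[1, 2, 3],
         [4, 5, 6]]
print(augment(image, seed=0))
```

Each call yields a same-sized variant of the input, so one original image can feed many slightly different training examples.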
<youtube>CfG3S0yarsk</youtube>
<b>FastAI Webinar Series: Part 5 - Training with Data Augmentation
</b><br>Aakash N S [https://www.kaggle.com/aakashns/fastai-002b-augment-training Code]
|}
|<!-- M -->
||
<youtube>Mtf6bdhuUUU</youtube>
<b>YOW! Data 2018 - Atif Rahman - [[Privacy]] Preserved Data Augmentation #YOWData
</b><br>Enterprises hold data that has potential value outside their own firewalls. We have been trying to figure out how to share such data at a level of detail with others in a secure, safe, legal, and risk-mitigated manner that ensures a high level of [[privacy]] while adding tangible economic and social value. Enterprises are facing numerous roadblocks, failed projects, inadequate business cases, and issues of scale that need newer techniques, technology, and approaches. In this talk, we will set up the groundwork for scalable data augmentation for organizations, visualizing technical architectures and solutions around the emerging technologies of data fabrics, edge computing, and a second coming of data virtualisation. A self-assessment toolkit will be shared for people interested in applying it to their organizations.
|}
|}<!-- B -->
− | == [ | + | == [https://www.techopedia.com/definition/28033/data-augmentation What does Data Augmentation mean? | Techopedia] == |
Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual intervention required to developed meaningful information and insight of business data, as well as significantly enhance [[Data Quality|data quality]]. | Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual intervention required to developed meaningful information and insight of business data, as well as significantly enhance [[Data Quality|data quality]]. | ||
== Data Labeling ==
[https://www.youtube.com/results?search_query=Data+Labeling YouTube search...]
[https://www.google.com/search?q=Data+Labeling+deep+machine+learning+ML ...Google search]
* [https://venturebeat.com/2019/06/12/essential-tips-for-scaling-quality-ai-data-labeling/ Essential tips for scaling quality AI data labeling | Damian Rochman - VentureBeat]
* [https://towardsdatascience.com/four-mistakes-you-make-when-labeling-data-7e431c4438a2 Four Mistakes You Make When Labeling Data | Tal Perry - Towards Data Science]
* [https://labelbox.com/learn/buy-vs-build Building vs. Buying a training data annotation solution | Labelbox]
* [https://medium.com/memory-leak/data-labeling-creating-ground-truth-44e64da6cc4f Data Labeling: Creating Ground Truth | Astasia Myers - Medium]
* Tools/Services:
** [https://cloud.google.com/ai-platform/ AI Platform Data Labeling Service | Google]
** [https://humansintheloop.org Humans in the Loop]
** [https://hcaptcha.com/labeling hCaptcha]
** [https://www.alegion.com/ Alegion]
** [https://www.clickworker.com/crowdsourcing-glossary/data-labeling/ Clickworker]
** [https://www.globalme.net/data-labeling-classification Globalme]
− | Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. [ | + | Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. [https://en.wikipedia.org/wiki/Labeled_data Wikipedia] |
− | Automation has put low-skill jobs at risk for decades. And self-driving cars, robots, and | + | Automation has put low-skill jobs at risk for decades. And self-driving cars, robots, and [[Speech Recognition]] will continue the trend. But, some experts also see new opportunities in the automated age. ...the curation of data, where you take raw data and you clean it up and you have to kind of organize it for machines to ingest [https://www.techrepublic.com/article/is-data-labeling-the-new-blue-collar-job-of-the-ai-era/ Is 'data labeling' the new blue-collar job of the AI era? | Hope Reese - TechRepublic] |
{|<!-- T -->
||
<youtube>GAZ9s2Shzxw</youtube>
<b>How 'AI Farms' Are At The Forefront Of [[Government Services#China|China]]'s Global Ambitions | TIME
</b><br>As [[Government Services#China|China]]’s economy slows and rising wages make manufacturing less competitive, the ruling [[Government Services#China|Chinese]] Communist Party (CCP) is turning to technology to arrest the slide. Strategic technologies such as AI are a key focus. Subscribe to TIME https://po.st/SubscribeTIME
|}
|<!-- M -->
||
<youtube>tMZgRTQ-hv4</youtube>
<b>[[Government Services#China|China]]’s Big AI Advantage: Humans
</b><br>Seemingly “intelligent” devices like self-driving trucks aren’t actually all that intelligent. In order to avoid plowing into other cars or making illegal lane changes, they need a lot of help. In [[Government Services#China|China]], that help is increasingly coming from rooms full of college students. Li Zhenwei is a data labeler. His job, which didn’t even exist a few years ago, involves sitting at a computer, clicking frame-by-frame through endless hours of dashcam footage, and drawing lines over each photo to help the computer recognize lane markers. “Every good-looking field has people working behind the scenes,” says Li. “I'd prefer to be an anonymous hero.”
|}
|}<!-- B -->

* [https://www.kdnuggets.com/2017/06/acquiring-quality-labeled-training-data.html 7 Ways to Get High-Quality Labeled Training Data at Low Cost | James Kobielus - KDnuggets]
* [https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html How to Organize Data Labeling for Machine Learning: Approaches and Tools | AltexSoft - KDnuggets]

https://www.altexsoft.com/media/2018/03/Screenshot_labeling_final2.png
=== <span id="Auto-tagging"></span>Auto-tagging ===
[https://www.youtube.com/results?search_query=Auto+tagging+deep+machine+learning+ML Youtube search...]
[https://www.google.com/search?q=Auto+tagging+deep+machine+learning+ML ...Google search]

* [[...predict categories]] (classification)
** [[SharePoint]]
* [[Natural Language Tools & Services]] for Text labeling
* [https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html Image and video labeling:]
** [https://annotorious.github.io/ Annotorious] the MIT-licensed free web image annotation and labeling tool. It allows for adding text comments and drawings to images on a website. The tool can be easily integrated with only two lines of additional code.
** [https://openseadragon.github.io/ OpenSeadragon] an open-source, web-based viewer for high-resolution zoomable images, implemented in pure JavaScript, for desktop and mobile.
** [https://labelme2.csail.mit.edu/Release3.0/index.php LabelMe] open online tool. Software must assist users in building image databases for computer vision research, its developers note. Users can also download the MATLAB toolbox that is designed for working with images in the LabelMe public dataset.
** [https://cvhci.anthropomatik.kit.edu/~baeuml/projects/a-universal-labeling-tool-for-computer-vision-sloth/ Sloth] allows users to label image and video files for computer vision research. Face recognition is one of Sloth’s common use cases.
** [https://github.com/Microsoft/VoTT Visual Object Tagging Tool (VoTT)] labeling is one of the model [[development]] stages that VoTT supports. This tool also allows data scientists to train and validate object detection models.
** [https://labelbox.com/ Labelbox] build computer vision products for the real world. A complete solution for your training data problem with fast labeling tools, human workforce, data management, a powerful API and automation features.
** [https://alpslabel.wordpress.com/2017/01/26/alt/ Alp’s Labeling Tool] macro code allows easy labeling of images, and creates text files compatible with Detectnet / KITTI dataset format.
** [https://github.com/davisking/dlib/tree/master/tools/imglab imglab] graphical tool for annotating images with object bounding boxes and optionally their part locations. Generally, you use it when you want to train an object detector (e.g. a face detector) since it allows you to easily create the needed training dataset.
** [https://www.robots.ox.ac.uk/~vgg/software/via/ VGG Image Annotator (VIA)] simple and standalone manual annotation software for image, audio and video
** [https://wordpress.org/plugins/demon-image-annotation/ Demon image annotation plugin] allows you to add textual annotations to images by selecting a region of the image and then attaching a textual description, the concept of annotating images with user comments. Integration with JQuery Image Annotation
** [https://github.com/christopher5106/FastAnnotationTool FastAnnotationTool (FIAT)] enables image data annotation, data augmentation, data extraction, and result visualisation/validation.
** [https://rectlabel.com/ RectLabel] an image annotation tool to label images for bounding box object detection and segmentation.
* [https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html Audio labeling:]
** [[Discriminative vs. Generative#Snorkel|Snorkel]] a [[Python library]] to help you label data for supervised machine learning tasks
** [https://www.fon.hum.uva.nl/praat/ Praat] free software for labeling audio files, mark timepoints of events in the audio file and annotate these events with text labels in a lightweight and portable TextGrid file.
** [https://github.com/felixbur/Speechalyzer Speechalyzer] a tool for the daily work of a 'speech worker'. It is optimized to process large speech data sets with respect to transcription, labeling and annotation.
** [https://github.com/ritazh/EchoML EchoML] tool for audio file annotation. It allows users to visualize their data.
{|<!-- T -->
==== <span id="Synthetic Labeling"></span>Synthetic Labeling ====
* [[Generative AI]]

This approach entails generating data that imitates real data in terms of essential parameters set by a user. Synthetic data is produced by a [[Generative AI|generative]] model that is trained and validated on an original dataset. There are three types of [[Generative AI|generative]] models: (1) [[Generative Adversarial Network (GAN)]]; [[Generative AI|generative]]/discriminative, (2) [[Autoregressive]] models (ARs); previous values, and (3) [[Variational Autoencoder (VAE)]]; encoding/decoding.
{|<!-- T -->
||
<youtube>riT9KTkBj0E</youtube>
<b>PyCon.DE 2017 Hendrik Niemeyer - Synthetic Data for Machine Learning Applications
</b><br>Dr. Hendrik Niemeyer (@hniemeye) Data Scientist working on [[Predictive Analytics]] with data from pipeline inspection measurements.
Tags: data-science [[Python]] machine learning ai In this talk I will show how we use real and synthetic data to create successful models for risk assessing pipeline anomalies. The main focus is the estimation of the difference in the statistical properties of real and generated data by machine learning methods. ROSEN provides [[Predictive Analytics]] for pipelines by detecting and risk assessing anomalies from data gathered by inline inspection measurement devices. Due to budget reasons (pipelines need to be dug up to get access) ground truth data for machine learning applications in this field are usually scarce, imbalanced and not available for all existing configurations of measurement devices. This creates the need for synthetic data (using FEM simulations and unsupervised learning algorithms) in order to be able to create successful models. But a naive mixture of real-world and synthetic samples in a model does not necessarily yield increased predictive performance because of differences in the statistical distributions in feature space. I will show how we evaluate the use of synthetic data besides simple visual inspection. Manifold learning (e.g. TSNE) can be used to gain an insight whether real and generated data are inherently different. Quantitative approaches like classifiers trained to discriminate between these types of data provide a non visual insight whether a "synthetic gap" in the feature distributions exists. If the synthetic data is useful for model building, careful consideration has to be applied when constructing cross validation folds and test sets to prevent biased estimates of the model performance. Recorded at PyCon.DE 2017 Karlsruhe: pycon.de Video editing: Sebastian Neubauer & Andrei Dan Tools: Blender, Avidemux & Sonic Pi
|}
|}<!-- B -->
= <span id="Batch Norm(alization) & Standardization"></span>Batch Norm(alization) & Standardization =
[https://www.youtube.com/results?search_query=batch+norm+Normalization+standardize+standard+data+set Youtube search...]
[https://www.google.com/search?q=batch+norm+Normalization+standardize+standard+data+set+deep+machine+learning+ML ...Google search]

* [https://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/ An Overview of Normalization Methods in Deep Learning | keitakurita]

To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges may tend to dominate the Euclidean distance between feature vectors, their effects can be magnified at the expense of the other fields, and the steepest descent optimization may have difficulty converging. There are a number of ways to normalize and standardize data for ML, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling. [https://www.infoworld.com/article/3394399/machine-learning-algorithms-explained.html Machine learning algorithms explained | Martin Heller - InfoWorld]
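Two of the feature-scaling methods just mentioned can be sketched in a few lines of plain Python; the income figures and function names below are illustrative toy examples, not taken from any cited tool:

```python
import statistics

def min_max_scale(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization (z-scores): zero mean, unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

incomes = [30_000, 45_000, 60_000, 120_000]
print(min_max_scale(incomes))  # all values now lie between 0.0 and 1.0
print(standardize(incomes))    # values now centered on a mean of 0
```

After either transform, an income column in the tens of thousands no longer swamps a feature measured in single digits when computing Euclidean distances.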
When feeding data into a machine learning model, the data should usually be "normalized". This means scaling the data so that it has a mean and standard deviation within "reasonable" limits. This is to ensure the objective functions in the machine learning model will work as expected and not focus on a specific feature of the input data. Without normalizing inputs the model may be extremely fragile. Batch normalization is an extension of this concept. Instead of just normalizing the data at the input to the neural network, batch normalization adds layers to allow normalization to occur at the input to each convolutional layer. [https://wiki.fast.ai/index.php/Over-fitting Deep Learning Course Wiki]

Batch Norm is a normalization method that normalizes [[Activation Functions]] in a network across the mini-batch. For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.
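The per-feature arithmetic described above can be sketched in plain Python. This toy version only does the normalization step; a real Batch Norm layer also applies learned scale-and-shift (gamma/beta) parameters and tracks running statistics for inference, which are omitted here:

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across a mini-batch of rows:
    subtract the mini-batch mean, divide by the mini-batch std deviation."""
    n = len(batch)
    cols = list(zip(*batch))  # transpose: one tuple per feature
    means = [sum(c) / n for c in cols]
    variances = [sum((x - m) ** 2 for x in c) / n for c, m in zip(cols, means)]
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, means, variances)]
        for row in batch
    ]

# Two features on very different scales; after Batch Norm both columns
# have (approximately) zero mean and unit variance.
mini_batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normed = batch_norm(mini_batch)
```

The small `eps` term guards against division by zero when a feature is constant within the mini-batch.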
{|<!-- T -->
||
<youtube>nUUqwaxLnWs</youtube>
<b>Why Does Batch Norm Work? (C2W3L06)
</b><br>Take the Deep Learning Specialization: https://bit.ly/2x614g3 Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
|}
|<!-- M -->
||
<youtube>em6dfRxYkYU</youtube>
<b>Fitting Batch Norm Into Neural Networks (C2W3L05)
</b><br>Take the Deep Learning Specialization: https://bit.ly/2vAwCKt Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
|}
|}<!-- B -->
= <span id="Data Completeness"></span>Data Completeness =
[https://www.youtube.com/results?search_query=Completeness+data+AI+deep+machine+learning Youtube search...]
[https://www.google.com/search?q=Completeness+data+AI+deep+machine+learning ...Google search]

{|<!-- T -->
= <span id="Zero Padding"></span>Zero Padding =
[https://www.youtube.com/results?search_query=zero+padding+Dimensional+Reduction+Algorithm Youtube search...]
[https://www.google.com/search?q=zero+padding+Dimensional+Reduction+Algorithm+machine+learning+ML+artificial+intelligence ...Google search]

* [[Pooling / Sub-sampling: Max, Mean]]

{|<!-- T -->
||
<youtube>qSTv_m-KFk0</youtube>
<b>Zero Padding in Convolutional Neural Networks explained
</b><br>Let's start out by explaining the motivation for zero padding and then we get into the details about what zero padding actually is. We then talk about the types of issues we may run into if we don’t use zero padding, and then we see how we can implement zero padding in code using [[Keras]]. We build on some of the ideas that we discussed in our video on Convolutional Neural Networks, so if you haven’t seen that yet, go ahead and check it out, and then come back to watch this video once you’ve finished up there. https://youtu.be/YRhxdVk_sIs
|}
|<!-- M -->
||
<youtube>smHa2442Ah4</youtube>
<b>C4W1L04 Padding
</b><br>Take the Deep Learning Specialization: https://bit.ly/330te8c Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
|}
|}<!-- B -->
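Mechanically, zero padding just surrounds a feature map with rings of zeros so that a subsequent convolution can produce an output the same size as its input (e.g. one ring of zeros for a 3x3 kernel). A minimal pure-Python sketch on a toy 2x2 "image":

```python
def zero_pad(image, pad):
    """Surround a 2-D feature map (list of rows) with `pad` rings of zeros."""
    width = len(image[0]) + 2 * pad
    zero_row = [0] * width
    padded = [zero_row[:] for _ in range(pad)]          # top rows of zeros
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # pad left and right
    padded += [zero_row[:] for _ in range(pad)]         # bottom rows of zeros
    return padded

img = [[1, 2],
       [3, 4]]
print(zero_pad(img, 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

In frameworks this is usually a layer option (e.g. "same" padding) rather than something written by hand.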
= <span id="Imbalanced Data"></span>Imbalanced Data =
[https://www.youtube.com/results?search_query=Undersampling+Technique+deep+machine+learning+ML+artificial+intelligence Youtube search...]
[https://www.google.com/search?q=Undersampling+Technique+deep+machine+learning+ML+artificial+intelligence ...Google search]

What is imbalanced data? The definition is straightforward: a dataset is imbalanced if at least one of the classes constitutes only a very small minority. Imbalanced data prevails in banking, insurance, engineering, and many other fields. It is common in fraud detection for the imbalance to be on the order of 100 to 1. ... The issue of class imbalance can result in a serious bias towards the majority class, reducing the classification performance and increasing the number of false negatives. How can we alleviate the issue? The most commonly used techniques are data resampling: either under-sampling the majority class, over-sampling the minority class, or a mix of both. [https://towardsdatascience.com/@Dataman.ai Dataman - Towards Data Science]
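The two resampling strategies can be sketched in plain Python. The transaction records below are made-up toy data, and in practice you would reach for a library implementation (e.g. the imbalanced-learn package) rather than roll your own:

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly discard majority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

legit = [("txn", i, "ok") for i in range(100)]    # majority class (100 rows)
fraud = [("txn", i, "fraud") for i in range(3)]   # minority class (3 rows)

balanced = undersample(legit, fraud)
print(len(balanced))  # 6 rows: 3 of each class
```

Under-sampling throws information away; over-sampling risks overfitting to the duplicated minority rows, which is why more elaborate schemes (e.g. synthetic minority samples) are often preferred.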
Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exist for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. | Natalie Hockham

{|<!-- T -->
||
<youtube>X9MZtvvQDR4</youtube>
<b>Natalie Hockham: Machine learning with imbalanced data sets
</b><br>Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exist for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. This talk looks at these different approaches in the [[context]] of fraud detection.
|}
|<!-- M -->
|}
|}<!-- B -->

== <span id="Under-sampling"></span>Under-sampling ==
* [https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-part-i-under-sampling-a8dbc3d8d6d8 Using Under-Sampling Techniques for Extremely Imbalanced Data | Dataman - Towards Data Science]

https://miro.medium.com/max/335/1*YH_vPYQEDIW0JoUYMeLz_A.png

{|<!-- T -->

== <span id="Over-sampling"></span>Over-sampling ==
* [https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-part-ii-over-sampling-d61b43bc4879 Using Over-Sampling Techniques for Extremely Imbalanced Data | Dataman - Towards Data Science]

https://miro.medium.com/max/375/1*aKJJOozIlVVH1gT-4rYy4w.png
{|<!-- T -->

= <span id="Skewed Data"></span>Skewed Data =
[https://www.youtube.com/results?search_query=Skewed+Data+artificial+intelligence Youtube search...]
[https://www.google.com/search?q="Skewed+Data"+artificial+intelligence ...Google search]

{|<!-- T -->
||
<youtube>6zg7NTw-kTQ</youtube>
<b>Working with Skewed Data: The Iterative Broadcast - Rob Keevil & Fokko Driesprong
</b><br>"Skewed data is the enemy when joining tables using Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and resulting in out of [[memory]] errors. The go-to answer is to use broadcast joins; leaving the large, skewed dataset in place and transmitting a smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast, and does not fit into [[memory]]? Or even worse, when a single key is bigger than the total size of your executor? Firstly, we will give an introduction into the problem. Secondly, the current ways of fighting the problem will be explained, including why these solutions are limited. Finally, we will demonstrate a new technique - the iterative broadcast join - developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully, while retaining a high level of parallelism. This is something that is not possible with existing Spark join types. Session hashtag: #EUde11" About: [[Databricks]] provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business. Website: https://databricks.com
|}
|}<!-- B -->
= <span id="Data Consistency"></span>Data Consistency =
[https://www.youtube.com/results?search_query=data+Consistency+artificial+intelligence+deep+machine+learning YouTube search...]
[https://www.google.com/search?q=data+Consistency+artificial+intelligence+deep+machine+learning ...Google search]

* [https://www.magazine-industry-usa.com/news/27737-consistent-data-is-key-to-ai-process-optimization Consistent Data Is Key to AI Process Optimization] | [https://dataprophet.com/ DataProphet]
* [https://outlier.ai/data-driven-daily/dirty-data-data-consistency/ Dirty Data: Data Consistency | Outlier]
* [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.317.2024&rep=rep1&type=pdf Improving Data Quality: Consistency and Accuracy | G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma]
Latest revision as of 21:00, 26 April 2024
- Data Science ... Governance ... Preprocessing ... Exploration ... Interoperability ... Master Data Management (MDM) ... Bias and Variances ... Benchmarks ... Datasets
- Risk, Compliance and Regulation ... Ethics ... Privacy ... Law ... AI Governance ... AI Verification and Validation
- Managed Vocabularies
- Excel ... Documents ... Database; Vector & Relational ... Graph ... LlamaIndex
- Analytics ... Visualization ... Graphical Tools ... Diagrams & Business Analysis ... Requirements ... Loop ... Bayes ... Network Pattern
- Development ... Notebooks ... AI Pair Programming ... Codeless ... Hugging Face ... AIOps/MLOps ... AIaaS/MLaaS
- Backpropagation ... FFNN ... Forward-Forward ... Activation Functions ...Softmax ... Loss ... Boosting ... Gradient Descent ... Hyperparameter ... Manifold Hypothesis ... PCA
- Strategy & Tactics ... Project Management ... Best Practices ... Checklists ... Project Check-in ... Evaluation ... Measures
- AI Solver ... Algorithms ... Administration ... Model Search ... Discriminative vs. Generative ... Train, Validate, and Test
- Artificial General Intelligence (AGI) to Singularity ... Curious Reasoning ... Emergence ... Moonshots ... Explainable AI ... Automated Learning
- The AI Hierarchy of Needs | Monica Rogati - Hackernoon
- Great Expectations ...helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
Sourcing Data
YouTube search... ...Google search
- Research methods for business students | M. Saunders, P. Lewis, and A. Thornhill
- 10 Data Sourcing Best Practices for Reporting | SolveXia
- Do’s and Do Not’s of Data Sourcing | Synthio
Data Cleaning
YouTube search... ...Google search
- Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman
- Machine learning for data cleaning and unification | Abizer Jafferjee - Towards Data Science
- Machine learning algorithms explained | Martin Heller - InfoWorld
- From Messy Files To Automated Analytics | Trifacta
- The Data Prep for AI Toolkit: Smarter ML Models Through Faster, More Accurate Data Prep | Paxata
- The Age of The Badass Analyst | Alteryx
When it comes to utilizing ML data, most of the time is spent on cleaning data sets or creating a dataset that is free of errors. Setting up a quality plan, filling in missing values, removing rows, and reducing data size are some of the best practices used for data cleaning in Machine Learning. Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas
Overall, incorrect data is either removed, corrected, or imputed... The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science
- Irrelevant data - are those that are not actually needed, and don’t fit under the context of the problem we’re trying to solve.
- Duplicates - are data points that are repeated in your dataset.
- Type conversion - Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on. Categorical values can be converted into and from numbers if needed.
- Syntax errors:
- Remove extra white spaces
- Pad strings - Strings can be padded with spaces or other characters to a certain width
- Fix typos - Strings can be entered in many different ways
- Standardize format
- Scaling / Transformation - scaling means to transform your data so that it fits within a specific scale, such as 0–100 or 0–1.
- Normalization - also rescales the values into a range of 0–1; the intention here is to transform the data so that it is normally distributed.
- Missing values:
- Drop - If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
- Impute - It means to calculate the missing value based on other observations.
- Flag
- Outliers - They are values that are significantly different from all other observations...they should not be removed unless there is a good reason for that.
- In-record & cross-datasets errors - result from having two or more values in the same row or across datasets that contradict each other.
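Several of the cleaning steps above can be sketched in a few lines of plain Python. The records and field names below are hypothetical, purely for illustration; real projects would typically use a library such as pandas for this.

```python
# Minimal data-cleaning sketch: trim whitespace, convert types,
# drop duplicates, and impute missing values with the column mean.
import statistics

raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob",     "age": None},
    {"name": " Alice ", "age": "34"},   # duplicate row
]

# Syntax errors / type conversion: strip extra white space,
# store ages as numbers instead of strings.
cleaned = [
    {"name": r["name"].strip(),
     "age": int(r["age"]) if r["age"] is not None else None}
    for r in raw
]

# Duplicates: drop repeated rows while preserving order.
seen, deduped = set(), []
for r in cleaned:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Missing values: impute with the mean of the observed values.
known = [r["age"] for r in deduped if r["age"] is not None]
mean_age = statistics.mean(known)
for r in deduped:
    if r["age"] is None:
        r["age"] = mean_age
```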
Data Encoding
YouTube search... ...Google search
- ...predict categories with classification
- Few Shot Learning
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.
- One is label encoding, which means that each text label value is replaced with a number.
- The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered. Machine learning algorithms explained | Martin Heller - InfoWorld
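The two encodings can be sketched in plain Python on a toy label set (the color labels are hypothetical). In practice a framework helper such as scikit-learn's LabelEncoder/OneHotEncoder or pandas.get_dummies would do this for you.

```python
labels = ["red", "green", "blue", "green"]
categories = sorted(set(labels))          # ['blue', 'green', 'red']

# Label encoding: each category becomes an integer. Note this
# imposes an artificial ordering (blue < green < red).
label_encoded = [categories.index(x) for x in labels]

# One-hot encoding: each category becomes its own binary column,
# so no ordering is implied.
one_hot = [[1 if x == c else 0 for c in categories] for x in labels]
```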
Data Augmentation, Data Labeling, and Auto-Tagging
YouTube search... ...Google search
- Data Augmentation | How to use Deep Learning when you have Limited Data | Bharath Raj
- Passenger Screening - How Data Augmentation helped to win
- Tools: Labelbox, Scale AI, Appen, Amazon SageMaker, GoogleAI, Microsoft Azure Machine Learning
- Data Augmentation as a best practice for addressing the Overfitting Challenge
- Scale training and validation data for AI applications. After receiving your data via an API call, the platform returns scalable, accurate ground-truth data through a combination of human work and review, smart tools, statistical confidence checks, and machine learning checks.
- Snorkel AI
- Cloud Factory
- Labelbox
- Scale AI
Data augmentation is the process of using the data you currently have and modifying it in a realistic but randomized way, to increase the variety of data seen during training. As an example for images, slightly rotating, zooming, and/or translating the image will result in the same content, but with a different framing. This is representative of the real-world scenario, so will improve the training. It's worth double-checking that the output of the data augmentation is still realistic. To determine what types of augmentation to use, and how much of it, do some trial and error. Try each augmentation type on a sample set, with a variety of settings (e.g. 1% translation, 5% translation, 10% translation) and see what performs best on the sample set. Once you know the best setting for each augmentation type, try adding them all at the same time. | Deep Learning Course Wiki
Note: In Keras, we can perform transformations using ImageDataGenerator.
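The idea behind these transformations is simple enough to sketch without a framework. Below, a toy 3x3 "image" (nested lists) is flipped and translated, two of the augmentations mentioned above; the function names are made up for this sketch, and Keras's ImageDataGenerator bundles equivalent operations.

```python
import random

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

def hflip(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def translate_right(img, fill=0):
    """Shift every row one pixel right, padding the left edge."""
    return [[fill] + row[:-1] for row in img]

def random_augment(img, rng):
    """Apply each transformation with 50% probability, so training
    sees a randomized variety of realistic framings."""
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = translate_right(img)
    return img

augmented = random_augment(image, random.Random(0))
```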
What does Data Augmentation mean? | Techopedia
Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets of an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, and product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual intervention required to develop meaningful information and insight from business data, as well as significantly enhance data quality.
Data augmentation is one of the last steps done in enterprise data management, after monitoring, profiling, and integration. Some of the common techniques used in data augmentation include:
- Extrapolation Technique: Based on heuristics. The relevant fields are updated or provided with values.
- Tagging Technique: Common records are tagged to a group, making it easier to understand and differentiate for the group.
- Aggregation Technique: Using mathematical values of averages and means, values are estimated for relevant fields if needed.
- Probability Technique: Based on heuristics and analytical statistics, values are populated based on the probability of events.
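The Aggregation Technique above can be sketched as filling a missing field from the mean of the other records in the same group. The sales records and field names here are hypothetical.

```python
import statistics
from collections import defaultdict

records = [
    {"region": "east", "sales": 100},
    {"region": "east", "sales": 140},
    {"region": "east", "sales": None},   # field to be augmented
    {"region": "west", "sales": 90},
]

# Collect the known values per region.
by_region = defaultdict(list)
for r in records:
    if r["sales"] is not None:
        by_region[r["region"]].append(r["sales"])

# Estimate each missing field from its region's mean.
for r in records:
    if r["sales"] is None:
        r["sales"] = statistics.mean(by_region[r["region"]])
```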
Data Labeling
YouTube search... ...Google search
- Essential tips for scaling quality AI data labeling | Damian Rochman - VentureBeat
- Four Mistakes You Make When Labeling Data | Tal Perry Towards Data Science
- Building vs. Buying a training data annotation solution | Labelbox
- Data Labeling: Creating Ground Truth | Astasia Myers - Medium
- Tools/Services:
Labeling typically takes a set of unlabeled data and augments each piece with meaningful, informative tags. Wikipedia
Automation has put low-skill jobs at risk for decades, and self-driving cars, robots, and Speech Recognition will continue the trend. But some experts also see new opportunities in the automated age: "...the curation of data, where you take raw data and you clean it up and you have to kind of organize it for machines to ingest." Is 'data labeling' the new blue-collar job of the AI era? | Hope Reese - TechRepublic
- 7 Ways to Get High-Quality Labeled Training Data at Low Cost | James Kobielus - KDnuggets
- How to Organize Data Labeling for Machine Learning: Approaches and Tools | AltexSoft KDnuggets
Auto-tagging
YouTube search... ...Google search
- ...predict categories (classification)
- Natural Language Processing (NLP)#Summarization / Paraphrasing
- Natural Language Tools & Services for Text labeling
- Image and video labeling:
- Annotorious the MIT-licensed free web image annotation and labeling tool. It allows for adding text comments and drawings to images on a website. The tool can be easily integrated with only two lines of additional code.
- OpenSeadragon An open-source, web-based viewer for high-resolution zoomable images, implemented in pure JavaScript, for desktop and mobile.
- LabelMe open online tool that assists users in building image databases for computer vision research, its developers note. Users can also download the MATLAB toolbox that is designed for working with images in the LabelMe public dataset.
- Sloth allows users to label image and video files for computer vision research. Face recognition is one of Sloth’s common use cases.
- Object Tagging Tool (VoTT) labeling is one of the model development stages that VoTT supports. This tool also allows data scientists to train and validate object detection models.
- Labelbox build computer vision products for the real world. A complete solution for your training data problem with fast labeling tools, human workforce, data management, a powerful API and automation features.
- Alp’s Labeling Tool macro code allows easy labeling of images, and creates text files compatible with Detectnet / KITTI dataset format.
- imglab graphical tool for annotating images with object bounding boxes and optionally their part locations. Generally, you use it when you want to train an object detector (e.g. a face detector) since it allows you to easily create the needed training dataset.
- VGG Image Annotator (VIA) simple and standalone manual annotation software for image, audio and video
- Demon image annotation plugin allows you to add textual annotations to images by selecting a region of the image and then attaching a textual description, i.e. annotating images with user comments. Integrates with JQuery Image Annotation.
- FastAnnotationTool (FIAT) enables image data annotation, data augmentation, data extraction, and result visualisation/validation.
- RectLabel an image annotation tool to label images for bounding box object detection and segmentation.
- Audio labeling:
- Snorkel a Python library to help you label data for supervised machine learning tasks
- Praat free software for labeling audio files, mark timepoints of events in the audio file and annotate these events with text labels in a lightweight and portable TextGrid file.
- Speechalyzer a tool for the daily work of a 'speech worker'. It is optimized to process large speech data sets with respect to transcription, labeling and annotation.
- EchoML tool for audio file annotation. It allows users to visualize their data.
Synthetic Labeling
This approach entails generating data that imitates real data in terms of essential parameters set by a user. Synthetic data is produced by a generative model that is trained and validated on an original dataset. There are three types of generative models: (1) Generative Adversarial Network (GAN); generative/discriminative, (2) Autoregressive models (ARs); previous values, and (3) Variational Autoencoder (VAE); encoding/decoding.
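A crude sketch of the idea, without any of the three model families: fit a per-feature Gaussian to a small original dataset (the "essential parameters"), then sample synthetic rows from it. The dataset below is made up for illustration; real synthetic labeling would train a GAN, autoregressive model, or VAE instead, but the contract is the same: learn the data's parameters, then generate.

```python
import random
import statistics

original = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.5, 9.0]]

# "Train": estimate mean and standard deviation of each feature column.
columns = list(zip(*original))
params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]

# "Generate": draw synthetic rows from the fitted distributions.
rng = random.Random(42)
synthetic = [[rng.gauss(mu, sigma) for mu, sigma in params]
             for _ in range(5)]
```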
Batch Norm(alization) & Standardization
YouTube search... ...Google search
To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges may tend to dominate the Euclidean distance between feature vectors, their effects can be magnified at the expense of the other fields, and the steepest descent optimization may have difficulty converging. There are a number of ways to normalize and standardize data for ML, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling. Machine learning algorithms explained | Martin Heller - InfoWorld
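Two of the feature-scaling methods just named can be sketched on a toy feature column; the values are hypothetical.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0]

# Min-max normalization: rescale into the range 0-1.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): zero mean, unit standard deviation.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
z_scores = [(v - mu) / sigma for v in values]
```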
When feeding data into a machine learning model, the data should usually be "normalized". This means scaling the data so that it has a mean and standard deviation within "reasonable" limits. This is to ensure the objective functions in the machine learning model will work as expected and not focus on a specific feature of the input data. Without normalizing inputs the model may be extremely fragile. Batch normalization is an extension of this concept. Instead of just normalizing the data at the input to the neural network, batch normalization adds layers to allow normalization to occur at the input to each convolutional layer. | Deep Learning Course Wiki
Batch Norm is a normalization method that normalizes the activations in a network across the mini-batch. For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.
The benefits of using batch normalization (batch norm) are:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization
Batch normalization has two elements:
- Normalize the inputs to the layer. This is the same as regular feature scaling or input normalization.
- Add two more trainable parameters: one for a scale and one for an offset that apply to each of the activations. By adding these parameters, the normalization can effectively be completely undone, using the scale and offset. This allows the backpropagation process to completely ignore the batch normalization layer if it wants to.
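The two elements above amount to only a few lines of arithmetic for one feature across a mini-batch. The batch values are hypothetical; gamma (scale) and beta (offset) are the two trainable parameters, fixed here at the identity since there is no training loop.

```python
import statistics

batch = [2.0, 4.0, 6.0, 8.0]    # one feature across the mini-batch
eps = 1e-5                      # small constant for numerical stability

# Element 1: normalize to zero mean, unit variance.
mu = statistics.mean(batch)
var = statistics.pvariance(batch)
normalized = [(x - mu) / (var + eps) ** 0.5 for x in batch]

# Element 2: apply the trainable scale and offset. Setting
# gamma = sqrt(var) and beta = mu would undo the normalization.
gamma, beta = 1.0, 0.0
out = [gamma * x + beta for x in normalized]
```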
Good practices for addressing Overfitting Challenge:
- add more data
- use Data Augmentation
- use Batch Normalization
- use architectures that generalize well
- reduce architecture complexity
- add Regularization
- L1 and L2 Regularization - update the general cost function by adding another term known as the regularization term.
- Dropout - at every iteration, it randomly selects some nodes and temporarily removes the nodes (along with all of their incoming and outgoing connections)
- Data Augmentation
- Early Stopping
Data Completeness
YouTube search... ...Google search
Zero Padding
YouTube search... ...Google search
- Pooling / Sub-sampling: Max, Mean
- Softmax
- Dimensional Reduction Algorithms
- (Deep) Convolutional Neural Network (DCNN/CNN)
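Zero padding, as used by the convolutional networks linked above, surrounds the input with a border of zeros so that a convolution preserves spatial size. A minimal sketch on a toy 2x2 "image" (the helper name is made up for this example):

```python
def zero_pad(img, pad=1):
    """Return img surrounded by a border of `pad` zeros on all sides."""
    width = len(img[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]            # top rows
    padded += [[0] * pad + row + [0] * pad for row in img]  # middle
    padded += [[0] * width for _ in range(pad)]           # bottom rows
    return padded

image = [[1, 2],
         [3, 4]]
padded = zero_pad(image)
```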
Imbalanced Data
YouTube search... ...Google search
What is imbalanced data? The definition of imbalanced data is straightforward. A dataset is imbalanced if at least one of the classes constitutes only a very small minority. Imbalanced data prevail in banking, insurance, engineering, and many other fields. It is common in fraud detection that the imbalance is on the order of 100 to 1. ... The issue of class imbalance can result in a serious bias towards the majority class, reducing the classification performance and increasing the number of false negatives. How can we alleviate the issue? The most commonly used techniques are data resampling either under-sampling the majority of the class, or over-sampling the minority class, or a mix of both. | Dataman
Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exist for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. | Natalie Hockham
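The two resampling strategies mentioned above can be sketched in plain Python on a made-up 100-to-5 binary dataset: random under-sampling shrinks the majority class, random over-sampling (with replacement) grows the minority class.

```python
import random

rng = random.Random(0)
majority = [("x%d" % i, 0) for i in range(100)]   # class 0: 100 samples
minority = [("y%d" % i, 1) for i in range(5)]     # class 1: 5 samples

# Under-sampling: randomly keep only as many majority samples
# as there are minority samples.
under = rng.sample(majority, len(minority)) + minority

# Over-sampling: randomly duplicate minority samples (with
# replacement) until the classes are balanced.
over = majority + rng.choices(minority, k=len(majority))
```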
Under-sampling
Over-sampling
Skewed Data
YouTube search... "Skewed+Data"+artificial+intelligence ...Google search
Data Consistency
YouTube search... ...Google search
- Consistent Data Is Key to Ai Process Optimization | DataProfit
- Dirty Data: Data Consistency | Outlier
- Improving Data Quality: Consistency and Accuracy | G.Cong, W. Fan, F. Geerts, X. Jia, and S. Ma