Difference between revisions of "Data Quality"
<youtube>9yl6-HEY7_s</youtube>
<b>Machine Learning Tutorial [[Python]] - 6: Dummy Variables & One Hot Encoding
</b><br>Machine learning models work very well for datasets containing only numbers. But how do we handle text information in a dataset? A simple approach is to use integer or label encoding, but when categorical variables are nominal, simple label encoding can be problematic. One hot encoding is a technique that can help in this situation. In this tutorial, we will use the pandas get_dummies method to create dummy variables that allow us to perform one hot encoding on a given dataset. Alternatively, we can use sklearn.preprocessing OneHotEncoder to create dummy variables. How to handle text data in a machine learning model? Nominal vs Ordinal Variables; Theory (explain one hot encoding using home prices in different townships); Coding (start); pandas get_dummies method; Create a model that uses dummy columns; Label Encoder fit_transform() method; sklearn OneHotEncoder; Exercise (predict car prices based on car model, age, mileage)

<youtube>EgtlklP_mwU</youtube>
<b>How To Encode Categorical Data in a CSV Dataset | [[Python]] | Machine Learning
</b><br>Machine learning models deal with mathematical equations and numbers, so categorical data, which is mostly strings, needs to be encoded as numbers. In this video, we discuss what one-hot encoding is, how this encoding is used in machine learning and artificial neural networks, and what is meant by having one-hot encoded vectors as labels for our input data. Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources. Collective Intelligence and the DEEPLIZARD HIVEMIND

<youtube>EQ7z6LsDe0E</youtube>
<b>How to implement One Hot Encoding on Categorical Data | Dummy Encoding | Machine Learning | [[Python]]
</b><br>Label encoding encodes categories to numbers in a dataset, which might lead to comparisons between the data; to avoid that, we use one hot encoding.

<b>PyCon.DE 2017 Hendrik Niemeyer - Synthetic Data for Machine Learning Applications
</b><br>Dr. Hendrik Niemeyer (@hniemeye), Data Scientist working on predictive analytics with data from pipeline inspection measurements. Tags: data-science [[Python]] machine learning ai. In this talk I will show how we use real and synthetic data to create successful models for risk-assessing pipeline anomalies. The main focus is the estimation of the difference in the statistical properties of real and generated data by machine learning methods. ROSEN provides predictive analytics for pipelines by detecting and risk-assessing anomalies from data gathered by inline inspection measurement devices. Due to budget reasons (pipelines need to be dug up to get access), ground truth data for machine learning applications in this field is usually scarce, imbalanced, and not available for all existing configurations of measurement devices. This creates the need for synthetic data (using FEM simulations and unsupervised learning algorithms) in order to be able to create successful models. But a naive mixture of real-world and synthetic samples in a model does not necessarily yield increased predictive performance, because of differences in the statistical distributions in feature space. I will show how we evaluate the use of synthetic data beyond simple visual inspection. Manifold learning (e.g. t-SNE) can be used to gain insight into whether real and generated data are inherently different. Quantitative approaches like classifiers trained to discriminate between these types of data provide a non-visual insight into whether a "synthetic gap" in the feature distributions exists. If the synthetic data is useful for model building, careful considerations have to be applied when constructing cross-validation folds and test sets to prevent biased estimates of the model performance. Recorded at PyCon.DE 2017 Karlsruhe: pycon.de Video editing: Sebastian Neubauer & Andrei Dan Tools: Blender, Avidemux & Sonic Pi

<youtube>DQC_YE3I5ig</youtube>
<b>Machine Learning - Over- & Undersampling - [[Python]]/ Scikit/ Scikit-Imblearn
</b><br>In this video I will explain how to use over- and undersampling in machine learning using [[Python]], scikit and scikit-imblearn. The concepts shown in this video will show you what over- and undersampling is and how to correctly use it, even when cross-validating. So let's go!

<youtube>Ti8SbfFecuc</youtube>
<b>Undersampling for Handling Imbalanced Datasets | [[Python]] | Machine Learning
</b><br>Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, as we have enough examples for all the different cases. For example, in learning a spam filter, we should have a good amount of data corresponding to emails which are spam and non-spam. This even distribution is not always possible. I'll discuss one of the techniques known as undersampling that helps us tackle this issue. Undersampling is one of the techniques used for handling class imbalance. In this technique, we under-sample the majority class to match the minority class.

<youtube>YMPMZmlH5Bo</youtube>
<b>Tutorial 45 - Handling Imbalanced Datasets using [[Python]] - Part 1
</b><br>Machine Learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced dataset, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

<youtube>YCwRd-N3D14</youtube>
<b>Log Transformation for Outliers | Convert Skewed Data to Normal Distribution
</b><br>This video titled "Log Transformation for Outliers | Convert Skewed Data to Normal Distribution" explains how to use log transformation for treating outliers, as well as for converting positively skewed data to a normal distribution. A code example in [[Python]] is also covered in the video. This is a machine learning & deep learning Bootcamp series on data science. You will also get some flavor of data engineering in this Bootcamp series. Through this series, you will be able to learn each aspect of the data science lifecycle, right from collecting data from disparate data sources and data preprocessing to visualization as well as model deployment in production. You will also see how to perform data preprocessing and build regression, classification, and clustering models, as well as recurrent neural networks, convolutional neural networks, autoencoders, etc. Content & Playlist will be updated regularly to add videos with new topics.
Revision as of 21:17, 20 September 2020
YouTube search... Quora search... ...Google search
- AI Governance
- Data Science / Data Governance
- Benchmarks
- Data Preprocessing
- Feature Exploration/Learning ...inspection, data profiling, selection
- Data Quality ...validity, accuracy, cleaning, completeness, consistency, uniformity, encoding, padding, augmentation, labeling, auto-tagging, normalization, standardization, and imbalanced data
- Bias and Variances
- Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog
- Privacy in Data Science
- Data Interoperability
- Excel - Data Analysis
- Data Science / Data Governance
- Visualization
- Hyperparameters
- Evaluation
- Train, Validate, and Test
- Automated Machine Learning (AML) - AutoML
- Great Expectations ...helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
Data Cleaning
YouTube search... ...Google search
- Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman
- Machine learning for data cleaning and unification | Abizer Jafferjee - Towards Data Science
- Machine learning algorithms explained | Martin Heller - InfoWorld
When it comes to utilizing ML data, most of the time is spent on cleaning data sets or creating a dataset that is free of errors. Setting up a quality plan, filling missing values, removing rows, and reducing data size are some of the best practices used for data cleaning in Machine Learning. Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas
Overall, incorrect data is either removed, corrected, or imputed... The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science
- Irrelevant data - are those that are not actually needed, and don’t fit under the context of the problem we’re trying to solve.
- Duplicates - are data points that are repeated in your dataset.
- Type conversion - Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on. Categorical values can be converted into and from numbers if needed.
- Syntax errors:
- Remove extra white spaces
- Pad strings - Strings can be padded with spaces or other characters to a certain width
- Fix typos - Strings can be entered in many different ways
- Standardize format
- Scaling / Transformation - scaling means to transform your data so that it fits within a specific scale, such as 0–100 or 0–1.
- Normalization - also rescales the values into a range of 0–1, the intention here is to transform the data so that it is normally distributed.
- Missing values:
- Drop - If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop observations (rows) that have missing values.
- Impute - It means to calculate the missing value based on other observations.
- Flag
- Outliers - They are values that are significantly different from all other observations...they should not be removed unless there is a good reason for that.
- In-record & cross-datasets errors - result from having two or more values in the same row or across datasets that contradict each other.
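Several of the fixes above can be sketched in a few lines of pandas. The dataset and column names below are made up for illustration; a real pipeline would tune each rule to its own data.

```python
import pandas as pd

# Hypothetical raw data with duplicates, numbers stored as strings,
# stray whitespace, inconsistent casing, and a missing value.
df = pd.DataFrame({
    "city":  [" New York", "new york ", "Boston", "Boston", None],
    "price": ["100", "100", "250", "250", "175"],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df["price"] = pd.to_numeric(df["price"])         # type conversion: strings -> numbers
df["city"] = df["city"].str.strip().str.title()  # trim whitespace, standardize format
df["city"] = df["city"].fillna("Unknown")        # flag missing values explicitly
df = df.drop_duplicates()                        # rows may only match after standardizing

print(df)
```

Note the second `drop_duplicates()`: " New York" and "new york " only become identical after whitespace and casing are standardized, which is why cleaning order matters.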
Data Encoding
YouTube search... ...Google search
- ...predict categories with classification
- Few Shot Learning
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.
- One is label encoding, which means that each text label value is replaced with a number.
- The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered. Machine learning algorithms explained | Martin Heller - InfoWorld
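As a concrete sketch of the two encodings (the town names below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"town": ["monroe", "west windsor", "robbinsville", "monroe"]})

# Label encoding: each category becomes an integer. For nominal data this
# can mislead an algorithm into treating the codes as ordered.
df["town_label"] = df["town"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order.
dummies = pd.get_dummies(df["town"], prefix="town")
print(dummies)
```

sklearn.preprocessing.OneHotEncoder produces equivalent columns and integrates with sklearn pipelines; get_dummies is the quickest route when working directly in pandas.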
Data Augmentation, Data Labeling, and Auto-Tagging
Youtube search... ...Google search
- Data Augmentation | How to use Deep Learning when you have Limited Data | Bharath Raj
- Passenger Screening - How Data Augmentation helped to win
- Tools: Scale, Labelbox, FigureEight, Amazon SageMaker, GoogleAI, Microsoft Azure Machine Learning
- Data Augmentation as a best practice for addressing the Overfitting Challenge
- Scale training and validation data for AI applications. After sending us your data via API call, our platform through a combination of human work and review, smart tools, statistical confidence checks and machine learning checks returns scalable, accurate ground truth data.
Data augmentation is the process of using the data you currently have and modifying it in a realistic but randomized way, to increase the variety of data seen during training. As an example for images, slightly rotating, zooming, and/or translating the image will result in the same content, but with a different framing. This is representative of the real-world scenario, so will improve the training. It's worth double-checking that the output of the data augmentation is still realistic. To determine what types of augmentation to use, and how much of it, do some trial and error. Try each augmentation type on a sample set, with a variety of settings (e.g. 1% translation, 5% translation, 10% translation) and see what performs best on the sample set. Once you know the best setting for each augmentation type, try adding them all at the same time. | Deep Learning Course Wiki
Note: In Keras, we can perform transformations using ImageDataGenerator.
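Keras's ImageDataGenerator applies such transformations on the fly during training. As a library-free sketch of the idea, two of the augmentations above can be written directly in NumPy; the flip probability and shift range here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Randomly flip and translate an image array (rotation/zoom omitted)."""
    if rng.random() < 0.5:
        image = np.fliplr(image)           # horizontal flip
    shift = int(rng.integers(-2, 3))       # small horizontal translation
    return np.roll(image, shift, axis=1)

img = np.arange(16).reshape(4, 4)
out = augment(img)
print(out)
```

Both operations keep the image content intact while changing its framing, which is the property that makes the augmented samples realistic.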
What does Data Augmentation mean? | Techopedia
Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, and product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual intervention required to develop meaningful information and insight from business data, as well as significantly enhance data quality.
Data augmentation is one of the last steps done in enterprise data management, after monitoring, profiling and integration. Some of the common techniques used in data augmentation include:
- Extrapolation Technique: Based on heuristics. The relevant fields are updated or provided with values.
- Tagging Technique: Common records are tagged to a group, making it easier to understand and differentiate for the group.
- Aggregation Technique: Using mathematical values of averages and means, values are estimated for relevant fields if needed.
- Probability Technique: Based on heuristics and analytical statistics, values are populated based on the probability of events.
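The aggregation technique, for instance, can be sketched in pandas by filling a missing field from its group's mean (the sales data below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales":  [100.0, None, 80.0, 120.0],
})

# Aggregation technique: estimate the missing value from its region's mean.
df["sales"] = df.groupby("region")["sales"].transform(lambda s: s.fillna(s.mean()))
print(df)
```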
Data Labeling
Youtube search... ...Google search
- Essential tips for scaling quality AI data labeling | Damian Rochman - VentureBeat
- Four Mistakes You Make When Labeling Data | Tal Perry Towards Data Science
- Building vs. Buying a training data annotation solution | Labelbox
- Data Labeling: Creating Ground Truth | Astasia Myers - Medium
- Tools/Services:
Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. Wikipedia
Automation has put low-skill jobs at risk for decades. And self-driving cars, robots, and speech recognition will continue the trend. But some experts also see new opportunities in the automated age: ...the curation of data, where you take raw data and you clean it up and you have to kind of organize it for machines to ingest. Is 'data labeling' the new blue-collar job of the AI era? | Hope Reese - TechRepublic
- 7 Ways to Get High-Quality Labeled Training Data at Low Cost | James Kobielus - KDnuggets
- How to Organize Data Labeling for Machine Learning: Approaches and Tools | AltexSoft KDnuggets
Auto-tagging
Youtube search... ...Google search
- ...predict categories (classification)
- Natural Language Processing (NLP)#Summarization / Paraphrasing
- Natural Language Tools & Services for Text labeling
- Image and video labeling:
- Annotorious the MIT-licensed free web image annotation and labeling tool. It allows for adding text comments and drawings to images on a website. The tool can be easily integrated with only two lines of additional code.
- OpenSeadragon An open-source, web-based viewer for high-resolution zoomable images, implemented in pure JavaScript, for desktop and mobile.
- LabelMe open online tool. Software must assist users in building image databases for computer vision research, its developers note. Users can also download the MATLAB toolbox that is designed for working with images in the LabelMe public dataset.
- Sloth allows users to label image and video files for computer vision research. Face recognition is one of Sloth’s common use cases.
- Object Tagging Tool (VoTT) labeling is one of the model development stages that VoTT supports. This tool also allows data scientists to train and validate object detection models.
- Labelbox build computer vision products for the real world. A complete solution for your training data problem with fast labeling tools, human workforce, data management, a powerful API and automation features.
- Alp’s Labeling Tool macro code allows easy labeling of images, and creates text files compatible with Detectnet / KITTI dataset format.
- imglab graphical tool for annotating images with object bounding boxes and optionally their part locations. Generally, you use it when you want to train an object detector (e.g. a face detector) since it allows you to easily create the needed training dataset.
- VGG Image Annotator (VIA) simple and standalone manual annotation software for image, audio and video
- Demon image annotation plugin allows you to add textual annotations to images by selecting a region of the image and then attaching a textual description: the concept of annotating images with user comments. Integration with JQuery Image Annotation
- FastAnnotationTool (FIAT) enables image data annotation, data augmentation, data extraction, and result visualisation/validation.
- RectLabel an image annotation tool to label images for bounding box object detection and segmentation.
- Audio labeling:
- Praat free software for labeling audio files, mark timepoints of events in the audio file and annotate these events with text labels in a lightweight and portable TextGrid file.
- Speechalyzer a tool for the daily work of a 'speech worker'. It is optimized to process large speech data sets with respect to transcription, labeling and annotation.
- EchoML tool for audio file annotation. It allows users to visualize their data.
Synthetic Labeling
This approach entails generating data that imitates real data in terms of essential parameters set by a user. Synthetic data is produced by a generative model that is trained and validated on an original dataset. There are three types of generative models: (1) Generative Adversarial Network (GAN); generative/discriminative, (2) Autoregressive models (ARs); previous values, and (3) Variational Autoencoder (VAE); encoding/decoding.
Batch Norm(alization) & Standardization
Youtube search... ...Google search
To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges may tend to dominate the Euclidean distance between feature vectors, their effects can be magnified at the expense of the other fields, and the steepest descent optimization may have difficulty converging. There are a number of ways to normalize and standardize data for ML, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling. Machine learning algorithms explained | Martin Heller - InfoWorld
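The two most common feature-scaling variants mentioned above, sketched in NumPy on a toy vector:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Min-max normalization: rescale into the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: shift to zero mean, scale to unit variance.
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

scikit-learn packages the same arithmetic as MinMaxScaler and StandardScaler, which also remember the training-set statistics so the identical transform can be applied to new data.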
When feeding data into a machine learning model, the data should usually be "normalized". This means scaling the data so that it has a mean and standard deviation within "reasonable" limits. This is to ensure the objective functions in the machine learning model will work as expected and not focus on a specific feature of the input data. Without normalizing inputs the model may be extremely fragile. Batch normalization is an extension of this concept. Instead of just normalizing the data at the input to the neural network, batch normalization adds layers to allow normalization to occur at the input to each convolutional layer. | Deep Learning Course Wiki
Batch Norm is a normalization method that normalizes Activation Functions in a network across the mini-batch. For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.
The benefits of using batch normalization (batch norm) are:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization
Batch normalization has two elements:
- Normalize the inputs to the layer. This is the same as regular feature scaling or input normalization.
- Add two more trainable parameters. One for a gradient and one for an offset that apply to each of the activations. By adding these parameters, the normalization can effectively be completely undone, using the gradient and offset. This allows the backpropagation process to completely ignore the batch normalization layer if it wants to.
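The two elements above can be sketched as a forward pass over one mini-batch. The names gamma (scale) and beta (offset) are conventional choices for the two trainable parameters, used here for illustration:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # element 1: normalize the inputs
    return gamma * x_hat + beta               # element 2: trainable scale/offset

batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
out = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out)
```

With gamma = 1 and beta = 0 the output is simply the normalized activations; during training the network learns values of gamma and beta that may partially or completely undo the normalization.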
Good practices for addressing Overfitting Challenge:
- add more data
- use Data Augmentation
- use Batch Normalization
- use architectures that generalize well
- reduce architecture complexity
- add Regularization
- L1 and L2 Regularization - update the general cost function by adding another term known as the regularization term.
- Dropout - at every iteration, it randomly selects some nodes and temporarily removes the nodes (along with all of their incoming and outgoing connections)
- Data Augmentation
- Early Stopping
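The L1/L2 item above amounts to adding a penalty term to the cost function. A minimal sketch with made-up numbers and an arbitrary lambda:

```python
import numpy as np

def l2_cost(y_true, y_pred, weights, lam=0.01):
    """Mean squared error plus an L2 penalty that discourages large weights."""
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + lam * np.sum(weights ** 2)   # regularization term

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.5, 1.5])
weights = np.array([3.0, -4.0])
print(l2_cost(y_true, y_pred, weights))
```

L1 regularization replaces the squared-weight sum with a sum of absolute values, which tends to drive some weights exactly to zero.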
Zero Padding
Youtube search... ...Google search
- Pooling / Sub-sampling: Max, Mean
- Softmax
- Dimensional Reduction Algorithms
- (Deep) Convolutional Neural Network (DCNN/CNN)
Imbalanced Data
Youtube search... ...Google search
What is imbalanced data? The definition of imbalanced data is straightforward. A dataset is imbalanced if at least one of the classes constitutes only a very small minority. Imbalanced data prevail in banking, insurance, engineering, and many other fields. It is common in fraud detection that the imbalance is on the order of 100 to 1. ... The issue of class imbalance can result in a serious bias towards the majority class, reducing the classification performance and increasing the number of false negatives. How can we alleviate the issue? The most commonly used techniques are data resampling: either under-sampling the majority class, over-sampling the minority class, or a mix of both. | Dataman
Classification algorithms tend to perform poorly when data is skewed towards one class, as is often the case when tackling real-world problems such as fraud detection or medical diagnosis. A range of methods exist for addressing this problem, including re-sampling, one-class learning and cost-sensitive learning. | Natalie Hockham
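Random under-sampling can be sketched directly in pandas on a toy fraud dataset (libraries such as imbalanced-learn provide a ready-made RandomUnderSampler for the same job):

```python
import pandas as pd

# Toy imbalanced dataset: 8 legitimate transactions vs 2 fraudulent ones.
df = pd.DataFrame({
    "amount": range(10),
    "label":  ["legit"] * 8 + ["fraud"] * 2,
})

majority = df[df["label"] == "legit"]
minority = df[df["label"] == "fraud"]

# Under-sample the majority class down to the minority class size.
balanced = pd.concat([majority.sample(n=len(minority), random_state=0), minority])
print(balanced["label"].value_counts())
```

Over-sampling is the mirror image: sample the minority class with replacement up to the majority size. Either way, resampling must happen inside each cross-validation fold, never before the split, or the performance estimate will be biased.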
Under-sampling
Over-sampling
Skewed Data
Youtube search... ...Google search