Difference between revisions of "Data Quality"
Revision as of 15:29, 19 September 2020
- AI Governance
- Hyperparameters
- Automated Machine Learning (AML) - AutoML
- Visualization
- Evaluation
- Great Expectations ...helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Data Cleaning
- Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman
- Data Preprocessing
- Machine learning for data cleaning and unification | Abizer Jafferjee - Towards Data Science
- Machine learning algorithms explained | Martin Heller - InfoWorld
In ML projects, most of the time is spent cleaning data sets, that is, producing a dataset that is free of errors. Setting up a quality plan, filling in missing values, removing rows, and reducing data size are among the best practices for data cleaning in machine learning. Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas
Overall, incorrect data is either removed, corrected, or imputed... The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science
- Irrelevant data - data that is not actually needed and does not fit the context of the problem we're trying to solve.
- Duplicates - data points that are repeated in your dataset.
- Type conversion - Make sure numbers are stored as numerical data types. A date should be stored as a date object or as a Unix timestamp (number of seconds), and so on. Categorical values can be converted to and from numbers if needed.
- Syntax errors:
  - Remove extra white space
  - Pad strings - strings can be padded with spaces or other characters to a certain width
  - Fix typos - the same value can be entered in many different ways, so map the variants to one spelling
- Standardize format
- Scaling / Transformation - scaling transforms your data so that it fits within a specific range, such as 0–100 or 0–1.
- Normalization - also rescales the values, but the intention is to transform the data so that it is approximately normally distributed. A linear rescale to 0–1 does not change the shape of a distribution, so a nonlinear transformation (for example, a logarithm for right-skewed data) is typically needed.
- Missing values:
  - Drop - If missing values in a column are rare and occur at random, the simplest and most straightforward solution is to drop the observations (rows) that have missing values.
  - Impute - calculate the missing value from other observations.
  - Flag - record that the value was missing, since the fact that it is missing can itself be informative.
- Outliers - values that are significantly different from all other observations. They should not be removed unless there is a good reason to do so.
- In-record & cross-dataset errors - result from two or more values, in the same row or across datasets, that contradict each other.
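
The first items in the list above (irrelevant data, duplicates, type conversion, and syntax errors) can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed method: the records, the field names (`id`, `name`, `joined`, `notes`, `code`), and the helper names `clean_row` and `deduplicate` are all hypothetical.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"id": "7", "name": "  Alice   Smith ", "joined": "2020-09-19", "notes": "n/a"},
    {"id": "7", "name": "alice smith", "joined": "2020-09-19", "notes": "dup"},
    {"id": "12", "name": "Bob  Jones", "joined": "2020-01-05", "notes": ""},
]


def clean_row(row):
    """Return a cleaned copy of one record (hypothetical helper)."""
    # Remove extra white space: split() with no arguments collapses runs of
    # whitespace and drops leading/trailing blanks. title() then gives the
    # name a single, consistent capitalisation.
    name = " ".join(row["name"].split()).title()
    return {
        # Type conversion: numbers as int, dates as date objects.
        "id": int(row["id"]),
        "name": name,
        "joined": datetime.strptime(row["joined"], "%Y-%m-%d").date(),
        # Pad strings: zero-pad the id to a fixed width of 5 characters.
        "code": row["id"].strip().zfill(5),
    }


def deduplicate(rows, key_fields):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Irrelevant data: the free-text "notes" column is simply not carried over.
cleaned = deduplicate([clean_row(r) for r in raw_rows], key_fields=("id", "name"))
```

Deduplication runs after cleaning on purpose: the two "Alice Smith" rows only compare equal once whitespace and capitalisation have been made consistent.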
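
To make the scaling item concrete, here is a small sketch of min-max scaling to a chosen range, shown next to z-score standardization for comparison. The function names `min_max_scale` and `standardize` and the sample values are assumptions made for illustration. Both transformations are linear, so neither changes the shape of the distribution.

```python
from statistics import mean, pstdev


def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale values so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread; map everything to lo.
        return [lo for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [18, 22, 30, 45, 60]             # made-up sample
scaled = min_max_scale(ages)            # every value now lies in [0, 1]
percent = min_max_scale(ages, 0, 100)   # same data on a 0–100 scale
z_scores = standardize(ages)            # mean 0, standard deviation 1
```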
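
The last three items (missing values, outliers, and in-record errors) might look like the sketch below. Everything here is an assumption chosen for illustration: the records, the field names, the median as the imputation rule, the 1.5 × IQR fence as the outlier rule, and the one-year tolerance in the consistency check. Outliers and conflicts are only reported, not deleted, in line with the advice above that they should not be removed without a good reason.

```python
from statistics import median, quantiles

# Made-up records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "born": 1986, "surveyed": 2020},
    {"age": None, "income": 48000, "born": 1990, "surveyed": 2020},
    {"age": 29, "income": None, "born": 1991, "surveyed": 2020},
    {"age": 41, "income": 51000, "born": 1979, "surveyed": 2020},
    {"age": 38, "income": 950000, "born": 1982, "surveyed": 2020},
    {"age": 25, "income": 47000, "born": 1970, "surveyed": 2020},
]


def drop_missing(rows, fields):
    """Drop: keep only rows where every listed field is present."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


def impute_median(rows, field):
    """Impute and flag: fill gaps with the observed median and record that we did."""
    fill = median(r[field] for r in rows if r[field] is not None)
    result = []
    for r in rows:
        was_missing = r[field] is None
        new_row = dict(r)  # copy, so the input rows are left untouched
        new_row[field] = fill if was_missing else r[field]
        new_row[field + "_was_missing"] = was_missing
        result.append(new_row)
    return result


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def age_conflicts(rows, tolerance=1):
    """In-record check: the reported age should match surveyed - born."""
    return [
        r for r in rows
        if r["age"] is not None
        and abs((r["surveyed"] - r["born"]) - r["age"]) > tolerance
    ]


complete = drop_missing(rows, ["age", "income"])
imputed = impute_median(rows, "age")
suspects = iqr_outliers([r["income"] for r in rows if r["income"] is not None])
conflicts = age_conflicts(rows)
```

The consistency check runs on the original rows rather than the imputed ones, because an imputed age would not be expected to agree with the birth year.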