Difference between revisions of "Data Quality"
Revision as of 15:34, 19 September 2020
- AI Governance
- Hyperparameters
- Automated Machine Learning (AML) - AutoML
- Visualization
- Evaluation
- Great Expectations ...helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
Data Cleaning
- Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman
- Data Preprocessing
- Machine learning for data cleaning and unification | Abizer Jafferjee - Towards Data Science
- Machine learning algorithms explained | Martin Heller - InfoWorld
When working with ML data, most of the time is spent cleaning data sets or creating a dataset that is free of errors. Setting up a quality plan, filling in missing values, removing rows, and reducing data size are some of the best practices for data cleaning in machine learning. Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas
Overall, incorrect data is either removed, corrected, or imputed... The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science
- Irrelevant data - data that is not actually needed and does not fit the context of the problem we're trying to solve.
- Duplicates - data points that are repeated in your dataset.
- Type conversion - Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on. Categorical values can be converted to and from numbers if needed.
- Syntax errors:
  - Remove extra white spaces
  - Pad strings - Strings can be padded with spaces or other characters to a certain width
  - Fix typos - Strings can be entered in many different ways
- Standardize format
- Scaling / Transformation - scaling transforms your data so that it fits within a specific scale, such as 0–100 or 0–1.
- Normalization - may also rescale the values into a range of 0–1, but the intention here is to transform the data so that its distribution is closer to normal. Rescaling alone does not change the shape of a distribution.
- Missing values:
  - Drop - If missing values in a column are rare and occur at random, the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
  - Impute - Calculate the missing value based on other observations.
  - Flag - Keep the row and mark the value as missing, since the fact that it is missing can itself be informative.
- Outliers - values that are significantly different from all other observations... they should not be removed unless there is a good reason to do so.
- In-record & cross-datasets errors - result from having two or more values in the same row or across datasets that contradict each other.
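Several of the steps in the list above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the method of any of the cited articles: the records, the field names (`city`, `age`, `income`), and the helper functions `clean_text` and `to_number` are all hypothetical, and mean imputation is just one of many possible imputation strategies.

```python
from statistics import mean

# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"city": "  Boston ", "age": "34", "income": "52000"},
    {"city": "boston", "age": "34", "income": "52000"},  # duplicate once cleaned
    {"city": "Denver", "age": "41", "income": ""},       # missing income
    {"city": "Austin ", "age": "29", "income": "61000"},
]


def clean_text(value):
    """Syntax fix: collapse extra white space and standardize to title case."""
    return " ".join(value.split()).title()


def to_number(value):
    """Type conversion: numeric strings become floats, blanks become None."""
    value = value.strip()
    return float(value) if value else None


# 1. Fix syntax errors and convert types.
rows = [
    {
        "city": clean_text(r["city"]),
        "age": to_number(r["age"]),
        "income": to_number(r["income"]),
    }
    for r in raw_rows
]

# 2. Drop duplicates, keeping the first occurrence of each record.
seen = set()
deduped = []
for row in rows:
    key = (row["city"], row["age"], row["income"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 3. Missing values: flag them, then impute with the mean of observed values.
income_mean = mean(r["income"] for r in deduped if r["income"] is not None)
for row in deduped:
    row["income_missing"] = row["income"] is None
    if row["income_missing"]:
        row["income"] = income_mean

# 4. Scaling: min-max rescale income into the range 0-1.
low = min(r["income"] for r in deduped)
high = max(r["income"] for r in deduped)
span = (high - low) or 1.0  # avoid dividing by zero if all values are equal
for row in deduped:
    row["income_scaled"] = (row["income"] - low) / span
```

Flagging before imputing keeps a record of which values were filled in, so a later model can still see that the original value was missing.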
Data Encoding
- Data Preprocessing
- ...predict categories with classification
- Few Shot Learning
To use categorical data for machine classification, you need to encode the text labels into a numeric form. There are two common encodings.
- One is label encoding, which replaces each text label value with a number.
- The other is one-hot encoding, which turns each text label value into its own column holding a binary value (1 or 0).
Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred for categories with no natural order, as label encoding can lead the machine learning algorithm to treat the encoded column as ordered. Machine learning algorithms explained | Martin Heller - InfoWorld
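The two encodings can be compared side by side with a hand-rolled sketch in plain Python. The `colors` column and its values are hypothetical, and in practice a framework function would normally do this conversion; the code below only illustrates what such a function produces.

```python
# Hypothetical categorical column; the label values are made up for illustration.
colors = ["red", "green", "blue", "green"]

# Sort the distinct labels so the mapping is deterministic.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: each text label is replaced with a single integer.
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[color] for color in colors]

# One-hot encoding: each label becomes its own 0/1 column, so every row
# has exactly one 1, in the column of its category.
one_hot_encoded = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

print(label_encoded)    # [2, 1, 0, 1]
print(one_hot_encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

In the label-encoded column the integers suggest that blue < green < red, an ordering the original labels do not have. The one-hot rows carry no such ordering, at the cost of one column per category.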