Data Quality

Revision as of 15:29, 19 September 2020 by BPeat

Testing and Documenting Your Data Doesn't Have to Suck | Superconductive
Data teams everywhere struggle with pipeline debt: untested, undocumented assumptions that drain productivity, erode trust in data and kill team morale. Unfortunately, rolling your own data validation tooling usually takes weeks or months. In addition, most teams suffer from “documentation rot,” where data documentation is hard to maintain, and therefore chronically outdated, incomplete, and only semi-trusted. Great Expectations (http://bit.ly/2OtmY1W), the leading open source project for fighting pipeline debt, can solve these problems for you. We're excited to share new features and under-the-hood architecture with the data community.

About the speaker: Abe Gong is a core contributor to the Great Expectations open source library, and CEO and Co-founder at Superconductive. Prior to Superconductive, Abe was Chief Data Officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe has been leading teams using data and technology to solve problems in health care, consumer wellness, and public policy for over a decade. Abe earned his PhD at the University of Michigan in Public Policy, Political Science, and Complex Systems. He speaks and writes regularly on data, healthcare, and data ethics.
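
The idea in the talk summary above, turning untested and undocumented assumptions about data into explicit checks, can be sketched in a few lines of plain Python. This is not the Great Expectations API: the helper names (`expect_not_null`, `expect_between`, `validate`) and the sample rows are hypothetical and exist only to illustrate the pattern of named, reusable data assertions.

```python
def expect_not_null(column):
    """Build a check that returns the indexes of rows where `column` is missing."""
    def check(rows):
        return [i for i, row in enumerate(rows) if row.get(column) is None]
    return check


def expect_between(column, low, high):
    """Build a check that returns the indexes of rows whose value is outside [low, high]."""
    def check(rows):
        return [
            i for i, row in enumerate(rows)
            if row.get(column) is not None and not (low <= row[column] <= high)
        ]
    return check


def validate(rows, checks):
    """Run every named check and report the failing row indexes for each one."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data: one missing age and one implausible age.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
]

report = validate(rows, {
    "age is present": expect_not_null("age"),
    "age is between 0 and 120": expect_between("age", 0, 120),
})
print(report)
```

Because each check carries a readable name, the resulting report also serves as lightweight documentation of what the pipeline assumes about its input.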

"Data Quality Check In Machine Learning"
The world of data quality checks in Machine Learning is expanding at an unimaginable pace. Researchers estimate that by 2020, every human would create 1.7MB of information each second. The true power of data can be unlocked only when it is refined and transformed into a high-quality state. Many businesses and researchers believe that, considering the pace of data growth, data quality is one of the primary concerns for data-driven enterprises and their associated processes. Most operational processes and analytics rely on good-quality data to be efficient and consistent in output. The data quality process has evolved in its capacity, but the demand for pace and efficiency has grown extensively. Data management experts believe that data quality remains a recurring bottleneck for data management and business teams because of growing data volumes and the complexity involved in deriving quality insights. Innovative technologies such as Big Data, AI, ML etc. ML algorithms can learn from human decision labels in the training datasets and replicate the scenarios in real time. However, ML algorithms are also prone to biases that may be reflected in these data sets and are learnt through fresh data sets. These biases could lead to erosion of data quality. External validity testing and audits on a regular basis will help in avoiding such situations.

An Approach to Data Quality for Netflix Personalization Systems
Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests. Our personalization system is powered by several machine learning models. These models are only as good as the data that is fed to them. They are trained using hundreds of terabytes of data every day, which makes it a non-trivial challenge to track and maintain data quality. To ensure high data quality, we require three things: automated monitoring of data; visualization to observe changes in the metrics over time; and mechanisms to control data-related regressions, wherein a data regression is defined as data loss or distributional shifts over a given period of time. In this talk, we will describe infrastructure and methods that we used to achieve the above:
  - ‘Swimlanes’ that help us define data boundaries for different environments that are used to develop, evaluate and deploy ML models
  - Pipelines that aggregate data metrics from various sources within each swimlane
  - Time series and dashboard visualization tools across an atypically long period of time
  - Automated audits that periodically monitor these metrics to detect data regressions

We will explain how we run aggregation jobs to optimize metric computations, SQL queries to quickly define/test individual metrics and other ETL jobs to power the visualization/audits tools using Spark.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business. Connect with us: Website: http://databricks.com Facebook: http://www.facebook.com/databricksinc
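
The Netflix talk summary above defines a data regression as data loss or a distributional shift over a period of time. A minimal sketch of an automated audit for those two conditions follows. It is not the method used in the talk: it uses plain Python rather than Spark, and the function name `audit_metric`, the thresholds `max_drop` and `max_z`, and the sample batches are all assumptions made for illustration.

```python
from statistics import mean, stdev


def audit_metric(history, today, max_drop=0.5, max_z=3.0):
    """Compare today's batch of values against earlier batches.

    history: list of earlier batches, each a non-empty list of numbers
             (at least two batches are needed to estimate the spread).
    today:   today's batch, a list of numbers.
    Returns a list of findings; an empty list means no regression was detected.
    """
    findings = []

    # Data loss: today's row count is far below the historical average count.
    typical_count = mean(len(batch) for batch in history)
    if len(today) < max_drop * typical_count:
        findings.append("data loss: row count dropped")

    # Distributional shift: today's mean is far from the historical batch means,
    # measured in standard deviations of those batch means.
    batch_means = [mean(batch) for batch in history]
    center, spread = mean(batch_means), stdev(batch_means)
    if today and spread > 0:
        z = abs(mean(today) - center) / spread
        if z > max_z:
            findings.append("distribution shift: mean moved")
    return findings


# Hypothetical metric history: four earlier daily batches.
history = [[10, 11, 9, 10], [10, 12, 9, 11], [9, 10, 11, 10], [11, 10, 10, 9]]

print(audit_metric(history, [10, 11, 10, 9]))   # a normal day
print(audit_metric(history, [10]))              # most rows missing
print(audit_metric(history, [20, 21, 19, 22]))  # values roughly doubled
```

Comparing only the mean is a deliberately crude notion of distributional shift; a production audit would more likely track several statistics per metric and per swimlane.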

Data Cleaning

When it comes to utilizing ML data, most of the time is spent on cleaning data sets or creating a dataset that is free of errors. Setting up a quality plan, filling missing values, removing rows, and reducing data size are some of the best practices used for data cleaning in Machine Learning. Data Cleaning in Machine Learning: Best Practices and Methods | Smishad Thomas

Overall, incorrect data is either removed, corrected, or imputed... The Ultimate Guide to Data Cleaning | Omar Elgabry - Towards Data Science

  1. Irrelevant data - data that is not actually needed and does not fit the context of the problem we’re trying to solve.
  2. Duplicates - data points that are repeated in your dataset.
  3. Type conversion - Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on. Categorical values can be converted into and from numbers if needed.
  4. Syntax errors:
    1. Remove extra white spaces
    2. Pad strings - Strings can be padded with spaces or other characters to a certain width
    3. Fix typos - Strings can be entered in many different ways
  5. Standardize format
  6. Scaling / Transformation - Scaling means transforming your data so that it fits within a specific scale, such as 0–100 or 0–1.
  7. Normalization - While normalization may also rescale the values into a range of 0–1, the intention here is to transform the data so that it is normally distributed.
  8. Missing values:
    1. Drop - If missing values in a column are rare and occur at random, then the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
    2. Impute - It means to calculate the missing value based on other observations.
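    3. Flag - Mark the value as missing instead of filling it in, since the fact that a value is missing can itself be informative.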
  9. Outliers - Values that are significantly different from all other observations. They should not be removed unless there is a good reason to do so.
  10. In-record & cross-dataset errors - These result from two or more values, in the same row or across datasets, that contradict each other.
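
Several of the steps in the list above (syntax errors, type conversion, duplicates, missing values, and scaling) can be sketched together in plain Python. This is an illustrative sketch under stated assumptions, not the procedure from the cited guide: the record fields (`name`, `joined`, `score`), the `%Y-%m-%d` date format, mean imputation, and the function name `clean` are all hypothetical choices.

```python
from datetime import datetime
from statistics import mean


def clean(records):
    """Apply a few cleaning steps to a list of dict records with string fields.

    Assumes at least one record has a non-empty score.
    """
    cleaned, seen = [], set()
    for rec in records:
        # Syntax errors / standardize format: collapse extra white space, fix case.
        name = " ".join(rec["name"].split()).title()
        # Type conversion: dates become date objects, numbers become floats.
        joined = datetime.strptime(rec["joined"].strip(), "%Y-%m-%d").date()
        raw_score = rec["score"].strip()
        score = float(raw_score) if raw_score else None
        # Duplicates: skip records already seen after standardization.
        key = (name, joined)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "joined": joined, "score": score})

    # Missing values: flag the row, then impute with the mean of observed scores.
    observed = [r["score"] for r in cleaned if r["score"] is not None]
    fill = mean(observed)
    for r in cleaned:
        r["score_was_missing"] = r["score"] is None
        if r["score"] is None:
            r["score"] = fill

    # Scaling: min-max rescale the scores into the range 0-1.
    low, high = min(observed), max(observed)
    for r in cleaned:
        r["score_scaled"] = (r["score"] - low) / (high - low) if high > low else 0.0
    return cleaned


# Hypothetical raw input: stray white space, a duplicate, and a missing score.
raw = [
    {"name": "  ada   lovelace ", "joined": "2020-01-05", "score": "80"},
    {"name": "Ada Lovelace", "joined": "2020-01-05 ", "score": "80"},
    {"name": "alan turing", "joined": "2020-02-10", "score": ""},
    {"name": "grace hopper", "joined": "2020-03-15", "score": "100"},
]

rows = clean(raw)
for row in rows:
    print(row)
```

Flagging before imputing keeps a record of which values were originally missing, so a later analysis can still tell filled values from observed ones.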