Difference between revisions of "Datasets"

Revision as of 12:30, 7 September 2020

AI Governance
- Data Governance
  - Data Science
  - Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog
  - Managed Vocabularies
  - Benchmarks
  - Batch Norm(alization) & Standardization
  - Data Preprocessing
  - Feature Exploration/Learning
  - Data Augmentation, Data Labeling, and Auto-Tagging
Hyperparameters
Visualization
- Facets | Google ... contains two robust visualizations to aid in understanding and analyzing machine learning datasets.
OpenML datasets
Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind
Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK

Datasets (often in combination with algorithms) are becoming more important in their own right and can sometimes be seen as the primary intellectual output of the research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy in Data Science

Sources

Tatoeba, a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
Kaggle Datasets
COVID-19 Open Research Dataset (CORD-19) ...COVID-19
UC Irvine Machine Learning Repository
MNIST database
Collections | DataHub
Registry of Open Data on AWS | Amazon
Public Data | Google
BigQuery public datasets | Google
Open Images | Google
Data Science for Research | Microsoft
Datasets for Data Mining and Data Science | KDnuggets
Enigma Public
A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
OpenDataSoft
World Data Atlas | Knoema
The Open Machine Learning project | OpenML.org
World's Free Online Data | Research Pipeline
List of datasets for machine learning research | Wikipedia
Neural Net Repository | Wolfram
Open Data for Deep Learning & Machine Learning | 4j
Data Catalog | Data.gov
3D-Machine-Learning | GitHub
- 3D Models
- 3D Scenes
Wind Turbine Map and Database | USGS & DOE
Autosomal DNA
Pascal Visual Object Classes Challenge (VOC)
OpenNASA
Data: Close encounters between two objects | European Space Agency (ESA)
JASA Data Archive | Journal of the American Statistical Association
Datasets Archive | Journal of the American Statistical Association
Data.World
The Dataset Collection | Archive.org
Collections | Archive-it.org
Eurostat | EU statistical office
Re3data
Resource on data and metadata standards - open research data | FAIRsharing
List of Public Data Sources Fit for Machine Learning | bigml
Open Datasets | Skymind
Global Health Observatory resources | World Health Organization (WHO)
CDC WONDER | Centers for Disease Control and Prevention (CDC)
US health insurance program | Medicare
International economy | International Monetary Fund (IMF)
Data Catalog | The World Bank
Financial and economic | Quandl
- Alternative data | Quandl
PublicDomains | GitHub
datasets and related content | BuzzFeed - GitHub
Sports, politics, economics, and other spheres of life | FiveThirtyEight
EMBER: benign and malicious Windows portable executable (PE) files | Endgame - GitHub
r/datasets | reddit
Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
Language-Independent Named Entity Recognition (II)
VGG | Oxford
Perfect-500K beauty and personal care
Mozilla’s Common Voice project collects human voices
CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research

Networks

Articles

Question Answering Beyond SQuAD: Larger Datasets and New Domains, with Branden Chan, deepset.ai
Branden Chan, an NLP engineer at deepset.ai in Berlin, presents "Question Answering Beyond SQuAD: Larger Datasets and New Domains" in an online program held May 26, 2020, organized and moderated by Seth Grimes for the New York Natural Language Processing meetup (https://www.meetup.com/NY-NLP) and partners.

Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight (formerly CrowdFlower) AI - build a state-of-the-art machine learning model trained with human-labeled data

@@ Line 199: / Line 199: @@
 {| class="wikitable" style="width: 550px;"
 ||
-<youtube>koiTTim4M-s</youtube>
+<youtube>E80qHThomok</youtube>
-<b>HH2
+<b>Question Answering Beyond SQuAD: Larger Datasets and New Domains, with Branden Chan, deepset.ai
-</b><br>BB2
+</b><br>Branden Chan, an NLP Engineer at deepset.ai in Berlin, presents on Question Answering Beyond SQuAD: Larger Datasets and New Domains in an online program, May 26, 2020, organized and moderated by Seth Grimes for the New York Natural Language Processing meetup (https://www.meetup.com/NY-NLP) and partners.
 |}
 |}<!-- B -->
-{|<!-- T -->
-| valign="top" |
-{| class="wikitable" style="width: 550px;"
-||
-<youtube>tChcZpBbTTA</youtube>
-<b>HH1
-</b><br>BB1
-|}
-|}<!-- B -->
 {|<!-- T -->
 | valign="top" |
