Difference between revisions of "Datasets"

Benchmarks
Batch Norm(alization) & Standardization
Data Preprocessing
Feature Exploration/Learning
Hyperparameters
Data Augmentation, Data Labeling, and Auto-Tagging
Visualization
Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog
Managed Vocabularies
OpenML datasets

Datasets, often in combination with algorithms, are becoming more important in their own right and can sometimes be seen as the primary intellectual output of research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy in Data Science

Sources

Tatoeba, a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
Kaggle Datasets
UC Irvine Machine Learning Repository
- Archive | UC Irvine Machine Learning Repository
MNIST database
Collections | DataHub
Registry of Open Data on AWS | Amazon
Public Data | Google
BigQuery public datasets | Google
Open Images | Google
Data Science for Research | Microsoft
Datasets for Data Mining and Data Science | KDnuggets
Enigma Public
A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
OpenDataSoft
World Data Atlas | Knoema
The Open Machine Learning project | OpenML.org
World's Free Online Data | Research Pipeline
List of datasets for machine learning research | Wikipedia
Neural Net Repository | Wolfram
Open Data for Deep Learning & Machine Learning | 4j
Data Catalog | Data.gov
3D-Machine-Learning | GitHub
- 3D Models
- 3D Scenes
Wind Turbine Map and Database | USGS & DOE
Autosomal DNA
Pascal Visual Object Classes Challenge (VOC)
OpenNASA
JASA Data Archive | Journal of the American Statistical Association
Datasets Archive | Journal of the American Statistical Association
Data.World
The Dataset Collection | Archive.org
Collections | Archive-it.org
Eurostat | EU statistical office
Re3data
Resource on data and metadata standards - open research data | FAIRsharing
List of Public Data Sources Fit for Machine Learning | bigml
Open Datasets | Skymind
Global Health Observatory resources | World Health Organization (WHO)
CDC WONDER | Centers for Disease Control and Prevention (CDC)
US health insurance program | Medicare
International economy | International Monetary Fund (IMF)
Data Catalog | The World Bank
Financial and economic | Quandl
- Alternative data | Quandl
PublicDomains | GitHub
Datasets and related content | BuzzFeed - GitHub
Sports, politics, economics, and other spheres of life | FiveThirtyEight
EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub
r/datasets | reddit
Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
Language-Independent Named Entity Recognition (II)
VGG | Oxford
Perfect-500K beauty and personal care
Mozilla’s Common Voice project collects human voices
COVID-19 Open Research Dataset (CORD-19) in response to the COVID-19 pandemic

Networks

Bidirectional Encoder Representations from Transformers (BERT)
ResNet-50
ImageNet | Wikipedia
AlexNet | Wikipedia
WordNet

Articles

Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale - Forbes
The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI
The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium
Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft
25 Open Datasets for Deep Learning Every Data Scientist Must Work With | Pranav Dar - Analytics Vidhya

Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight CrowdFlower AI - build a state-of-the-art machine learning model trained with human labeled data

@@ Line 78: / Line 78: @@
 * [http://challenge2019.perfectcorp.com/ Perfect-500K] beauty and personal care
 * [http://voice.mozilla.org/en Mozilla’s Common Voice project] collect human voices
+* [http://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)] in response to the COVID-19 pandemic
 == Networks ==

Revision as of 07:19, 18 March 2020
