Difference between revisions of "Datasets"
(27 intermediate revisions by the same user not shown) | |||
Line 8: | Line 8: | ||
[http://www.google.com/search?q=datasets+training+deep+machine+learning+artificial+intelligence+ML+AI ...Google search] | [http://www.google.com/search?q=datasets+training+deep+machine+learning+artificial+intelligence+ML+AI ...Google search] | ||
+ | * [[Benchmarks]] | ||
* [[Batch Norm(alization) & Standardization]] | * [[Batch Norm(alization) & Standardization]] | ||
* [[Data Preprocessing]] | * [[Data Preprocessing]] | ||
* [[Feature Exploration/Learning]] | * [[Feature Exploration/Learning]] | ||
− | * [[ | + | * [[Hyperparameter]]s |
− | * [[Data Augmentation]] | + | * [[Data Augmentation]], Data Labeling, and Auto-Tagging |
* [[Visualization]] | * [[Visualization]] | ||
* [[Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog]] | * [[Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog]] | ||
+ | * [[Natural Language Processing (NLP)#Managed Vocabularies |Managed Vocabularies]] | ||
+ | * [http://www.openml.org/search?type=data OpenML datasets] | ||
+ | * [http://pathmind.com/wiki/datasets-ml Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind] | ||
Datasets (often in combination with algorithms) are becoming more important themselves and can sometimes be seen as the primary intellectual output of the research. The revelations about [http://news.google.com/topics/CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZqYkd4cWMyMDNOQklDWlc0b0FBUAE Cambridge Analytica] highlights the importance of datasets and data collection. Reference also: [[Privacy in Data Science]] | Datasets (often in combination with algorithms) are becoming more important themselves and can sometimes be seen as the primary intellectual output of the research. The revelations about [http://news.google.com/topics/CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZqYkd4cWMyMDNOQklDWlc0b0FBUAE Cambridge Analytica] highlights the importance of datasets and data collection. Reference also: [[Privacy in Data Science]] | ||
== Sources == | == Sources == | ||
− | + | * [http://tatoeba.org/eng Tatoeba] a collection of sentences and translations - [http://www.manythings.org/anki/ Tab-delimited Bilingual Sentence Pairs] | |
− | |||
* [http://www.kaggle.com/datasets Kaggle Datasets] | * [http://www.kaggle.com/datasets Kaggle Datasets] | ||
+ | * [http://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)] ...[[COVID-19]] | ||
* [http://mlr.cs.umass.edu/ml/ UC Irvine Machine Learning Repository] | * [http://mlr.cs.umass.edu/ml/ UC Irvine Machine Learning Repository] | ||
** [http://archive.ics.uci.edu/ml/datasets.html Archive | UC Irvine Machine Learning Repository] | ** [http://archive.ics.uci.edu/ml/datasets.html Archive | UC Irvine Machine Learning Repository] | ||
Line 28: | Line 32: | ||
* [http://registry.opendata.aws/ Registry of Open Data on AWS | Amazon] | * [http://registry.opendata.aws/ Registry of Open Data on AWS | Amazon] | ||
* [http://www.google.com/publicdata/directory Public Data | Google] | * [http://www.google.com/publicdata/directory Public Data | Google] | ||
+ | * [http://cloud.google.com/bigquery/public-data/ BigQuery public datasets | Google] | ||
* [http://storage.googleapis.com/openimages/web/index.html Open Images | Google] | * [http://storage.googleapis.com/openimages/web/index.html Open Images | Google] | ||
* [http://www.microsoft.com/en-us/research/academic-program/data-science-microsoft-research/ Data Science for Research | Microsoft] | * [http://www.microsoft.com/en-us/research/academic-program/data-science-microsoft-research/ Data Science for Research | Microsoft] | ||
Line 41: | Line 46: | ||
* [http://deeplearning4j.org/opendata Open Data for Deep Learning & Machine Learning | 4j] | * [http://deeplearning4j.org/opendata Open Data for Deep Learning & Machine Learning | 4j] | ||
* [http://catalog.data.gov/dataset Data Catalog | Data.gov] | * [http://catalog.data.gov/dataset Data Catalog | Data.gov] | ||
+ | * [http://github.com/timzhang642/3D-Machine-Learning#datasets 3D-Machine-Learning | GitHub] | ||
+ | ** [http://github.com/timzhang642/3D-Machine-Learning#3d_models 3D Models] | ||
+ | ** [http://github.com/timzhang642/3D-Machine-Learning#3d_scenes 3D Scenes] | ||
* [http://www.usgs.gov/news/us-geological-survey-and-us-department-energy-release-online-public-dataset-and-viewer-us-wind Wind Turbine Map and Database | USGS & DOE] | * [http://www.usgs.gov/news/us-geological-survey-and-us-department-energy-release-online-public-dataset-and-viewer-us-wind Wind Turbine Map and Database | USGS & DOE] | ||
* [http://isogg.org/wiki/Autosomal_DNA_testing_comparison_chart Autosomal DNA] | * [http://isogg.org/wiki/Autosomal_DNA_testing_comparison_chart Autosomal DNA] | ||
Line 67: | Line 75: | ||
* [http://github.com/endgameinc/ember EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub] | * [http://github.com/endgameinc/ember EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub] | ||
* [http://www.reddit.com/r/datasets/ r/datasets | reddit] | * [http://www.reddit.com/r/datasets/ r/datasets | reddit] | ||
+ | * [http://www.microsoft.com/en-us/download/details.aspx?id=55594&WT.mc_id=rss_alldownloads_all Microsoft Information-Seeking Conversation (MISC)] - audio and video signals; transcripts of conversation | ||
+ | * [http://www.clips.uantwerpen.be/conll2003/ner/ Language-Independent Named Entity Recognition (II)] | ||
+ | * [http://www.robots.ox.ac.uk/~vgg/data/vgg_face/ VGG | Oxford] | ||
+ | * [http://challenge2019.perfectcorp.com/ Perfect-500K] beauty and personal care | ||
+ | * [http://voice.mozilla.org/en Mozilla’s Common Voice project] collect human voices | ||
+ | * [http://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)] in response to the COVID-19 pandemic | ||
+ | |||
+ | == Networks == | ||
+ | * [[Bidirectional Encoder Representations from Transformers (BERT)]] | ||
+ | * [[ResNet-50]] | ||
+ | * [http://en.wikipedia.org/wiki/ImageNet ImageNet | Wikipedia] | ||
+ | * [http://en.wikipedia.org/wiki/AlexNet AlexNet | Wikipedia] | ||
+ | * [http://wordnet.princeton.edu/ WordNet] | ||
== Articles == | == Articles == | ||
+ | * [http://www.forbes.com/sites/korihale/2019/06/25/microsoft-scraps-10-million-facial-recognition-photos-on-the-low/#6672d61949f2 Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale -Forbes] | ||
* [http://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/ The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI] | * [http://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/ The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI] | ||
* [http://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279 The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium] | * [http://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279 The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium] | ||
* [http://www.altexsoft.com/blog/datascience/best-public-machine-learning-datasets/ Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft] | * [http://www.altexsoft.com/blog/datascience/best-public-machine-learning-datasets/ Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft] | ||
* [http://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | PRANAV DAR - Analytics Vidhya] | * [http://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | PRANAV DAR - Analytics Vidhya] | ||
+ | |||
− | + | <youtube>jYvBmJo7qjc</youtube> | |
− | <youtube> | ||
− | |||
<youtube>dGM1mgkIayY</youtube> | <youtube>dGM1mgkIayY</youtube> | ||
<youtube>tTjZqz6qk1s</youtube> | <youtube>tTjZqz6qk1s</youtube> | ||
Line 87: | Line 108: | ||
<youtube>KoA1lVRwHrc</youtube> | <youtube>KoA1lVRwHrc</youtube> | ||
<youtube>M1WQxTofGe8</youtube> | <youtube>M1WQxTofGe8</youtube> | ||
+ | <youtube>koiTTim4M-s</youtube> | ||
+ | <youtube>tChcZpBbTTA</youtube> | ||
* [http://www.quora.com/What-are-the-alternatives-to-CrowdFlower Human in the Loop...] | * [http://www.quora.com/What-are-the-alternatives-to-CrowdFlower Human in the Loop...] |
Revision as of 16:42, 26 April 2020
- Benchmarks
- Batch Norm(alization) & Standardization
- Data Preprocessing
- Feature Exploration/Learning
- Hyperparameters
- Data Augmentation, Data Labeling, and Auto-Tagging
- Visualization
- Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog
- Managed Vocabularies
- OpenML datasets
- Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind
Datasets, often in combination with algorithms, are becoming more important in their own right and can sometimes be seen as the primary intellectual output of research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy in Data Science
Sources
- Tatoeba: a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
- Kaggle Datasets
- COVID-19 Open Research Dataset (CORD-19) ...COVID-19
- UC Irvine Machine Learning Repository
- MNIST database
- Collections | DataHub
- Registry of Open Data on AWS | Amazon
- Public Data | Google
- BigQuery public datasets | Google
- Open Images | Google
- Data Science for Research | Microsoft
- Datasets for Data Mining and Data Science | KDnuggets
- Enigma Public
- A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
- OpenDataSoft
- World Data Atlas | Knoema
- The Open Machine Learning project | OpenML.org
- World's Free Online Data | Research Pipeline
- List of datasets for machine learning research | Wikipedia
- Neural Net Repository | Wolfram
- Open Data for Deep Learning & Machine Learning | Deeplearning4j
- Data Catalog | Data.gov
- 3D-Machine-Learning | GitHub
- Wind Turbine Map and Database | USGS & DOE
- Autosomal DNA
- Pascal Visual Object Classes Challenge (VOC)
- OpenNASA
- JASA Data Archive | Journal of the American Statistical Association
- Datasets Archive | Journal of the American Statistical Association
- Data.World
- The Dataset Collection | Archive.org
- Collections | Archive-it.org
- Eurostat | EU statistical office
- Re3data
- Resource on data and metadata standards - open research data | FAIRsharing
- List of Public Data Sources Fit for Machine Learning | bigml
- Open Datasets | Skymind
- Global Health Observatory resources | World Health Organization (WHO)
- CDC WONDER | Centers for Disease Control and Prevention (CDC)
- US health insurance program | Medicare
- International economy | International Monetary Fund (IMF)
- Data Catalog | The World Bank
- Financial and economic | Quandl
- PublicDomains | GitHub
- datasets and related content | BuzzFeed - GitHub
- Sports, politics, economics, and other spheres of life | FiveThirtyEight
- EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub
- r/datasets | reddit
- Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
- Language-Independent Named Entity Recognition (II)
- VGG | Oxford
- Perfect-500K - beauty and personal care
- Mozilla’s Common Voice project - collects human voices
- COVID-19 Open Research Dataset (CORD-19) - released in response to the COVID-19 pandemic
Networks
- Bidirectional Encoder Representations from Transformers (BERT)
- ResNet-50
- ImageNet | Wikipedia
- AlexNet | Wikipedia
- WordNet
Articles
- Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale - Forbes
- The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI
- The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium
- Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft
- 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | Pranav Dar - Analytics Vidhya
- Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight (formerly CrowdFlower) AI - build a state-of-the-art machine learning model trained with human-labeled data