Difference between revisions of "Datasets"
m (→Sources) |
m |
||
| Line 5: | Line 5: | ||
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools | |description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools | ||
}} | }} | ||
| − | [ | + | [https://www.youtube.com/results?search_query=training+datasets YouTube search...] |
| − | [ | + | [https://www.google.com/search?q=datasets+training+deep+machine+learning+artificial+intelligence+ML+AI ...Google search] |
* [[AI Governance]] / [[Algorithm Administration]] | * [[AI Governance]] / [[Algorithm Administration]] | ||
| Line 27: | Line 27: | ||
** [[Evaluation - Measures]] | ** [[Evaluation - Measures]] | ||
* [[Train, Validate, and Test]] | * [[Train, Validate, and Test]] | ||
| − | * [ | + | * [https://www.openml.org/search?type=data OpenML datasets] |
| − | * [ | + | * [https://pathmind.com/wiki/datasets-ml Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind] |
| − | * [ | + | * [https://paperswithcode.com/paper/towards-automatic-threat-detection-a-survey/review/ Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK] |
| − | Datasets (often in combination with algorithms) are becoming more important themselves and can sometimes be seen as the primary intellectual output of the research. The revelations about [ | + | Datasets (often in combination with algorithms) are becoming more important themselves and can sometimes be seen as the primary intellectual output of the research. The revelations about [https://news.google.com/topics/CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZqYkd4cWMyMDNOQklDWlc0b0FBUAE Cambridge Analytica] highlights the importance of datasets and data collection. Reference also: [[Privacy]] |
== Sources == | == Sources == | ||
| − | * [ | + | * [https://mlcommons.org/en/ MLCommons] ...[https://techcrunch.com/2020/12/03/mlcommons-debuts-first-public-database-for-ai-researchers-with-86000-hours-of-speech/ MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch] |
| − | * [ | + | * [https://quac.ai/ Question Answering in Context (QuAC)] ...Question Answering in Context for modeling, understanding, and participating in information seeking dialog. |
| − | * [ | + | * [https://tatoeba.org/eng Tatoeba] a collection of sentences and translations - [https://www.manythings.org/anki/ Tab-delimited Bilingual Sentence Pairs] |
| − | * [ | + | * [https://www.kaggle.com/datasets Kaggle Datasets] |
| − | * [ | + | * [https://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)] ...[[COVID-19]] |
* [https://archive.ics.uci.edu/ml/datasets.php?format=&task=reg&att=num&area=&numAtt=10to100&numIns=&type=&sort=typeUp&view=list UC Irvine Machine Learning Repository] | * [https://archive.ics.uci.edu/ml/datasets.php?format=&task=reg&att=num&area=&numAtt=10to100&numIns=&type=&sort=typeUp&view=list UC Irvine Machine Learning Repository] | ||
| − | * [ | + | * [https://yann.lecun.com/exdb/mnist/ MNIST database] |
| − | * [ | + | * [https://datahub.io/collections Collections | DataHub] |
| − | * [ | + | * [https://registry.opendata.aws/ Registry of Open Data on AWS | Amazon] |
| − | * [ | + | * [https://www.google.com/publicdata/directory Public Data | Google] |
| − | * [ | + | * [https://cloud.google.com/bigquery/public-data/ BigQuery public datasets | Google] |
| − | * [ | + | * [https://storage.googleapis.com/openimages/web/index.html Open Images | Google] |
| − | * [ | + | * [https://www.microsoft.com/en-us/research/academic-program/data-science-microsoft-research/ Data Science for Research | Microsoft] |
| − | * [ | + | * [https://www.kdnuggets.com/datasets/index.html Datasets for Data Mining and Data Science | KDnuggets] |
| − | * [ | + | * [https://public.enigma.com/ Enigma Public] |
| − | * [ | + | * [https://dataportals.org/ A Comprehensive List of Open Data Portals from Around the World | DataPortals.org] |
| − | * [ | + | * [https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/ OpenDataSoft] |
| − | * [ | + | * [https://knoema.com/atlas/sources World Data Atlas | Knoema] |
| − | * [ | + | * [https://www.openml.org/search?type=data The Open Machine Learning project | OpenML.org] |
| − | * [ | + | * [https://www.researchpipeline.com/mediawiki/index.php?title=Main_Page World's Free Online Data | Research Pipeline] |
| − | * [ | + | * [https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research List of datasets for machine learning research | Wikipedia] |
| − | * [ | + | * [https://resources.wolframcloud.com/NeuralNetRepository Neural Net Repository | Wolfram] |
| − | * [ | + | * [https://deeplearning4j.org/opendata Open Data for Deep Learning & Machine Learning | 4j] |
| − | * [ | + | * [https://catalog.data.gov/dataset Data Catalog | Data.gov] |
| − | * [ | + | * [https://github.com/timzhang642/3D-Machine-Learning#datasets 3D-Machine-Learning | GitHub] |
| − | ** [ | + | ** [https://github.com/timzhang642/3D-Machine-Learning#3d_models 3D Models] |
| − | ** [ | + | ** [https://github.com/timzhang642/3D-Machine-Learning#3d_scenes 3D Scenes] |
| − | * [ | + | * [https://www.usgs.gov/news/us-geological-survey-and-us-department-energy-release-online-public-dataset-and-viewer-us-wind Wind Turbine Map and Database | USGS & DOE] |
| − | * [ | + | * [https://isogg.org/wiki/Autosomal_DNA_testing_comparison_chart Autosomal DNA] |
| − | * [ | + | * [https://host.robots.ox.ac.uk/pascal/VOC Pascal Visual Object Classes Challenge (VOC)] |
| − | * [ | + | * [https://open.nasa.gov/ OpenNASA] |
| − | * [ | + | * [https://kelvins.esa.int/collision-avoidance-challenge/data/ Data: Close encounters between two objects |][https://www.esa.int/ European Space Agency (ESA)] |
| − | * [ | + | * [https://lib.stat.cmu.edu/jasadata/ JASA Data Archive | Journal of the American Statistical Association] |
| − | * [ | + | * [https://lib.stat.cmu.edu/datasets/ Datasets Archive | Journal of the American Statistical Association] |
| − | * [ | + | * [https://data.world/ Data.World] |
| − | * [ | + | * [https://archive.org/details/datasets The Dataset Collection | Archive.org] |
| − | * [ | + | * [https://www.archive-it.org/explore?show=Collections Collections |Archive-it.org] |
| − | * [ | + | * [https://ec.europa.eu/eurostat/data/database Eurostat | EU statistical office] |
| − | * [ | + | * [https://www.re3data.org/ Re3data] |
| − | * [ | + | * [https://fairsharing.org/ Resource on data and metadata standards - open research data | FAIRsharing] |
| − | * [ | + | * [https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/ List of Public Data Sources Fit for Machine Learning | bigml] |
| − | * [ | + | * [https://skymind.ai/wiki/open-datasets Open Datasets | Skymind] |
| − | * [ | + | * [https://apps.who.int/gho/data/node.resources Global Health Observatory resources | World Health Organization (WHO)] |
| − | * [ | + | * [https://wonder.cdc.gov/Welcome.html CDC WONDER | Center for Disease Control (CDC)] |
| − | * [ | + | * [https://data.medicare.gov/ US health insurance program | Medicare] |
| − | * [ | + | * [https://data.imf.org International economy |International Monetary Fund (IMF)] |
| − | * [ | + | * [https://datacatalog.worldbank.org/search/datasets Data Catalog }| The World Bank] |
| − | * [ | + | * [https://www.quandl.com/ Financial and economic | Quandl] |
| − | ** [ | + | ** [https://www.quandl.com/alternative-data Alternative data | Quandl] |
| − | * [ | + | * [https://github.com/awesomedata/awesome-public-datasets#publicdomains PublicDomains | GitHub] |
| − | * [ | + | * [https://github.com/BuzzFeedNews/everything datasets and related content | BuzzFeed - GitHub] |
| − | * [ | + | * [https://data.fivethirtyeight.com/ Sports, politics, economics, and other spheres of life | FiveThirtyEight] |
| − | * [ | + | * [https://github.com/endgameinc/ember EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub] |
| − | * [ | + | * [https://www.reddit.com/r/datasets/ r/datasets | reddit] |
| − | * [ | + | * [https://www.microsoft.com/en-us/download/details.aspx?id=55594&WT.mc_id=rss_alldownloads_all Microsoft Information-Seeking Conversation (MISC)] - audio and video signals; transcripts of conversation |
| − | * [ | + | * [https://www.clips.uantwerpen.be/conll2003/ner/ Language-Independent Named Entity Recognition (II)] |
| − | * [ | + | * [https://www.robots.ox.ac.uk/~vgg/data/vgg_face/ VGG | Oxford] |
| − | * [ | + | * [https://challenge2019.perfectcorp.com/ Perfect-500K] beauty and personal care |
| − | * [ | + | * [https://voice.mozilla.org/en Mozilla’s Common Voice project] collect human voices |
| − | * [ | + | * [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10 and CIFAR-100] are labeled subsets of the 80 million tiny images dataset. | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research] |
== Networks == | == Networks == | ||
* [[Bidirectional Encoder Representations from Transformers (BERT)]] | * [[Bidirectional Encoder Representations from Transformers (BERT)]] | ||
* [[ResNet-50]] | * [[ResNet-50]] | ||
| − | * [ | + | * [https://en.wikipedia.org/wiki/ImageNet ImageNet | Wikipedia] |
| − | * [ | + | * [https://en.wikipedia.org/wiki/AlexNet AlexNet | Wikipedia] |
| − | * [ | + | * [https://wordnet.princeton.edu/ WordNet] |
== Articles == | == Articles == | ||
| − | * [ | + | * [https://www.forbes.com/sites/korihale/2019/06/25/microsoft-scraps-10-million-facial-recognition-photos-on-the-low/#6672d61949f2 Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale -Forbes] |
| − | * [ | + | * [https://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/ The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI] |
| − | * [ | + | * [https://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279 The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium] |
| − | * [ | + | * [https://www.altexsoft.com/blog/datascience/best-public-machine-learning-datasets/ Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft] |
| − | * [ | + | * [https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | PRANAV DAR - Analytics Vidhya] |
{|<!-- T --> | {|<!-- T --> | ||
| Line 125: | Line 125: | ||
<b>GCP Public Datasets Program: Share and analyze large-scale global datasets ([[Google]] Cloud Next '17) | <b>GCP Public Datasets Program: Share and analyze large-scale global datasets ([[Google]] Cloud Next '17) | ||
</b><br>Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by [[Google]] Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets. | </b><br>Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by [[Google]] Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets. | ||
| − | Missed the conference? Watch all the talks here: | + | Missed the conference? Watch all the talks here: https://goo.gl/c1Vs3h Watch more talks about Big Data & Machine Learning here: https://goo.gl/OcqI9k |
|} | |} | ||
|}<!-- B --> | |}<!-- B --> | ||
| Line 231: | Line 231: | ||
|}<!-- B --> | |}<!-- B --> | ||
| − | * [ | + | * [https://www.quora.com/What-are-the-alternatives-to-CrowdFlower Human in the Loop...] |
| − | ** [ | + | ** [https://www.mturk.com/ Amazon Mechanical Turk (MTurk)] - [https://blog.mturk.com/using-mturk-with-amazon-sagemaker-for-supervised-learning-ml-bc30f94e1c0d Using MTurk with Amazon SageMaker for Supervised Learning (ML)] |
| − | ** [ | + | ** [https://gengo.ai/ Gengo.ai] - high-quality multilingual data with a human touch for machine learning |
| − | ** [ | + | ** [https://visit.figure-eight.com/crowdflower-ai-info-old.html Figure Eight CrowdFlower AI] - build a state-of-the-art machine learning model trained with human labeled data |
Revision as of 20:53, 28 January 2023
YouTube search... ...Google search
- AI Governance / Algorithm Administration
- Visualization
- Facets | Google ...contains two robust visualizations to aid in understanding and analyzing machine learning datasets.
- Hyperparameters
- Evaluation
- Train, Validate, and Test
- OpenML datasets
- Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind
- Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK
Datasets (often in combination with algorithms) are becoming increasingly important in their own right and can sometimes be seen as the primary intellectual output of research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy
Sources
- MLCommons ...MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch
- Question Answering in Context (QuAC) ...Question Answering in Context for modeling, understanding, and participating in information seeking dialog.
- Tatoeba a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
- Kaggle Datasets
- COVID-19 Open Research Dataset (CORD-19) ...COVID-19
- UC Irvine Machine Learning Repository
- MNIST database
- Collections | DataHub
- Registry of Open Data on AWS | Amazon
- Public Data | Google
- BigQuery public datasets | Google
- Open Images | Google
- Data Science for Research | Microsoft
- Datasets for Data Mining and Data Science | KDnuggets
- Enigma Public
- A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
- OpenDataSoft
- World Data Atlas | Knoema
- The Open Machine Learning project | OpenML.org
- World's Free Online Data | Research Pipeline
- List of datasets for machine learning research | Wikipedia
- Neural Net Repository | Wolfram
- Open Data for Deep Learning & Machine Learning | Deeplearning4j
- Data Catalog | Data.gov
- 3D-Machine-Learning | GitHub
- Wind Turbine Map and Database | USGS & DOE
- Autosomal DNA
- Pascal Visual Object Classes Challenge (VOC)
- OpenNASA
- Data: Close encounters between two objects | European Space Agency (ESA)
- JASA Data Archive | Journal of the American Statistical Association
- Datasets Archive | Journal of the American Statistical Association
- Data.World
- The Dataset Collection | Archive.org
- Collections | Archive-it.org
- Eurostat | EU statistical office
- Re3data
- Resource on data and metadata standards - open research data | FAIRsharing
- List of Public Data Sources Fit for Machine Learning | bigml
- Open Datasets | Skymind
- Global Health Observatory resources | World Health Organization (WHO)
- CDC WONDER | Centers for Disease Control and Prevention (CDC)
- US health insurance program | Medicare
- International economy | International Monetary Fund (IMF)
- Data Catalog | The World Bank
- Financial and economic | Quandl
- PublicDomains | GitHub
- datasets and related content | BuzzFeed - GitHub
- Sports, politics, economics, and other spheres of life | FiveThirtyEight
- EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub
- r/datasets | reddit
- Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
- Language-Independent Named Entity Recognition (II)
- VGG | Oxford
- Perfect-500K beauty and personal care
- Mozilla’s Common Voice project - collects human voices
- CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research
Networks
- Bidirectional Encoder Representations from Transformers (BERT)
- ResNet-50
- ImageNet | Wikipedia
- AlexNet | Wikipedia
- WordNet
Articles
- Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale - Forbes
- The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI
- The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium
- Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft
- 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | Pranav Dar - Analytics Vidhya
- Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight CrowdFlower AI - build a state-of-the-art machine learning model trained with human labeled data