Difference between revisions of "Datasets"

Revision as of 12:50, 7 September 2020

AI Governance
- Data Governance
  - Data Science
  - Master Data Management (MDM) / Feature Store / Data Lineage / Data Catalog
  - Managed Vocabularies
  - Benchmarks
  - Batch Norm(alization) & Standardization
  - Data Preprocessing
  - Feature Exploration/Learning
  - Data Augmentation, Data Labeling, and Auto-Tagging
Hyperparameters
Visualization
- Facets | Google ... contains two robust visualizations to aid in understanding and analyzing machine learning datasets.
OpenML datasets
Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind
Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK

Datasets (often in combination with algorithms) are becoming more important in their own right and can sometimes be seen as the primary intellectual output of the research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy in Data Science

Sources

Tatoeba: a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
Kaggle Datasets
COVID-19 Open Research Dataset (CORD-19) ...COVID-19
UC Irvine Machine Learning Repository
MNIST database
Collections | DataHub
Registry of Open Data on AWS | Amazon
Public Data | Google
BigQuery public datasets | Google
Open Images | Google
Data Science for Research | Microsoft
Datasets for Data Mining and Data Science | KDnuggets
Enigma Public
A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
OpenDataSoft
World Data Atlas | Knoema
The Open Machine Learning project | OpenML.org
World's Free Online Data | Research Pipeline
List of datasets for machine learning research | Wikipedia
Neural Net Repository | Wolfram
Open Data for Deep Learning & Machine Learning | 4j
Data Catalog | Data.gov
3D-Machine-Learning | GitHub
- 3D Models
- 3D Scenes
Wind Turbine Map and Database | USGS & DOE
Autosomal DNA
Pascal Visual Object Classes Challenge (VOC)
OpenNASA
Data: Close encounters between two objects | European Space Agency (ESA)
JASA Data Archive | Journal of the American Statistical Association
Datasets Archive | Journal of the American Statistical Association
Data.World
The Dataset Collection | Archive.org
Collections | Archive-it.org
Eurostat | EU statistical office
Re3data
Resource on data and metadata standards - open research data | FAIRsharing
List of Public Data Sources Fit for Machine Learning | bigml
Open Datasets | Skymind
Global Health Observatory resources | World Health Organization (WHO)
CDC WONDER | Centers for Disease Control and Prevention (CDC)
US health insurance program | Medicare
International economy | International Monetary Fund (IMF)
Data Catalog | The World Bank
Financial and economic | Quandl
- Alternative data | Quandl
PublicDomains | GitHub
datasets and related content | BuzzFeed - GitHub
Sports, politics, economics, and other spheres of life | FiveThirtyEight
EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub
r/datasets | reddit
Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
Language-Independent Named Entity Recognition (II)
VGG | Oxford
Perfect-500K beauty and personal care
Mozilla’s Common Voice project collects human voices
CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research
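
Some of the sources listed above, such as the Tatoeba sentence pairs, are distributed as tab-delimited text. The following is a minimal sketch of reading such a file with the Python standard library. It assumes each line holds a source sentence and its translation separated by a tab; real files may carry extra columns or a different layout. The sample text and the helper `read_pairs` are hypothetical.

```python
"""Parse tab-delimited bilingual sentence pairs (hypothetical sample)."""
import csv
import io

# Hypothetical sample data. A real file would be opened with
# open(path, encoding="utf-8", newline="") instead of io.StringIO.
SAMPLE = "Hello.\tBonjour.\nThank you.\tMerci.\nmalformed line without a tab\n"


def read_pairs(handle):
    """Return (source, target) tuples, skipping rows without two columns."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) >= 2 and row[0] and row[1]:
            pairs.append((row[0], row[1]))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs)
```

Skipping malformed rows rather than raising keeps the sketch tolerant of stray lines; a stricter loader might log or count them instead.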

Networks

Articles

"ImageNet: Where Have We Been? Where Are We Going?" with Fei-Fei Li Date: 9/21/2017 It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey for AI and computer vision is about half of a century. In this talk, Dr. Li will briefly discuss the key ideas and the cutting edge advances in the quest for visual intelligences in computers, focusing on work done to develop ImageNet over the years. Fei-Fei Li is currently on sabbatical as the Chief Scientist of AI/ML at Google Cloud. She is an Associate Professor in the Computer Science Department at Stanford, and the Director of the Stanford Artificial Intelligence Lab. Her main research areas are in machine learning, deep learning, computer vision, and cognitive and computational neuroscience. She has published more than 150 scientific articles in top-tier journals and conferences, including Nature, PNAS, Journal of Neuroscience, CVPR, ICCV, NIPS, ECCV, IJCV, IEEE-PAMI, etc. Li obtained her B.A. degree in physics from Princeton with High Honors, and her Ph.D. degree in electrical engineering from the California Institute of Technology (Caltech). She joined Stanford in 2009 as an assistant professor, and was promoted to associate professor with tenure in 2012.

GCP Public Datasets Program: Share and analyze large-scale global datasets (Google Cloud Next '17) Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by Google Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets. Missed the conference? Watch all the talks here: http://goo.gl/c1Vs3h Watch more talks about Big Data & Machine Learning here: http://goo.gl/OcqI9k

P2 How to download a Kaggle dataset & Install Numpy, Pandas, and more - Multiple Linear Regression What’s up yall! We are back again. How was your weekend? After yesterday's introductory episode we are jumping straight in to the nitty gritty of multiple linear regression. But first, let's do some preparation.

Accessing public datasets on Amazon S3 using Globus Demonstrates how you can easily access and download big datasets from public repositories using Globus for Amazon S3

AWS re:Invent 2017: Migrating Databases and Data Warehouses to the Cloud: Getting St (DAT317) In this introductory session, we look at how to convert and migrate your commercial databases and data warehouses to the cloud and gain your database freedom. Amazon AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) have been used to migrate tens of thousands of databases. These include Oracle and SQL Server to Amazon Aurora, Teradata and Netezza to Amazon Redshift, MongoDB to Amazon DynamoDB, and many other data source and target combinations. Learn how to easily and securely migrate your data and procedural code, enjoy flexibility and cost savings, and gain new opportunities.

Joining Datasets | Intro to Azure ML Part 6 Last time we prepared our dataset for a join. In this video we’ll use the join data module inside of Azure ML to cross reference each airport id with the airport table to find airport city, airport state, and airport name. We will briefly go over the different types of joins, then combine the three tables together. Each time we join we will add 3 columns to our dataset.
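
The kind of join described in this video can be sketched in plain Python. This is an illustration only, not the Azure ML join data module: the table contents, the column names (`origin_airport_id`, `dest_airport_id`, `city`, `state`, `name`) and the helper `left_join` are all hypothetical.

```python
"""Left-join a flights table against an airport lookup table (hypothetical data)."""

# Hypothetical airport table keyed by airport id.
AIRPORTS = {
    10397: {"city": "Atlanta", "state": "GA", "name": "Hartsfield-Jackson Atlanta International"},
    12478: {"city": "New York", "state": "NY", "name": "John F. Kennedy International"},
}

# Hypothetical flights table; each row references airports by id.
FLIGHTS = [
    {"flight": "A1", "origin_airport_id": 10397, "dest_airport_id": 12478},
    {"flight": "B2", "origin_airport_id": 12478, "dest_airport_id": 99999},
]


def left_join(rows, key, lookup, prefix):
    """Return copies of rows with city, state, and name columns added.

    Rows whose key has no match in lookup get None in the new columns,
    which mirrors the behaviour of a left outer join.
    """
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for column in ("city", "state", "name"):
            new_row[f"{prefix}_{column}"] = match.get(column)
        joined.append(new_row)
    return joined


# Each join adds three columns: one pass for origin, one for destination.
with_origin = left_join(FLIGHTS, "origin_airport_id", AIRPORTS, "origin")
with_both = left_join(with_origin, "dest_airport_id", AIRPORTS, "dest")

for row in with_both:
    print(row["flight"], row["origin_city"], "->", row["dest_city"])
```

An inner join would instead drop flight `B2`, whose destination id has no entry in the airport table; the left join keeps the row and leaves the unmatched columns empty.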

Deep learning idea for creating datasets An idea to easily take snapshots or crops of images to break larger images into nice labeled images for a database.
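
The idea of breaking a larger image into crops can be sketched as follows. This is not the method shown in the video, only a pure-Python illustration under simple assumptions: the image is a row-major list of lists, tiles are square, and the helpers `tile_boxes` and `crop` are hypothetical names.

```python
"""Split a large image into fixed-size crops (pure-Python sketch)."""


def tile_boxes(width, height, tile, stride=None):
    """Yield (left, top, right, bottom) boxes covering the image.

    Only full-size tiles are produced; a partial strip at the right or
    bottom edge is skipped.
    """
    stride = stride or tile
    for top in range(0, height - tile + 1, stride):
        for left in range(0, width - tile + 1, stride):
            yield (left, top, left + tile, top + tile)


def crop(image, box):
    """Crop a row-major list-of-lists image to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 6x4 grayscale image whose pixel value encodes its position.
image = [[10 * y + x for x in range(6)] for y in range(4)]

# Each crop is stored with a placeholder label to be filled in by a person.
dataset = [
    {"box": box, "pixels": crop(image, box), "label": None}
    for box in tile_boxes(width=6, height=4, tile=2)
]

print(len(dataset), "crops")
```

Passing a `stride` smaller than `tile` produces overlapping crops, which yields more examples from the same source image.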

VRmeta: Generating AI datasets one precise meta-tag at a time VRmeta is the world's most precise means of adding time-based descriptive metadata to both digital and immersive video. Whether the goal is to create unmatched discoverability for your entire video library, leverage metrics from all that amazing content or license those clips to increase their inbound revenue - VRmeta makes it happen. Today’s consumers have more choices than ever before for video entertainment and viewing platforms. With this explosion of choice has come complexity. Finding engaging entertainment has become time consuming and frustrating, resulting in declining engagement and viewer satisfaction. The key to overcoming this discovery challenge lies in rich, time-based descriptive metadata. VRmeta is your gateway to making this happen: VRmeta's patent-pending cross-hair and tactile navigation technology gives users the most precise means of applying metadata ever created. VRmeta gives every clip it touches time and in-frame location data registered with in and out points, all saved into .csv and .xmp sidecar files. VRmeta delivers AI precision now. VRmeta even learns your tagging vocabulary, offering users auto-completion for frequently used words and names. By applying time-based descriptive metadata at the production level, stakeholders create additional value at every stage of the video content lifecycle. VRmeta stands firmly at the nexus of artificial intelligence and healthcare, and is a recognized state-of-the-art solution central to the development of emotional AI datasets. The science surrounding sentiment analysis involves natural language processing or linguistic algorithms that assign values to positive, negative or neutral text (converting supposition into monetizable data silos). VRmeta is the ideal method for inputting this data. VRmeta is the tool of choice for broadcasters looking to develop information rich, statistical data silos for any variety of sports. Think team and player performance aggregate, post-game data and deep dive statistic development. "Great content without accurate metadata is, after all, a missed opportunity"

Open Data Innovation: Building on Open Data Sets for Innovative Applications An overarching conversation on open data innovation. The session highlights how democratizing access to information drives innovation and greater impact. Learn how organizations are using the cloud to gather data and discover insights to foster innovation, improve service delivery and address big societal problems. As data becomes more widely available (GIS, weather, research), having access to scalable technology and the multiple data sources that can feed into the technology solution can help create solutions for significant problems in the world. This session highlights real-world examples of how open data is enabling transformative innovation. Explore how the new Landsat open data set on AWS is spurring innovation among public and private entities or delivering applications to citizens and users.

Question Answering Beyond SQuAD: Larger Datasets and New Domains, with Branden Chan, deepset.ai Branden Chan, an NLP Engineer at deepset.ai in Berlin, presents on Question Answering Beyond SQuAD: Larger Datasets and New Domains in an online program, May 26, 2020, organized and moderated by Seth Grimes for the New York Natural Language Processing meetup (https://www.meetup.com/NY-NLP) and partners.

Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight (formerly CrowdFlower) AI - build a state-of-the-art machine learning model trained with human labeled data

@@ Line 107: / Line 107: @@
 ||
 <youtube>jYvBmJo7qjc</youtube>
-<b>HH1
+<b>"ImageNet: Where Have We Been? Where Are We Going?" with [[Creatives#Fei-Fei Li |Fei-Fei Li]]
-</b><br>BB1
+</b><br>Date: 9/21/2017   It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey for AI and computer vision is about half of a century. In this talk, [[Creatives#Fei-Fei Li |Dr. Li]] will briefly discuss the key ideas and the cutting edge advances in the quest for visual intelligences in computers, focusing on work done to develop ImageNet over the years.  [[Creatives#Fei-Fei Li |Fei-Fei Li]] is currently on sabbatical as the Chief Scientist of AI/ML at Google Cloud. She is an Associate Professor in the Computer Science Department at Stanford, and the Director of the Stanford Artificial Intelligence Lab. Her main research areas are in machine learning, deep learning, computer vision, and cognitive and computational neuroscience. She has published more than 150 scientific articles in top-tier journals and conferences, including Nature, PNAS, Journal of Neuroscience, CVPR, ICCV, NIPS, ECCV, IJCV, IEEE-PAMI, etc. Li obtained her B.A. degree in physics from Princeton with High Honors, and her Ph.D. degree in electrical engineering from the California Institute of Technology (Caltech). She joined Stanford in 2009 as an assistant professor, and was promoted to associate professor with tenure in 2012.
 |}
 |<!-- M -->
@@ Line 115: / Line 115: @@
 ||
 <youtube>dGM1mgkIayY</youtube>
-<b>HH2
+<b>GCP Public Datasets Program: Share and analyze large-scale global datasets ([[Google]] Cloud Next '17)
-</b><br>BB2
+</b><br>Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by [[Google]] Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets.
+Missed the conference? Watch all the talks here: http://goo.gl/c1Vs3h  Watch more talks about Big Data & Machine Learning here: http://goo.gl/OcqI9k
 |}
 |}<!-- B -->
@@ Line 124: / Line 125: @@
 ||
 <youtube>tTjZqz6qk1s</youtube>
-<b>HH3
+<b>P2 How to download a Kaggle dataset & Install Numpy, Pandas, and more - Multiple Linear Regression
-</b><br>BB3
+</b><br>What’s up yall! We are back again. How was your weekend? After yesterday's introductory episode we are jumping straight in to the nitty gritty of multiple linear regression. But first, let's do some preparation.
 |}
 |<!-- M -->
@@ Line 132: / Line 133: @@
 ||
 <youtube>uG5dAhNQgDU</youtube>
-<b>HH4
+<b>Accessing public datasets on [[Amazon]] S3 using Globus
-</b><br>BB4
+</b><br>Demonstrates how you can easily access and download big datasets from public repositories using Globus for [[Amazon]] S3
 |}
 |}<!-- B -->
@@ Line 141: / Line 142: @@
 ||
 <youtube>Y33TviLMBFY</youtube>
-<b>HH5
+<b>AWS re:Invent 2017: Migrating Databases and Data Warehouses to the Cloud: Getting St (DAT317)
-</b><br>BB5
+</b><br>In this introductory session, we look at how to convert and migrate your commercial databases and data warehouses to the cloud and gain your database freedom. [[Amazon]] AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) have been used to migrate tens of thousands of databases. These include Oracle and SQL Server to [[Amazon]] Aurora, Teradata and Netezza to Amazon Redshift, MongoDB to [[Amazon]] DynamoDB, and many other data source and target combinations. Learn how to easily and securely migrate your data and procedural code, enjoy flexibility and cost savings, and gain new opportunities.
 |}
 |<!-- M -->
@@ Line 149: / Line 150: @@
 ||
 <youtube>mLOIPTbgqUg</youtube>
-<b>HH6
+<b>Joining Datasets | Intro to Azure ML Part 6
-</b><br>BB6
+</b><br>Last time we prepared our dataset for a join. In this video we’ll use the join data module inside of Azure ML to cross reference each airport id with the airport table to find airport city, airport state, and airport name. We will briefly go over the different types of joins, then combine the three tables together. Each time we join we will add 3 columns to our dataset.
 |}
 |}<!-- B -->
@@ Line 158: / Line 159: @@
 ||
 <youtube>FDouW7fSIms</youtube>
-<b>HH7
+<b>Deep learning idea for creating datasets
-</b><br>BB7
+</b><br>An idea to easily take snapshots or crops of images to break larger images into nice labled images for a database
 |}
 |<!-- M -->
@@ Line 166: / Line 167: @@
 ||
 <youtube>av8zkywXHSM</youtube>
-<b>HH8
+<b>VRmeta: Generating AI datasets one precise meta-tag at a time
-</b><br>BB8
+</b><br>VRmeta is the world's most precise means of adding time-based descriptive metadata to both digital and immersive video. Whether the goal is to create unmatched discoverability for your entire video library, leverage metrics from all that amazing content or license those clips to increase their inbound revenue - VRmeta makes it happen Today’s consumers have more choices than ever before for video entertainment and viewing platforms  With this explosion of choice has come complexity. Finding engaging entertainment has become a time consuming and frustrating, resulting in declining engagement and viewer satisfaction  The key to overcoming this discovery challenge lies in rich, time-based descriptive metadata VRmeta is your gateway to making this happen: VRmeta's patent-pending cross-hair and tactile navigation technology gives users the most precise means of applying metadata ever created VRmeta gives every clip it touches time and in-frame location data registered with in and out points, all saved into .csv  and .xmp sidecar files VRmeta delivers AI precision now. VRmeta even learns your tagging vocabulary, offering users auto-completion for frequently used words and names  By applying time-based descriptive metadata at the production level, stakeholders create additional value at every stage of the video content lifecycle  VRmeta stands firmly at the nexus of artificial intelligence and healthcare, and is a recognized state-of-the-art solution central to the development of emotional AI datasets  The science surrounding sentiment analysis involves natural language processing or linguistic algorithms that assign values to positive, negative or neutral text (converting supposition into monetizable data silos).  VRmeta is the ideal method for inputting this data  VRmeta is the tool of choice for broadcasters looking to develop information rich, statistical data silos for any variety of sports. 
Think team and player performance aggregate, post-game data and deep dive statistic development "Great content without accurate metadata is, after all, a missed opportunity"
 |}
 |}<!-- B -->
@@ Line 175: / Line 176: @@
 ||
 <youtube>WhYrLA-fOK0</youtube>
-<b>HH9
+<b>Open Data Innovation: Building on Open Data Sets for Innovative Applications
-</b><br>BB9
+</b><br>An overarching conversation on open data innovation. The session highlights how democratizing access to information drives innovation and greater impact. Learn how organizations are using the cloud to gather data and discover insights to foster innovation, improve service delivery and address big societal problems. As data becomes more widely available (GIS, weather, research), having access to scalable technology and the multiple data sources that can feed into the technology solution can help create solutions for significant problems in the world. This session highlights real-world examples of how open data is enabling transformative innovation. Explore how the new Landsat open data set on AWS is spurring innovation among public and private entities or delivering applications to citizens and users.
 |}
 |<!-- M -->
