Difference between revisions of "Datasets"

(100 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
[https://www.youtube.com/results?search_query=ai+Data+Datasets YouTube]
 +
[https://www.quora.com/search?q=ai%20Data%20Datasets ... Quora]
 +
[https://www.google.com/search?q=ai+Data+Datasets ...Google search]
 +
[https://news.google.com/search?q=ai+Data+Datasets ...Google News]
 +
[https://www.bing.com/news/search?q=ai+Data+Datasets&qft=interval%3d%228%22 ...Bing News]
 +
 
 +
* [[Data Science]] ... [[Data Governance|Governance]] ... [[Data Preprocessing|Preprocessing]] ... [[Feature Exploration/Learning|Exploration]] ... [[Data Interoperability|Interoperability]] ... [[Algorithm Administration#Master Data Management (MDM)|Master Data Management (MDM)]] ... [[Bias and Variances]] ... [[Benchmarks]] ... [[Datasets]]
 +
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
 +
* [[Data Quality]] ...[[AI Verification and Validation|validity]], [[Evaluation - Measures#Accuracy|accuracy]], [[Data Quality#Data Cleaning|cleaning]], [[Data Quality#Data Completeness|completeness]], [[Data Quality#Data Consistency|consistency]], [[Data Quality#Data Encoding|encoding]], [[Data Quality#Zero Padding|padding]], [[Data Quality#Data Augmentation, Data Labeling, and Auto-Tagging|augmentation, labeling, auto-tagging]], [[Data Quality#Batch Norm(alization) & Standardization| normalization, standardization]], and [[Data Quality#Imbalanced Data|imbalanced data]]
 +
* [[Risk, Compliance and Regulation]] ... [[Ethics]] ... [[Privacy]] ... [[Law]] ... [[AI Governance]] ... [[AI Verification and Validation]]
 +
* [[Natural Language Processing (NLP)#Managed Vocabularies |Managed Vocabularies]]
 +
* [[Analytics]] ... [[Visualization]] ... [[Graphical Tools for Modeling AI Components|Graphical Tools]] ... [[Diagrams for Business Analysis|Diagrams]] & [[Generative AI for Business Analysis|Business Analysis]] ... [[Requirements Management|Requirements]] ... [[Loop]] ... [[Bayes]] ... [[Network Pattern]]
 +
* [[Development]] ... [[Notebooks]] ... [[Development#AI Pair Programming Tools|AI Pair Programming]] ... [[Codeless Options, Code Generators, Drag n' Drop|Codeless]] ... [[Hugging Face]] ... [[Algorithm Administration#AIOps/MLOps|AIOps/MLOps]] ... [[Platforms: AI/Machine Learning as a Service (AIaaS/MLaaS)|AIaaS/MLaaS]]
 +
** [[Google Facets| Facets]] | [[Google]]...contains two robust [[Visualization]]s to aid in understanding and analyzing machine learning datasets.
 +
* [[Algorithm Administration#Hyperparameter|Hyperparameter]]s
 +
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
 +
* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Train, Validate, and Test]]
 +
* [https://www.openml.org/search?type=data OpenML datasets]
 +
* [https://pathmind.com/wiki/datasets-ml Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind]
 +
* [https://paperswithcode.com/paper/towards-automatic-threat-detection-a-survey/review/ Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK]
 +
 
 +
Datasets (often in combination with algorithms) are becoming more important in their own right and can sometimes be seen as the primary intellectual output of the research. The revelations about [https://news.google.com/topics/CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZqYkd4cWMyMDNOQklDWlc0b0FBUAE Cambridge Analytica] highlight the importance of datasets and data collection. See also: [[Privacy]]
 +
 +
== Sources ==
 +
* [https://mlcommons.org/en/ MLCommons] ...[https://techcrunch.com/2020/12/03/mlcommons-debuts-first-public-database-for-ai-researchers-with-86000-hours-of-speech/ MLCommons debuts with public 86,000-hour speech data set for AI researchers | Devin Coldewey - TechCrunch]
 +
* [https://quac.ai/ Question Answering in Context (QuAC)] ...Question Answering in [[context]] for modeling, understanding, and participating in information seeking dialog.
 +
* [https://tatoeba.org/eng Tatoeba] a collection of sentences and translations - [https://www.manythings.org/anki/ Tab-delimited Bilingual Sentence Pairs]
 +
* [https://www.kaggle.com/datasets Kaggle Datasets]
 +
* [https://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)]  ...[[COVID-19]]
 +
* [https://archive.ics.uci.edu/ml/datasets.php?format=&task=reg&att=num&area=&numAtt=10to100&numIns=&type=&sort=typeUp&view=list UC Irvine Machine Learning Repository]
 +
* [https://yann.lecun.com/exdb/mnist/ MNIST database]
 +
* [https://datahub.io/collections Collections | DataHub]
 +
* [https://registry.opendata.aws/ Registry of Open Data on AWS | Amazon]
 +
* [https://www.google.com/publicdata/directory Public Data | Google]
 +
* [https://cloud.google.com/bigquery/public-data/ BigQuery public datasets | Google]
 +
* [https://storage.googleapis.com/openimages/web/index.html Open Images | Google]
 +
* [https://www.microsoft.com/en-us/research/academic-program/data-science-microsoft-research/ Data Science for Research | Microsoft]
 +
* [https://www.kdnuggets.com/datasets/index.html Datasets for Data Mining and Data Science | KDnuggets]
 +
* [https://public.enigma.com/ Enigma Public]
 +
* [https://dataportals.org/  A Comprehensive List of Open Data Portals from Around the World | DataPortals.org]
 +
* [https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/ OpenDataSoft]
 +
* [https://knoema.com/atlas/sources World Data Atlas | Knoema]
 +
* [https://www.openml.org/search?type=data The Open Machine Learning project | OpenML.org]
 +
* [https://www.researchpipeline.com/mediawiki/index.php?title=Main_Page World's Free Online Data | Research Pipeline]
 +
* [https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research List of datasets for machine learning research | Wikipedia]
 +
* [https://resources.wolframcloud.com/NeuralNetRepository Neural Net Repository | Wolfram]
 +
* [https://deeplearning4j.org/opendata Open Data for Deep Learning & Machine Learning | 4j]
 +
* [https://catalog.data.gov/dataset Data Catalog | Data.gov]
 +
* [https://github.com/timzhang642/3D-Machine-Learning#datasets 3D-Machine-Learning | GitHub]
 +
** [https://github.com/timzhang642/3D-Machine-Learning#3d_models 3D Models]
 +
** [https://github.com/timzhang642/3D-Machine-Learning#3d_scenes 3D Scenes]
 +
* [https://www.usgs.gov/news/us-geological-survey-and-us-department-energy-release-online-public-dataset-and-viewer-us-wind Wind Turbine Map and Database | USGS & DOE]
 +
* [https://isogg.org/wiki/Autosomal_DNA_testing_comparison_chart Autosomal DNA]
 +
* [https://host.robots.ox.ac.uk/pascal/VOC Pascal Visual Object Classes Challenge (VOC)]
 +
* [https://open.nasa.gov/ OpenNASA]
 +
* [https://kelvins.esa.int/collision-avoidance-challenge/data/ Data: Close encounters between two objects] | [https://www.esa.int/ European Space Agency (ESA)]
 +
* [https://lib.stat.cmu.edu/jasadata/  JASA Data Archive | Journal of the American Statistical Association]
 +
* [https://lib.stat.cmu.edu/datasets/ Datasets Archive | Journal of the American Statistical Association]
 +
* [https://data.world/ Data.World]
 +
* [https://archive.org/details/datasets The Dataset Collection | Archive.org]
 +
* [https://www.archive-it.org/explore?show=Collections Collections | Archive-it.org]
 +
* [https://ec.europa.eu/eurostat/data/database Eurostat | EU statistical office]
 +
* [https://www.re3data.org/ Re3data]
 +
* [https://fairsharing.org/ Resource on data and metadata standards - open research data | FAIRsharing]
 +
* [https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/ List of Public Data Sources Fit for Machine Learning | bigml]
 +
* [https://skymind.ai/wiki/open-datasets Open Datasets | Skymind]
 +
* [https://apps.who.int/gho/data/node.resources Global Health Observatory resources | World Health Organization (WHO)]
 +
* [https://wonder.cdc.gov/Welcome.html CDC WONDER | Centers for Disease Control and Prevention (CDC)]
 +
* [https://data.medicare.gov/ US health insurance program | Medicare]
 +
* [https://data.imf.org International economy | International Monetary Fund (IMF)]
 +
* [https://datacatalog.worldbank.org/search/datasets Data Catalog | The World Bank]
 +
* [https://www.quandl.com/ Financial and economic data | Quandl]
 +
** [https://www.quandl.com/alternative-data Alternative data | Quandl]
 +
* [https://github.com/awesomedata/awesome-public-datasets#publicdomains PublicDomains | GitHub]
 +
* [https://github.com/BuzzFeedNews/everything datasets and related content | BuzzFeed - GitHub]
 +
* [https://data.fivethirtyeight.com/ Sports, politics, economics, and other spheres of life | FiveThirtyEight]
 +
* [https://github.com/endgameinc/ember EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub]
 +
* [https://www.reddit.com/r/datasets/ r/datasets | reddit]
 +
* [https://www.microsoft.com/en-us/download/details.aspx?id=55594&WT.mc_id=rss_alldownloads_all Microsoft Information-Seeking Conversation (MISC)] - audio and video signals; transcripts of conversation
 +
* [https://www.clips.uantwerpen.be/conll2003/ner/ Language-Independent Named Entity Recognition (II)]
 +
* [https://www.robots.ox.ac.uk/~vgg/data/vgg_face/ VGG | Oxford]
 +
* [https://challenge2019.perfectcorp.com/ Perfect-500K] beauty and personal care
 +
* [https://voice.mozilla.org/en Mozilla’s Common Voice project] collects recordings of human voices
 +
* [https://www.cs.toronto.edu/~kriz/cifar.html CIFAR-10 and CIFAR-100] are labeled subsets of the 80 million tiny images dataset | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research
 +
 
 +
== Networks ==
 +
* [[Bidirectional Encoder Representations from Transformers (BERT)]]
 +
* [[ResNet-50]]
 +
* [https://en.wikipedia.org/wiki/ImageNet ImageNet | Wikipedia]
 +
* [https://en.wikipedia.org/wiki/AlexNet AlexNet | Wikipedia]
 +
* [https://wordnet.princeton.edu/ WordNet]
 +
 
 +
== Articles ==
 +
* [https://www.forbes.com/sites/korihale/2019/06/25/microsoft-scraps-10-million-facial-recognition-photos-on-the-low/#6672d61949f2 Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale - Forbes]
 +
* [https://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/  The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI]
 +
* [https://medium.com/datadriveninvestor/the-50-best-public-datasets-for-machine-learning-d80e9f030279 The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium] 
 +
* [https://www.altexsoft.com/blog/datascience/best-public-machine-learning-datasets/ Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft]
 +
* [https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | Pranav Dar - Analytics Vidhya]
 +
 
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>jYvBmJo7qjc</youtube>
 +
<b>"ImageNet: Where Have We Been? Where Are We Going?" with [[Creatives#Fei-Fei Li |Fei-Fei Li]]
 +
</b><br>Date: 9/21/2017  It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey for AI and computer vision is about half of a century. In this talk, [[Creatives#Fei-Fei Li |Dr. Li]] will briefly discuss the key ideas and the cutting edge advances in the quest for visual intelligences in computers, focusing on work done to develop ImageNet over the years.  [[Creatives#Fei-Fei Li |Fei-Fei Li]] is currently on sabbatical as the Chief Scientist of AI/ML at Google Cloud. She is an Associate Professor in the Computer Science Department at Stanford, and the Director of the Stanford Artificial Intelligence Lab. Her main research areas are in machine learning, deep learning, computer vision, and cognitive and computational neuroscience. She has published more than 150 scientific articles in top-tier journals and conferences, including Nature, PNAS, Journal of Neuroscience, CVPR, ICCV, NIPS, ECCV, IJCV, IEEE-PAMI, etc. Li obtained her B.A. degree in physics from Princeton with High Honors, and her Ph.D. degree in electrical engineering from the California Institute of Technology (Caltech). She joined Stanford in 2009 as an [[Assistants|assistant]] professor, and was promoted to associate professor with tenure in 2012. 
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>dGM1mgkIayY</youtube>
 +
<b>GCP Public Datasets Program: Share and analyze large-scale global datasets ([[Google]] Cloud Next '17)
 +
</b><br>Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by [[Google]] Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets. 
 +
Missed the conference? Watch all the talks here: https://goo.gl/c1Vs3h  Watch more talks about Big Data & Machine Learning here: https://goo.gl/OcqI9k
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>tTjZqz6qk1s</youtube>
 +
<b>P2 How to download a Kaggle dataset & Install Numpy, Pandas, and more - Multiple Linear Regression
 +
</b><br>What’s up, y'all! We are back again. How was your weekend? After yesterday's introductory episode we are jumping straight into the nitty-gritty of multiple linear regression. But first, let's do some preparation.
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>uG5dAhNQgDU</youtube>
 +
<b>Accessing public datasets on [[Amazon]] S3 using Globus
 +
</b><br>Demonstrates how you can easily access and download big datasets from public repositories using Globus for [[Amazon]] S3
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>Y33TviLMBFY</youtube>
 +
<b>AWS re:Invent 2017: Migrating [[Database]]s and Data Warehouses to the Cloud: Getting St (DAT317)
 +
</b><br>In this introductory session, we look at how to convert and migrate your commercial [[database]]s and data warehouses to the cloud and gain your [[database]] freedom. [[Amazon]] AWS [[Database]] Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) have been used to migrate tens of thousands of [[database]]s. These include Oracle and SQL Server to [[Amazon]] Aurora, Teradata and Netezza to Amazon Redshift, MongoDB to [[Amazon]] DynamoDB, and many other data source and target combinations. Learn how to easily and securely migrate your data and procedural code, enjoy flexibility and cost savings, and gain new opportunities.
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>mLOIPTbgqUg</youtube>
 +
<b>Joining Datasets | Intro to Azure ML Part 6
 +
</b><br>Last time we prepared our dataset for a join. In this video we’ll use the join data module inside of Azure ML to cross reference each airport id with the airport table to find airport city, airport state, and airport name. We will briefly go over the different types of joins, then combine the three tables together. Each time we join we will add 3 columns to our dataset.
 +
|}
 +
|}<!-- B -->
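The join described in the "Joining Datasets | Intro to Azure ML Part 6" entry above can also be sketched outside Azure ML. The following is a minimal plain-Python illustration of a left outer join that cross-references an airport id against an airport table and adds three columns (city, state, name). It does not use the Azure ML join data module; the sample rows, column names, and the <code>left_join</code> helper are hypothetical and are not taken from the video.

```python
# Hypothetical sample data: the IDs, names, and column names below are
# illustrative assumptions, not values from the video.
airports = {
    10397: {"city": "Atlanta", "state": "GA", "name": "Hartsfield-Jackson Atlanta International"},
    12478: {"city": "New York", "state": "NY", "name": "John F. Kennedy International"},
}

flights = [
    {"flight": "A1", "origin_airport_id": 10397},
    {"flight": "B2", "origin_airport_id": 12478},
    {"flight": "C3", "origin_airport_id": 99999},  # no match in the airport table
]


def left_join(rows, lookup, key, prefix):
    """Left outer join: keep every row and add three lookup columns.

    Rows whose key is missing from the lookup table get None in the new columns.
    """
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        enriched = dict(row)  # copy so the input rows stay unchanged
        for column in ("city", "state", "name"):
            enriched[f"{prefix}_{column}"] = match.get(column)
        joined.append(enriched)
    return joined


joined_flights = left_join(flights, airports, "origin_airport_id", "origin")
for row in joined_flights:
    print(row)
```

An inner join would instead drop the unmatched row; the left outer join is used here so that missing airport ids stay visible as empty values.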
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 
<youtube>FDouW7fSIms</youtube>
 +
<b>Deep learning idea for creating datasets
 +
</b><br>An idea for easily taking snapshots or crops of images to break larger images into neatly labeled images for a [[database]]
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>KoA1lVRwHrc</youtube>
 +
<b>ML #8 - Open Healthcare Datasets
 +
</b><br>Many people want healthcare data to play with, but don't know where to find it. In this chat we'll provide you the data resources you need to start doing machine learning.
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>WhYrLA-fOK0</youtube>
 +
<b>Open Data Innovation: Building on Open Data Sets for Innovative Applications
 +
</b><br>An overarching conversation on open data innovation. The session highlights how democratizing access to information drives innovation and greater impact. Learn how organizations are using the cloud to gather data and discover insights to foster innovation, improve service delivery and address big societal problems. As data becomes more widely available (GIS, weather, research), having access to scalable technology and the multiple data sources that can feed into the technology solution can help create solutions for significant problems in the world. This session highlights real-world examples of how open data is enabling transformative innovation. Explore how the new Landsat open data set on AWS is spurring innovation among public and private entities or delivering applications to citizens and users.
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>av8zkywXHSM</youtube>
 +
<b>VRmeta: Generating AI datasets one precise meta-tag at a time
 +
</b><br>VRmeta is the world's most precise means of adding time-based descriptive metadata to both digital and immersive video. Whether the goal is to create unmatched discoverability for your entire video library, leverage metrics from all that amazing content, or license those clips to increase their inbound revenue, VRmeta makes it happen. Today’s consumers have more choices than ever before for video entertainment and viewing platforms. With this explosion of choice has come complexity. Finding engaging entertainment has become time-consuming and frustrating, resulting in declining engagement and viewer satisfaction. The key to overcoming this discovery challenge lies in rich, time-based descriptive metadata. VRmeta is your gateway to making this happen:
* VRmeta's patent-pending cross-hair and tactile navigation technology gives users the most precise means of applying metadata ever created.
* VRmeta gives every clip it touches time and in-frame location data registered with in and out points, all saved into .csv and .xmp sidecar files.
* VRmeta delivers AI precision now. VRmeta even learns your tagging vocabulary, offering users auto-completion for frequently used words and names.
* By applying time-based descriptive metadata at the production level, stakeholders create additional value at every stage of the video content lifecycle.
* VRmeta stands firmly at the nexus of artificial intelligence and healthcare, and is a recognized state-of-the-art solution central to the [[development]] of emotional AI datasets.
* The science surrounding [[Sentiment Analysis]] involves natural language processing or linguistic algorithms that assign values to positive, negative or neutral text (converting supposition into monetizable data silos). VRmeta is the ideal method for inputting this data.
* VRmeta is the tool of choice for broadcasters looking to develop information-rich, statistical data silos for any variety of sports. Think team and player performance aggregate, post-game data and deep-dive statistic [[development]].
"Great content without accurate metadata is, after all, a missed opportunity."
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>M1WQxTofGe8</youtube>
 +
<b>Open Data Innovation: Building on Open Data Sets for Innovative Applications
 +
</b><br>An overarching conversation on open data innovation. The session highlights how democratizing access to information drives innovation and greater impact. Learn how organizations are using the cloud to gather data and discover insights to foster innovation, improve service delivery and address big societal problems. As data becomes more widely available (GIS, weather, research), having access to scalable technology and the multiple data sources that can feed into the technology solution can help create solutions for significant problems in the world. This session highlights real-world examples of how open data is enabling transformative innovation. Explore how the new Landsat open data set on AWS is spurring innovation among public and private entities or delivering applications to citizens and users.
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>E80qHThomok</youtube>
 +
<b>Question Answering Beyond SQuAD: Larger Datasets and New Domains, with Branden Chan, deepset.ai
 +
</b><br>Branden Chan, an NLP Engineer at deepset.ai in Berlin, presents on Question Answering Beyond SQuAD: Larger Datasets and New Domains in an online program, May 26, 2020, organized and moderated by Seth Grimes for the New York Natural Language Processing meetup (https://www.meetup.com/NY-NLP) and partners.
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>koiTTim4M-s</youtube>
 +
<b>How to Make Data Amazing - Intro to Deep Learning #5
 +
</b><br>[[Creatives#Siraj Raval|Siraj Raval]]  In this video, we'll go through data preprocessing steps for 3 different datasets. We'll also go in depth on a dimensionality reduction technique called [[Principal Component Analysis (PCA)]].
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>tChcZpBbTTA</youtube>
 +
<b>How to Learn from Little Data - Intro to Deep Learning #17
 +
</b><br>[[Creatives#Siraj Raval|Siraj Raval]]  One-shot learning! In this last weekly video of the course, I'll explain how [[memory]]-augmented neural networks can help achieve one-shot classification for a small labeled image dataset. We'll also go over the architecture of its inspiration (the Neural Turing Machine).
 +
|}
 +
|}<!-- B -->
 +
 +
* [https://www.quora.com/What-are-the-alternatives-to-CrowdFlower Human in the Loop...]
 +
** [https://www.mturk.com/ Amazon Mechanical Turk (MTurk)]  - [https://blog.mturk.com/using-mturk-with-amazon-sagemaker-for-supervised-learning-ml-bc30f94e1c0d Using MTurk with Amazon SageMaker for Supervised Learning (ML)]
 +
** [https://gengo.ai/ Gengo.ai] - high-quality multilingual data with a human touch for machine learning
 +
** [https://visit.figure-eight.com/crowdflower-ai-info-old.html Figure Eight CrowdFlower AI] - build a state-of-the-art machine learning model trained with human labeled data

Latest revision as of 21:31, 26 April 2024
