Difference between revisions of "Datasets"
| Line 107: | Line 107: | ||
|| | || | ||
<youtube>jYvBmJo7qjc</youtube> | <youtube>jYvBmJo7qjc</youtube> | ||
| − | <b> | + | <b>"ImageNet: Where Have We Been? Where Are We Going?" with [[Creatives#Fei-Fei Li |Fei-Fei Li]] |
| − | </b><br> | + | </b><br>Date: 9/21/2017. It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey for AI and computer vision is about half a century old. In this talk, [[Creatives#Fei-Fei Li |Dr. Li]] will briefly discuss the key ideas and the cutting-edge advances in the quest for visual intelligence in computers, focusing on work done to develop ImageNet over the years. [[Creatives#Fei-Fei Li |Fei-Fei Li]] is currently on sabbatical as the Chief Scientist of AI/ML at Google Cloud. She is an Associate Professor in the Computer Science Department at Stanford, and the Director of the Stanford Artificial Intelligence Lab. Her main research areas are in machine learning, deep learning, computer vision, and cognitive and computational neuroscience. She has published more than 150 scientific articles in top-tier journals and conferences, including Nature, PNAS, Journal of Neuroscience, CVPR, ICCV, NIPS, ECCV, IJCV, IEEE-PAMI, etc. Li obtained her B.A. degree in physics from Princeton with High Honors, and her Ph.D. degree in electrical engineering from the California Institute of Technology (Caltech). She joined Stanford in 2009 as an assistant professor, and was promoted to associate professor with tenure in 2012. |
|} | |} | ||
|<!-- M --> | |<!-- M --> | ||
| Line 115: | Line 115: | ||
|| | || | ||
<youtube>dGM1mgkIayY</youtube> | <youtube>dGM1mgkIayY</youtube> | ||
| − | <b> | + | <b>GCP Public Datasets Program: Share and analyze large-scale global datasets ([[Google]] Cloud Next '17) |
| − | </b><br> | + | </b><br>Publicly available large datasets hold great potential to better the world. In this video, Felipe Hoffa introduces the Public Datasets Program by [[Google]] Cloud Platform. The program gives dataset owners a terrific platform to share their data, so that users across the world can easily leverage these datasets for large-scale analytics. You'll learn how you can participate in this program, whether you want to broadly share your data or hope to glean insights from large public datasets. |
| + | Missed the conference? Watch all the talks here: http://goo.gl/c1Vs3h Watch more talks about Big Data & Machine Learning here: http://goo.gl/OcqI9k | ||
|} | |} | ||
|}<!-- B --> | |}<!-- B --> | ||
| Line 124: | Line 125: | ||
|| | || | ||
<youtube>tTjZqz6qk1s</youtube> | <youtube>tTjZqz6qk1s</youtube> | ||
| − | <b> | + | <b>P2 How to download a Kaggle dataset & Install Numpy, Pandas, and more - Multiple Linear Regression |
| − | </b><br> | + | </b><br>What’s up y'all! We are back again. How was your weekend? After yesterday's introductory episode we are jumping straight into the nitty-gritty of multiple linear regression. But first, let's do some preparation. |
|} | |} | ||
|<!-- M --> | |<!-- M --> | ||
| Line 132: | Line 133: | ||
|| | || | ||
<youtube>uG5dAhNQgDU</youtube> | <youtube>uG5dAhNQgDU</youtube> | ||
| − | <b> | + | <b>Accessing public datasets on [[Amazon]] S3 using Globus |
| − | </b><br> | + | </b><br>Demonstrates how you can easily access and download big datasets from public repositories using Globus for [[Amazon]] S3 |
|} | |} | ||
|}<!-- B --> | |}<!-- B --> | ||
| Line 141: | Line 142: | ||
|| | || | ||
<youtube>Y33TviLMBFY</youtube> | <youtube>Y33TviLMBFY</youtube> | ||
| − | <b> | + | <b>AWS re:Invent 2017: Migrating Databases and Data Warehouses to the Cloud: Getting St (DAT317) |
| − | </b><br> | + | </b><br>In this introductory session, we look at how to convert and migrate your commercial databases and data warehouses to the cloud and gain your database freedom. [[Amazon]] AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) have been used to migrate tens of thousands of databases. These include Oracle and SQL Server to [[Amazon]] Aurora, Teradata and Netezza to Amazon Redshift, MongoDB to [[Amazon]] DynamoDB, and many other data source and target combinations. Learn how to easily and securely migrate your data and procedural code, enjoy flexibility and cost savings, and gain new opportunities. |
|} | |} | ||
|<!-- M --> | |<!-- M --> | ||
| Line 149: | Line 150: | ||
|| | || | ||
<youtube>mLOIPTbgqUg</youtube> | <youtube>mLOIPTbgqUg</youtube> | ||
| − | <b> | + | <b>Joining Datasets | Intro to Azure ML Part 6 |
| − | </b><br> | + | </b><br>Last time we prepared our dataset for a join. In this video we’ll use the join data module inside of Azure ML to cross reference each airport id with the airport table to find airport city, airport state, and airport name. We will briefly go over the different types of joins, then combine the three tables together. Each time we join we will add 3 columns to our dataset. |
|} | |} | ||
|}<!-- B --> | |}<!-- B --> | ||
| Line 158: | Line 159: | ||
|| | || | ||
<youtube>FDouW7fSIms</youtube> | <youtube>FDouW7fSIms</youtube> | ||
| − | <b> | + | <b>Deep learning idea for creating datasets |
| − | </b><br> | + | </b><br>An idea for easily taking snapshots or crops of images, breaking larger images into neatly labeled images for a database |
|} | |} | ||
|<!-- M --> | |<!-- M --> | ||
| Line 166: | Line 167: | ||
|| | || | ||
<youtube>av8zkywXHSM</youtube> | <youtube>av8zkywXHSM</youtube> | ||
| − | <b> | + | <b>VRmeta: Generating AI datasets one precise meta-tag at a time |
| − | </b><br> | + | </b><br>VRmeta is the world's most precise means of adding time-based descriptive metadata to both digital and immersive video. Whether the goal is to create unmatched discoverability for your entire video library, leverage metrics from all that amazing content, or license those clips to increase their inbound revenue, VRmeta makes it happen. Today’s consumers have more choices than ever before for video entertainment and viewing platforms. With this explosion of choice has come complexity. Finding engaging entertainment has become time-consuming and frustrating, resulting in declining engagement and viewer satisfaction. The key to overcoming this discovery challenge lies in rich, time-based descriptive metadata. VRmeta is your gateway to making this happen: VRmeta's patent-pending cross-hair and tactile navigation technology gives users the most precise means of applying metadata ever created. VRmeta gives every clip it touches time and in-frame location data registered with in and out points, all saved into .csv and .xmp sidecar files. VRmeta delivers AI precision now. VRmeta even learns your tagging vocabulary, offering users auto-completion for frequently used words and names. By applying time-based descriptive metadata at the production level, stakeholders create additional value at every stage of the video content lifecycle. VRmeta stands firmly at the nexus of artificial intelligence and healthcare, and is a recognized state-of-the-art solution central to the development of emotional AI datasets. The science surrounding sentiment analysis involves natural language processing or linguistic algorithms that assign values to positive, negative or neutral text (converting supposition into monetizable data silos). VRmeta is the ideal method for inputting this data. VRmeta is the tool of choice for broadcasters looking to develop information-rich statistical data silos for any variety of sports.
Think team and player performance aggregates, post-game data, and deep-dive statistic development. "Great content without accurate metadata is, after all, a missed opportunity." |
|} | |} | ||
|}<!-- B --> | |}<!-- B --> | ||
| Line 175: | Line 176: | ||
|| | || | ||
<youtube>WhYrLA-fOK0</youtube> | <youtube>WhYrLA-fOK0</youtube> | ||
| − | <b> | + | <b>Open Data Innovation: Building on Open Data Sets for Innovative Applications |
| − | </b><br> | + | </b><br>An overarching conversation on open data innovation. The session highlights how democratizing access to information drives innovation and greater impact. Learn how organizations are using the cloud to gather data and discover insights to foster innovation, improve service delivery and address big societal problems. As data becomes more widely available (GIS, weather, research), having access to scalable technology and the multiple data sources that can feed into the technology solution can help create solutions for significant problems in the world. This session highlights real-world examples of how open data is enabling transformative innovation. Explore how the new Landsat open data set on AWS is spurring innovation among public and private entities or delivering applications to citizens and users. |
|} | |} | ||
|<!-- M --> | |<!-- M --> | ||
Revision as of 12:50, 7 September 2020
- AI Governance
- Hyperparameters
- Visualization
- Facets | Google ... contains two robust visualizations to aid in understanding and analyzing machine learning datasets.
- OpenML datasets
- Datasets and Machine Learning | Chris Nicholson - A.I. Wiki pathmind
- Datasets used in deep learning applications within X-ray security imaging | Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging | Samet Akcay and Toby P. Breckon - Durham University, UK
Datasets (often in combination with algorithms) are becoming more important in their own right and can sometimes be seen as the primary intellectual output of research. The revelations about Cambridge Analytica highlight the importance of datasets and data collection. See also: Privacy in Data Science
Sources
- Tatoeba, a collection of sentences and translations - Tab-delimited Bilingual Sentence Pairs
- Kaggle Datasets
- COVID-19 Open Research Dataset (CORD-19) ...COVID-19
- UC Irvine Machine Learning Repository
- MNIST database
- Collections | DataHub
- Registry of Open Data on AWS | Amazon
- Public Data | Google
- BigQuery public datasets | Google
- Open Images | Google
- Data Science for Research | Microsoft
- Datasets for Data Mining and Data Science | KDnuggets
- Enigma Public
- A Comprehensive List of Open Data Portals from Around the World | DataPortals.org
- OpenDataSoft
- World Data Atlas | Knoema
- The Open Machine Learning project | OpenML.org
- World's Free Online Data | Research Pipeline
- List of datasets for machine learning research | Wikipedia
- Neural Net Repository | Wolfram
- Open Data for Deep Learning & Machine Learning | 4j
- Data Catalog | Data.gov
- 3D-Machine-Learning | GitHub
- Wind Turbine Map and Database | USGS & DOE
- Autosomal DNA
- Pascal Visual Object Classes Challenge (VOC)
- OpenNASA
- Data: Close encounters between two objects | European Space Agency (ESA)
- JASA Data Archive | Journal of the American Statistical Association
- Datasets Archive | Journal of the American Statistical Association
- Data.World
- The Dataset Collection | Archive.org
- Collections | Archive-it.org
- Eurostat | EU statistical office
- Re3data
- Resource on data and metadata standards - open research data | FAIRsharing
- List of Public Data Sources Fit for Machine Learning | bigml
- Open Datasets | Skymind
- Global Health Observatory resources | World Health Organization (WHO)
- CDC WONDER | Centers for Disease Control and Prevention (CDC)
- US health insurance program | Medicare
- International economy | International Monetary Fund (IMF)
- Data Catalog | The World Bank
- Financial and economic | Quandl
- PublicDomains | GitHub
- datasets and related content | BuzzFeed - GitHub
- Sports, politics, economics, and other spheres of life | FiveThirtyEight
- EMBER; benign and malicious Windows-portable executable files | Endgame - GitHub
- r/datasets | reddit
- Microsoft Information-Seeking Conversation (MISC) - audio and video signals; transcripts of conversation
- Language-Independent Named Entity Recognition (II)
- VGG | Oxford
- Perfect-500K beauty and personal care
- Mozilla’s Common Voice project, which collects human voices
- CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset | A. Krizhevsky, V. Nair, and G. Hinton - Canadian Institute For Advanced Research
Networks
- Bidirectional Encoder Representations from Transformers (BERT)
- ResNet-50
- ImageNet | Wikipedia
- AlexNet | Wikipedia
- WordNet
Articles
- Microsoft Scraps 10 Million Facial Recognition Photos On The Low | Kori Hale - Forbes
- The 50 Best Free Datasets for Machine Learning | Meiryum Ali - Gengo AI
- The 50 Best Public Datasets for Machine Learning | Stacy Stanford - Medium
- Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice | Altexsoft
- 25 Open Datasets for Deep Learning Every Data Scientist Must Work With | Pranav Dar - Analytics Vidhya
- Human in the Loop...
- Amazon Mechanical Turk (MTurk) - Using MTurk with Amazon SageMaker for Supervised Learning (ML)
- Gengo.ai - high-quality multilingual data with a human touch for machine learning
- Figure Eight CrowdFlower AI - build a state-of-the-art machine learning model trained with human labeled data