[https://www.youtube.com/results?search_query=ai+computer+vision+video YouTube]
[https://www.quora.com/search?q=ai%20computer%20vision%20video ... Quora]
[https://www.google.com/search?q=ai+computer+vision+video ...Google search]
[https://news.google.com/search?q=ai+computer+vision+video ...Google News]
[https://www.bing.com/news/search?q=ai+computer+vision+video&qft=interval%3d%228%22 ...Bing News]
* [[Video/Image]] ... [[Vision]] ... [[Enhancement]] ... [[Fake]] ... [[Reconstruction]] ... [[Colorize]] ... [[Occlusions]] ... [[Predict image]] ... [[Image/Video Transfer Learning]] ... [[Art]] ... [[Photography]]
* [[End-to-End Speech]] ... [[Synthesize Speech]] ... [[Speech Recognition]] ... [[Music]]
* [[Robotics]] ... [[Transportation (Autonomous Vehicles)|Vehicles]] ... [[Autonomous Drones|Drones]] ... [[3D Model]] ... [[Point Cloud]]
* [[Simulation]] ... [[Simulated Environment Learning]] ... [[World Models]] ... [[Minecraft]]: [[Minecraft#Voyager|Voyager]]
* [[Case Studies]]
** [[Agriculture]]
** [[Healthcare]]
** [[Astronomy]]
** [[Supply Chain]] ... [[Supply Chain#Logistics|Logistics]] ... [[Supply Chain#Warehousing|Warehousing]] ... [[Supply Chain#Retail|Retail]]
* [[Cybersecurity]] ... [[Open-Source Intelligence - OSINT |OSINT]] ... [[Cybersecurity Frameworks, Architectures & Roadmaps | Frameworks]] ... [[Cybersecurity References|References]] ... [[Offense - Adversarial Threats/Attacks| Offense]] ... [[National Institute of Standards and Technology (NIST)|NIST]] ... [[U.S. Department of Homeland Security (DHS)| DHS]] ... [[Screening; Passenger, Luggage, & Cargo|Screening]] ... [[Law Enforcement]] ... [[Government Services|Government]] ... [[Defense]] ... [[Joint Capabilities Integration and Development System (JCIDS)#Cybersecurity & Acquisition Lifecycle Integration| Lifecycle Integration]] ... [[Cybersecurity Companies/Products|Products]] ... [[Cybersecurity: Evaluating & Selling|Evaluating]]
* [[Bird Identification]]
* [[Predict image]]
* [[DeepLens - deep learning enabled video camera]]
** [[Polly]]
** [[Rekognition Video]]
* [[Deep Learning (DL) Amazon Machine Image (AMI) - DLAMI]]
* [[Image Classification]]
* [[Image-to-Image Translation]]
* [[Landing AI]] ... LandingLens™, an enterprise AIOps/MLOps platform that offers to build, iterate, and operationalize AI powered visual inspection solutions for manufacturers
* Object Detection Using Convolutional Neural Networks | The Straight Dope
* Object detection with neural networks — a simple tutorial using keras | Johannes Rieke - Towards Data Science
* [http://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html Object Localization and Detection | leonardoaraujosantos]
* [https://analyticsindiamag.com/why-tesla-invented-a-new-neural-network/ Why Tesla Invented A New Neural Network | Ambika Choudhury - Analytics India Magazine]
* [https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation Introducing Segment Anything: Working toward the first foundation model for image segmentation | Meta AI] ... identifying which image pixels belong to an object

Computer vision develops algorithms and techniques that enable computers to interpret, analyze, and understand visual data, perceiving digital images and videos in a way that approximates human vision. It covers tasks such as object detection, image recognition, segmentation, tracking, and scene reconstruction, and is applied in robotics, autonomous vehicles, security and surveillance, medical imaging, and augmented reality.
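
In practice, most of these tasks start from a pretrained network. A minimal image-recognition sketch, assuming PyTorch/torchvision are installed and using a placeholder photo.jpg (the normalization constants are the standard ImageNet statistics):

<syntaxhighlight lang="python">
import torch
from PIL import Image
from torchvision import models, transforms

# Off-the-shelf ImageNet classifier; weights="DEFAULT" needs torchvision >= 0.13.
model = models.resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),  # ImageNet channel stds
])

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    probs = model(image).softmax(dim=1)
print(probs.argmax(dim=1).item())  # index of the predicted ImageNet class
</syntaxhighlight>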

= Interpret, Analyze, & Understand =
<youtube>vecXPWlRaLw</youtube>
<youtube>l2ttxMVFM3s</youtube>
<youtube>o2Z8T0D5f3I</youtube>
+ | |||
+ | = <span id="Vision Transformers (ViT)"></span>Vision Transformers (ViT) = | ||
+ | [https://www.youtube.com/results?search_query=ai+Vision+Transformer+ViT YouTube] | ||
+ | [https://www.quora.com/search?q=ai%20X...X%20X...X%20X...X%20X...X%20X...X%20X...X%20X...X%20X...X ... Quora] | ||
+ | [https://www.google.com/search?q=ai+Vision+Transformer+ViT ...Google search] | ||
+ | [https://news.google.com/search?q=ai+Vision+Transformer+ViT ...Google News] | ||
+ | [https://www.bing.com/news/search?q=ai+Vision+Transformer+ViT&qft=interval%3d%228%22 ...Bing News] | ||
+ | |||
* [https://arxiv.org/abs/2010.11929 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale]
* [https://en.wikipedia.org/wiki/Vision_transformer Vision transformer - Wikipedia]
* [[BLIP-2#Code a VISION - LLM w/ ViT, FLAN-T5 LLM & BLIP-2: Multimodal LLMs (MLLM) | Code a VISION - LLM w/ ViT, FLAN-T5 LLM & BLIP-2: Multimodal LLMs (MLLM)]]
* [https://huggingface.co/docs/transformers/model_doc/vit Vision Transformer (ViT)] | [[Hugging Face]]
* [https://github.com/google-research/vision_transformer google-research/vision_transformer - GitHub]
* [https://theaisummer.com/vision-transformer/ How the Vision Transformer (ViT) works in 10 minutes: an image is worth 16x16 words | Nikolas Adaloglou - AI Summer]
+ | |||
+ | ViT is a transformer that is targeted at vision processing tasks such as image recognition. It was first proposed in 2019 by Cordonnier et al. and later empirically evaluated more extensively in the well-known paper "An image is worth 16x16 words". ViT works by breaking down input images into a series of patches which, once transformed into vectors, are seen as words in a normal transformer. Each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be used for classification. The authors also add absolute position embeddings and feed the resulting sequence of vectors to a standard Transformer encoder. A [CLS] token is a special token that is used in classification tasks. It stands for “classification” and is used as the only input of the final MLP Head as it has been influenced by all the others. In the case of ViT, it is added to serve as representation of an entire image. The final MLP Head refers to the final Multi-Layer Perceptron (MLP) layer in the model. It takes the [CLS] token as input and outputs the final classification result. | ||
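
A minimal sketch of that pipeline, assuming PyTorch; the class name, sizes, and zero-initializations are illustrative choices, and a faithful ViT additionally uses GELU MLP blocks, pre-norm layers, and pretrained weights:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Patch embedding + [CLS] token + position embeddings + encoder + MLP head."""
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 x 14 = 196
        # A stride-16 convolution splits the image into 16x16 patches and
        # linearly embeds each one in a single step.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # absolute positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)        # standard encoder
        self.mlp_head = nn.Linear(dim, num_classes)                          # reads [CLS] only

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.patch_embed(x)              # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, dim): patches as "words"
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend [CLS], add positions
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])        # classify from the [CLS] position

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
</syntaxhighlight>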
+ | |||
+ | |||
+ | <img src="https://media.giphy.com/media/ATsWtUsuuFRfq8OhZ7/source.gif" width="800"> | ||
+ | |||
+ | |||
<youtube>TrdevFK_am4</youtube>
<youtube>qU7wO02urYU</youtube>
<youtube>r88L_yLJ4CE</youtube>
<youtube>J-utjBdLCTo</youtube>
<youtube>DVoHvmww2lQ</youtube>
<youtube>HZ4j_U3FC94</youtube>
<youtube>i2_zJ0ANrw0</youtube>
<youtube>hPb6A92LROc</youtube>

= Image Retrieval / Object Detection =
* Feature:
* [[Character Recognition]]
* [[(Deep) Convolutional Neural Network (DCNN/CNN)]]
* [[ResNet-50]]
* Getting Started & Project: Object Detection

<youtube>MoMjIwGSFVQ</youtube>
<youtube>Rgpfk6eYxJA</youtube>
<youtube>eDIj5LuIL4A</youtube>
<youtube>nG3tT31nPmQ</youtube>

== <span id="Segment Anything Model (SAM)"></span>Segment Anything Model (SAM) ==
[https://www.youtube.com/results?search_query=Segment+Anything+Model+SAM YouTube]
[https://www.quora.com/search?q=Segment%20Anything%20Model%20SAM ... Quora]
[https://www.google.com/search?q=Segment+Anything+Model+SAM ...Google search]
[https://news.google.com/search?q=Segment+Anything+Model+SAM ...Google News]
[https://www.bing.com/news/search?q=Segment+Anything+Model+SAM&qft=interval%3d%228%22 ...Bing News]

* [https://segment-anything.com/ Segment Anything |] [[Meta]]

The Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B) democratize image segmentation by introducing a new task, dataset, and model. Using an efficient model within a data collection loop, Meta AI researchers constructed the largest segmentation dataset to date, containing over 1 billion masks on 11 million licensed and privacy-respecting images. The model has been purposefully designed and trained to be promptable, enabling zero-shot transfer to new image distributions and tasks. [https://www.infoq.com/news/2023/04/meta-ai-sam/ Meta AI Introduces the Segment Anything Model, a Game-Changing Model for Object Segmentation | Daniel Dominguez - InfoQ]

<youtube>KP0LGE5Qrlw</youtube>
<youtube>D-D6ZmadzPE</youtube>

== Faster Region-based Convolutional Neural Networks (R-CNN), You Only Look Once (YOLO), Single Shot Detector (SSD) ==

<youtube>Gc233mo6r9c</youtube>
<youtube>P8e-G-Mhx4k</youtube>
<youtube>0Z0v8GVbOUg</youtube>
<youtube>C5-SEZ_IvaM</youtube>
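
These families trade accuracy for speed: two-stage R-CNN variants first propose regions and then classify them, while YOLO and SSD predict boxes in a single pass. A minimal inference sketch for the first family, assuming torchvision's COCO-pretrained Faster R-CNN and a random tensor standing in for a decoded photo:

<syntaxhighlight lang="python">
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# COCO-pretrained two-stage detector with a ResNet-50 FPN backbone.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# torchvision detection models take a list of CHW float images in [0, 1].
image = torch.rand(3, 480, 640)  # stand-in for a real decoded photo
with torch.no_grad():
    (pred,) = model([image])

keep = pred["scores"] > 0.8      # confidence threshold
print(pred["boxes"][keep])       # (N, 4) boxes as [x1, y1, x2, y2]
print(pred["labels"][keep])      # COCO category ids
</syntaxhighlight>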

== LiDAR ==