YouTube ... Quora ...Google search ...Google News ...Bing News


Computer vision develops algorithms and techniques that enable computers to interpret, analyze, and understand visual data from the world around them, perceiving digital images and video in a way that approximates human vision. It covers tasks such as object detection, image recognition, segmentation, tracking, and scene reconstruction, and is used in a wide range of applications, including robotics, autonomous vehicles, security and surveillance, medical imaging, and augmented reality.

= <span id="Vision Transformers (ViT)"></span>Vision Transformers (ViT) =
 
[https://www.youtube.com/results?search_query=ai+Vision+Transformer+ViT YouTube]
[https://www.quora.com/search?q=ai%20Vision%20Transformer%20ViT ... Quora]
[https://www.google.com/search?q=ai+Vision+Transformer+ViT ...Google search]
[https://news.google.com/search?q=ai+Vision+Transformer+ViT ...Google News]
[https://www.bing.com/news/search?q=ai+Vision+Transformer+ViT&qft=interval%3d%228%22 ...Bing News]

[https://arxiv.org/abs/2010.11929 An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale]
[https://en.wikipedia.org/wiki/Vision_transformer Vision transformer - Wikipedia]
[https://huggingface.co/docs/transformers/model_doc/vit Vision Transformer (ViT)] | [[Hugging Face]]
[https://github.com/google-research/vision_transformer google-research/vision_transformer - GitHub]

A '''Vision Transformer (ViT)''' is a transformer targeted at vision processing tasks such as image recognition. It was first proposed in 2019 by Cordonnier et al. and later evaluated empirically at scale in the well-known paper "An Image is Worth 16x16 Words". ViT works by breaking an input image into a sequence of fixed-size, non-overlapping patches which, once linearly embedded into vectors, are treated like the words fed to a standard transformer. A [CLS] token is prepended to the sequence to serve as a representation of the entire image; "CLS" stands for "classification", and because self-attention lets this token aggregate information from every patch, it is used as the sole input to the final MLP head. Absolute position embeddings are added to the patch embeddings, and the resulting sequence of vectors is fed to a standard Transformer encoder. The final MLP head, the last Multi-Layer Perceptron layer in the model, takes the [CLS] token as input and outputs the classification result.
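
To make the pipeline concrete (patchify → linear embed → prepend [CLS] → add position embeddings → Transformer encoder → MLP head), here is a minimal sketch. PyTorch is an assumption for illustration; the reference implementation in the google-research repository above is JAX/Flax, and the tiny dimensions here are not the paper's ViT-Base configuration.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT classifier: patchify -> embed -> [CLS] + positions -> encoder -> MLP head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # e.g. (224 / 16)^2 = 196 patches
        # Linear patch embedding, implemented as a conv whose stride equals its kernel size,
        # so the patches are non-overlapping
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # learnable [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim)) # absolute positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)       # standard encoder
        self.mlp_head = nn.Linear(dim, num_classes)                         # classification head

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.patch_embed(x)                          # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)                 # (B, num_patches, dim) "words"
        cls = self.cls_token.expand(x.size(0), -1, -1)   # one [CLS] per image
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend [CLS], add positions
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])                    # classify from the [CLS] token only

logits = MiniViT()(torch.randn(2, 3, 224, 224))          # -> shape (2, 10)
</syntaxhighlight>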
  
 
<youtube>TrdevFK_am4</youtube>
<youtube>r88L_yLJ4CE</youtube>
<youtube>J-utjBdLCTo</youtube>
<youtube>DVoHvmww2lQ</youtube>
<youtube>HZ4j_U3FC94</youtube>
<youtube>i2_zJ0ANrw0</youtube>
<youtube>hPb6A92LROc</youtube>
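
The [[Hugging Face]] documentation linked above also ships pretrained checkpoints. A short inference sketch, assuming the transformers and Pillow packages and the public google/vit-base-patch16-224 ImageNet-1k checkpoint (cat.jpg is a hypothetical local file):

<syntaxhighlight lang="python">
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

checkpoint = "google/vit-base-patch16-224"               # public ImageNet-1k checkpoint
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint)

image = Image.open("cat.jpg").convert("RGB")             # hypothetical local image
inputs = processor(images=image, return_tensors="pt")    # resize + normalize to pixel_values
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, 1000) class scores
print(model.config.id2label[logits.argmax(-1).item()])   # predicted ImageNet label
</syntaxhighlight>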
  
 
= Image Retrieval / Object Detection =



Faster Region-based Convolutional Neural Networks (Faster R-CNN), You Only Look Once (YOLO), Single Shot Detector (SSD)
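
As a hedged example of running one of these detectors, the sketch below assumes torchvision's pretrained Faster R-CNN with COCO weights (the weights="DEFAULT" API of torchvision 0.13+); street.jpg is a hypothetical local image:

<syntaxhighlight lang="python">
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # pretrained COCO detector

img = read_image("street.jpg")                   # uint8 tensor, shape (3, H, W)
img = convert_image_dtype(img, torch.float)      # detectors expect floats in [0, 1]
with torch.no_grad():
    pred = model([img])[0]                       # detection models take a list of images

keep = pred["scores"] > 0.8                      # drop low-confidence detections
print(pred["boxes"][keep])                       # (N, 4) boxes as (x1, y1, x2, y2)
print(pred["labels"][keep])                      # COCO category ids
</syntaxhighlight>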

= LiDAR =

YouTube search...