BLIP-2

* [[Prompt Engineering (PE)]]
* [[Foundation Models (FM)]]
* [[Singularity]] ... [[Artificial Consciousness / Sentience|Sentience]] ... [[Artificial General Intelligence (AGI)| AGI]] ... [[Inside Out - Curious Optimistic Reasoning| Curious Reasoning]] ... [[Emergence]] ... [[Moonshots]] ... [[Explainable / Interpretable AI|Explainable AI]] ... [[Algorithm Administration#Automated Learning|Automated Learning]]
* [https://huggingface.co/blog/blip-2 Zero-shot image-to-text generation with BLIP-2 | Maria Khalusova - ] [[Hugging Face]]  ... how to use BLIP-2 for image captioning, prompted image captioning, visual question-answering, and chat-based prompting
* [https://github.com/Vision-CAIR/ChatCaptioner Interactive ChatCaptioner for image and video | Vision-CAIR - GitHub]

Revision as of 01:20, 7 May 2023



BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models (LLMs). It achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods. BLIP-2 bridges the modality gap between the vision and language models with a lightweight Querying Transformer (Q-Former) placed between the two. The Q-Former is the only trainable part of BLIP-2; both the image encoder and the language model remain frozen.

  • Q-Former is a transformer model that consists of two submodules that share the same self-attention layers: an image transformer that interacts with the frozen image encoder for visual feature extraction and a text transformer that can function as both a text encoder and a text decoder.
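
As a rough illustration of how a fixed set of learnable queries can pull information out of frozen image features, the following sketch implements single-head cross-attention in plain Python. It is a toy under stated assumptions, not the BLIP-2 implementation: the function names (`softmax`, `cross_attend`, `random_vectors`), the tiny dimensions, and the omission of projection matrices, self-attention, and feed-forward layers are all simplifications introduced here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Single-head cross-attention: each query attends over all image features.

    queries:        list of num_queries vectors (the trainable part in this toy)
    image_features: list of num_patches vectors (stand-in for a frozen encoder's output)
    Returns (outputs, weights); outputs holds one vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every patch feature.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, feat))
                  for feat in image_features]
        attn = softmax(scores)
        # Attention-weighted average of the patch features.
        out = [sum(a * feat[d] for a, feat in zip(attn, image_features))
               for d in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


def random_vectors(count, dim, rng):
    """Generate `count` random vectors of width `dim` (placeholder data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8            # hypothetical feature width
NUM_QUERIES = 4    # hypothetical number of learnable queries

queries = random_vectors(NUM_QUERIES, DIM, rng)
small_image = random_vectors(16, DIM, rng)   # pretend image with 16 patch features
large_image = random_vectors(64, DIM, rng)   # pretend image with 64 patch features

out_small, w_small = cross_attend(queries, small_image)
out_large, w_large = cross_attend(queries, large_image)

# In this toy, the output size depends only on the number of queries,
# not on how many patch features the image produced.
print(len(out_small), len(out_small[0]))
print(len(out_large), len(out_large[0]))
```

In this simplified picture, only `queries` would be updated during training, while `image_features` would come from an encoder whose weights never change.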

One key difference between BLIP-2 and other vision-language models is that BLIP-2 introduces a new visual-language pre-training paradigm that can potentially leverage any combination of pre-trained vision encoder and LLM without having to pre-train the whole architecture end to end. This enables achieving state-of-the-art results on multiple visual-language tasks while significantly reducing the number of trainable parameters and pre-training costs.

BLIP-2 is pre-trained in two stages:

  • 1st stage: bootstraps vision-language representation learning from a frozen image encoder.
  • 2nd stage: bootstraps vision-to-language generative learning from a frozen language model.
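
One way to picture the two stages is as a table of which components receive gradient updates. The sketch below is illustrative only: the `Component` and `Stage` classes are hypothetical, the parameter counts are round placeholder numbers rather than figures for any BLIP-2 checkpoint, and it assumes the frozen image encoder stays attached to the Q-Former in the second stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    num_params: int   # placeholder count, for illustration only
    trainable: bool


@dataclass(frozen=True)
class Stage:
    description: str
    components: tuple

    def trainable_params(self):
        """Sum the parameters of components that receive gradient updates."""
        return sum(c.num_params for c in self.components if c.trainable)

    def total_params(self):
        """Sum the parameters of every component involved in this stage."""
        return sum(c.num_params for c in self.components)

    def trainable_fraction(self):
        return self.trainable_params() / self.total_params()


# Round placeholder sizes, NOT measured values.
image_encoder = Component("frozen image encoder", 1_000_000_000, trainable=False)
q_former = Component("Q-Former", 200_000_000, trainable=True)
language_model = Component("frozen large language model", 3_000_000_000, trainable=False)

stage_1 = Stage("vision-language representation learning",
                (image_encoder, q_former))
stage_2 = Stage("vision-to-language generative learning",
                (image_encoder, q_former, language_model))

for number, stage in enumerate((stage_1, stage_2), start=1):
    trainable_names = [c.name for c in stage.components if c.trainable]
    print(f"stage {number}: {stage.description}")
    print(f"  trainable: {trainable_names}")
    print(f"  trainable fraction: {stage.trainable_fraction():.1%}")
```

With these placeholder sizes, the Q-Former is the only component updated in either stage, and its share of the total parameters shrinks once the language model is attached.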


BLIP-2 is a zero-shot vision-language model that can be used for multiple image-to-text tasks, prompted with either an image alone or an image together with text. Image-to-text tasks that vision-language models can tackle include image captioning, image-text retrieval, and visual question answering. Image captioning can aid the visually impaired, create useful product descriptions, identify inappropriate content beyond text, and more. Image-text retrieval can be applied in multimodal search, as well as in applications such as autonomous driving. Visual question answering can aid in education, enable multimodal chatbots, and assist in various domain-specific information retrieval applications.
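
Image-text retrieval is commonly framed as ranking candidate texts by their similarity to an image embedding. The following minimal sketch assumes the embeddings have already been produced by some model; the vectors, captions, and function names (`cosine_similarity`, `rank_captions`) are made up for illustration and are not BLIP-2 outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_embedding, caption_embeddings):
    """Return (caption, score) pairs sorted from most to least similar."""
    scored = [(caption, cosine_similarity(image_embedding, emb))
              for caption, emb in caption_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional embeddings; real models use hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a dog running on a beach": [0.8, 0.2, 0.1],
    "a bowl of fruit on a table": [0.0, 0.1, 0.9],
    "a city skyline at night": [0.1, 0.9, 0.2],
}

ranking = rank_captions(image_embedding, caption_embeddings)
for caption, score in ranking:
    print(f"{score:.3f}  {caption}")
```

The same ranking can be run in the other direction, scoring many image embeddings against one text embedding, to retrieve images for a text query.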

  • Visual question answering (VQA) is a task in which a model is presented with an image and a natural language question about it, and must produce a natural language answer based on the content of the image. In education, it gives students a tool for asking questions about visual content and receiving answers. It also enables multimodal chatbots that answer questions about images and other visual content, and it supports domain-specific information retrieval, such as helping doctors retrieve information from medical images.
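
A minimal sketch of prompted VQA using the Hugging Face `transformers` classes `Blip2Processor` and `Blip2ForConditionalGeneration` might look as follows. Several details are assumptions made for this example: the `Salesforce/blip2-opt-2.7b` checkpoint, the "Question: ... Answer:" prompt format, the helper names `build_vqa_prompt` and `answer_question`, and the generation length. The heavy imports are kept inside the function because the model needs `torch`, `transformers`, and `Pillow`, plus a multi-gigabyte download on first use.

```python
def build_vqa_prompt(question):
    """Format a question in the 'Question: ... Answer:' style (assumed prompt format)."""
    return f"Question: {question.strip()} Answer:"


def answer_question(image_path, question, checkpoint="Salesforce/blip2-opt-2.7b"):
    """Answer a natural language question about a local image file.

    The checkpoint name is an assumption for this sketch; other BLIP-2
    checkpoints may expect a different prompt style.
    """
    # Local imports so that the prompt helper works without these packages.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=dtype)
    model.to(device)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_vqa_prompt(question),
                       return_tensors="pt").to(device, dtype)

    generated_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()


# Example call (placeholder path, not executed here):
#   answer_question("photo.jpg", "What is shown in the picture?")
prompt = build_vqa_prompt("  which city is this? ")
print(prompt)
```

Calling the processor with only `images=image` and no `text` argument turns the same pipeline into plain image captioning, since the model then generates a description without a prompt.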




Code a VISION - LLM w/ ViT, FLAN-T5 LLM & BLIP-2: Multimodal LLMs (MLLM)