Difference between revisions of "Transformer"

(30 intermediate revisions by the same user not shown)
Line 9: Line 9:
  
 
* [[Attention]]
 
 +
* [[Generative Pre-trained Transformer (GPT)]]
 +
* [http://jalammar.github.io/illustrated-transformer/ The Illustrated Transformer | Jay Alammar]
 
* [http://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04 What is a Transformer? | Maxime Allard - Medium]
 
* [[Recurrent Neural Network (RNN)]]
+
* Sequence to Sequence (Seq2Seq)  ----> [[Recurrent Neural Networks (RNN)]]  ----> [[Transformer]]
 
* [[Bidirectional Encoder Representations from Transformers (BERT)]]
 
 
* [[Natural Language Processing (NLP)]]
 
 
* [[Memory Networks]]
 
* [[Transformer-XL]]
+
* [[Google]] [[Transformer-XL]] ...T5-XXL ...[http://venturebeat.com/2021/01/12/google-trained-a-trillion-parameter-ai-language-model/ Google trained a trillion-parameter AI language model | Kyle Wiggers - VB]
* Tensor2Tensor (T2T) | Google Brain
+
* [http://github.com/huggingface/transformers Transformers] provides state-of-the-art general-purpose architectures ([[Bidirectional Encoder Representations from Transformers (BERT)]], [[Generative Pre-trained Transformer]]-2 (GPT-2), RoBERTa, XLM, DistilBert, [[XLNet]]...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than 32 pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. | GitHub
** [http://nlp.stanford.edu/seminar/details/lkaiser.pdf Tensor2Tensor Transformers: New Deep Models for NLP | Łukasz Kaiser]
+
* [[Generative]] Modeling
** [http://github.com/tensorflow/tensor2tensor/blob/master/docs/walkthrough.md Tensor2Tensor | GitHub]
+
* [http://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/ How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models | Prateek Joshi - Analytics Vidhya]
** [http://github.com/tensorflow/tensor2tensor Tensor2Tensor Library | GitHub]
 
* [http://jalammar.github.io/illustrated-transformer/ The Illustrated Transformer | Jay Alammar]
 
* [[Sequence to Sequence (Seq2Seq)]]
 
 
 
Transformer Model - The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an [[Autoencoder (AE) / Encoder-Decoder]] configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. [http://arxiv.org/abs/1706.03762  Attention Is All You Need | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin]
 
 
 
http://cdn-images-1.medium.com/max/800/1*BHzGVskWGS_3jEcYYi6miQ.png
 
  
<youtube>IxQtK2SjWWM</youtube>
 
<youtube>XrZ_Y4koV5A</youtube>
 
<youtube>OYygPG4d9H0</youtube>
 
<youtube>QuvRWevJMZ4</youtube>
 
  
== Key, Value, and Query ==
 
  
“[http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Attention is All you Need]” (Vaswani, et al., 2017) is, without a doubt, one of the most impactful and interesting papers of 2017. It presented many improvements to soft attention and made it possible to do seq2seq modeling without recurrent network units. The proposed “transformer” model is built entirely on self-attention mechanisms, without using a sequence-aligned recurrent architecture.
+
Transformer Model - Transformers uniquely have attention such that every output element is connected to every input element, and the weightings between them are, in effect, calculated dynamically. [http://venturebeat.com/2019/10/24/google-achieves-state-of-the-art-nlp-performance-with-an-enormous-language-model-and-data-set/ | Kyle Wiggers] From the abstract of the original paper: "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an [[Autoencoder (AE) / Encoder-Decoder]] configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train." [http://arxiv.org/abs/1706.03762 Attention Is All You Need | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin]
  
The secret recipe is carried in its model architecture.
+
The Transformer is a deep machine learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Like a [[Recurrent Neural Network (RNN)]], a Transformer is designed to handle ordered sequences of data, such as natural language, for tasks such as machine translation and text summarization. However, unlike RNNs, Transformers do not require that the sequence be processed in order. So, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during training. Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in [[Natural Language Processing (NLP)]], replacing gated recurrent neural network models such as the [[Long Short-Term Memory (LSTM)]] in many cases. Because the Transformer architecture facilitates more parallelization during training, it has enabled training on much more data than was possible before it was introduced. This led to the development of pretrained systems such as [[Bidirectional Encoder Representations from Transformers (BERT)]] and [[Generative Pre-trained Transformer (GPT)]]-2, which are trained on huge amounts of general language data prior to being released and can then be fine-tuned for specific language tasks. [http://en.wikipedia.org/wiki/Transformer_(machine_learning_model) Wikipedia]
  
=== multi-head self-attention mechanism ===
+
== Tensor2Tensor (T2T) | Google Brain ==
The major component of the transformer is the multi-head self-attention unit. The transformer views the encoded representation of the input as a set of key-value pairs, (K,V), both of dimension n (input sequence length); in the context of neural machine translation (NMT), both the keys and values are the encoder hidden states. In the decoder, the previous output is compressed into a query (Q of dimension m), and the next output is produced by mapping this query against the set of keys and values.
+
* [http://nlp.stanford.edu/seminar/details/lkaiser.pdf Tensor2Tensor Transformers: New Deep Models for NLP | Łukasz Kaiser]
 +
* [http://github.com/tensorflow/tensor2tensor/blob/master/docs/walkthrough.md Tensor2Tensor | GitHub]
 +
* [http://github.com/tensorflow/tensor2tensor Tensor2Tensor Library | GitHub]
 +
* [http://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb # Welcome to the Tensor2Tensor Colab]
  
The transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys:
+
Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and [https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html accelerate ML research]. T2T is actively used and maintained by researchers and engineers within the [https://research.google.com/teams/brain/ Google Brain team] and a community of users. This colab shows you some datasets we have in T2T, how to download and use them, some models we have, how to download pre-trained models and use them, and how to create and train your own models.
  
<math>\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{n}}\right)V</math>
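
The scaled dot-product attention formula above can be sketched in a few lines of NumPy. This is only an illustration, not code from the paper or from any library: the function names, the random example data, and the shapes are assumptions made here. The sketch takes the scaling factor to be the length of each key vector (written d_k in Vaswani et al.), which is the quantity that appears under the square root in the formula.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for queries Q, keys K and values V.

    Assumed shapes (illustrative): Q is (m, d), K is (n, d), V is (n, d_v).
    """
    d = K.shape[-1]                      # length of each key vector
    scores = Q @ K.T / np.sqrt(d)        # (m, n): one score per query-key pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

# Hypothetical example data: 2 queries attending over 3 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, so each row of `output` is a convex combination of the rows of `V`.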
 
  
=== Multi-Head Self-Attention ===
+
<img src="http://lilianweng.github.io/lil-log/assets/images/transformer.png" width="700" height="475">
multi-head scaled dot-product attention
 
  
http://lilianweng.github.io/lil-log/assets/images/multi-head-attention.png
 
 
Multi-head scaled dot-product attention mechanism. (Image source: Fig 2 in Vaswani, et al., 2017)
 
  
Rather than only computing the attention once, the multi-head mechanism runs through the scaled dot-product attention multiple times in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions. I assume the motivation is because ensembling always helps? ;) According to the paper, “multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.” [http://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html Attention? Attention! | Lilian Weng]
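
The "run attention several times in parallel, then concatenate and linearly transform" description can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the reference implementation: the weight matrices `W_q`, `W_k`, `W_v`, `W_o`, the function name, and the example sizes are all hypothetical, the model dimension is assumed to split evenly across heads, and masking, biases, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, computed for every head at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)

    # Concatenate the heads, then apply the final linear transform.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Hypothetical example: 4 positions, model dimension 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

Each head works on its own slice of the projected vectors, which is one way to read the paper's remark about attending to different representation subspaces.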
+
<youtube>IxQtK2SjWWM</youtube>
 +
<youtube>S27pHKBEp30</youtube>
 +
<youtube>q7QP_lfqnQM</youtube>
 +
<youtube>AFkGPmU16QA</youtube>
 +
<youtube>rURRYI66E54</youtube>
 +
<youtube>cgrqWBWzKjI</youtube>

Revision as of 21:53, 13 January 2021



Transformer Model - Transformers uniquely have attention such that every output element is connected to every input element, and the weightings between them are, in effect, calculated dynamically. | Kyle Wiggers From the abstract of the original paper: "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an Autoencoder (AE) / Encoder-Decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train." Attention Is All You Need | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin

The Transformer is a deep machine learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Like a Recurrent Neural Network (RNN), a Transformer is designed to handle ordered sequences of data, such as natural language, for tasks such as machine translation and text summarization. However, unlike RNNs, Transformers do not require that the sequence be processed in order. So, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during training. Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in Natural Language Processing (NLP), replacing gated recurrent neural network models such as the Long Short-Term Memory (LSTM) in many cases. Because the Transformer architecture facilitates more parallelization during training, it has enabled training on much more data than was possible before it was introduced. This led to the development of pretrained systems such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT)-2, which are trained on huge amounts of general language data prior to being released and can then be fine-tuned for specific language tasks. Wikipedia
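
The contrast between in-order processing and parallel processing can be made concrete with a small NumPy sketch. Everything here is an illustrative assumption rather than a real model: the simple tanh recurrence stands in for an RNN, the self-attention has no learned parameters, and the names and sizes are made up. The point is only that each RNN state needs the previous state, while the attention output for any one position can be computed without the outputs of the others.

```python
import numpy as np

def rnn_forward(X, W_h, W_x):
    """Simple tanh RNN: each hidden state depends on the previous one."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:                        # positions must be visited in order
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

def self_attention(X):
    """Parameter-free self-attention for all positions in one matrix product."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def self_attention_row(X, i):
    """Attention output for position i alone, using only the inputs X."""
    scores = X @ X[i] / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ X

# Hypothetical example: a sequence of 5 vectors of size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_h = rng.normal(size=(6, 6)) * 0.1
W_x = rng.normal(size=(6, 8)) * 0.1

rnn_states = rnn_forward(X, W_h, W_x)    # sequential loop over positions
attn_all = self_attention(X)             # all positions at once
attn_last = self_attention_row(X, 4)     # last position, computed on its own
```

Because `self_attention_row` never needs another position's output, the rows can be computed in any order or all at once, which is the property that makes training easier to parallelize.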

Tensor2Tensor (T2T) | Google Brain

Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research (https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html). T2T is actively used and maintained by researchers and engineers within the Google Brain team (https://research.google.com/teams/brain/) and a community of users. This colab shows you some datasets we have in T2T, how to download and use them, some models we have, how to download pre-trained models and use them, and how to create and train your own models.


Multi-head scaled dot-product attention mechanism. (Image source: Fig 2 in Vaswani, et al., 2017)