Autoencoder (AE) / Encoder-Decoder




Autoencoders (AEs), classically known as auto-associators, are closely related to Feedforward Neural Networks (FFNNs): an AE is more a different use of an FFNN than a fundamentally different architecture. The basic idea behind autoencoders is to encode information (as in compress, not encrypt) automatically, hence the name. The entire network always resembles an hourglass-like shape, with hidden layers that are smaller than the input and output layers. AEs are also always symmetrical around the middle layer(s) (one or two, depending on whether the number of layers is odd or even). The smallest layer(s) almost always sit in the middle, the place where the information is most compressed (the chokepoint of the network). Everything up to the middle is called the encoding part, everything after the middle the decoding, and the middle (surprise) the code. One can train an AE using backpropagation by feeding in the input and setting the error to be the difference between the input and the reconstruction. AEs can also be built symmetrically with respect to weights, so that the encoding weights are the same as the decoding weights. Bourlard, Hervé, and Yves Kamp. "Auto-association by multilayer perceptrons and singular value decomposition." Biological Cybernetics 59.4-5 (1988): 291-294.
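
As a concrete illustration, here is a minimal sketch of an hourglass-shaped autoencoder trained with backpropagation on the reconstruction error. It assumes PyTorch; the layer sizes, learning rate, and random stand-in data are illustrative choices, not anything prescribed by the text or the reference.

import torch
import torch.nn as nn

# Hourglass shape: wide input -> smaller hidden layer -> code (chokepoint) -> mirrored decoder.
# Sizes (784 -> 128 -> 32) are assumptions for illustration only.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()            # error = difference between the input and the reconstruction

x = torch.rand(64, 784)           # stand-in batch of inputs
for _ in range(100):              # ordinary backpropagation
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)   # the target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The encoder and decoder here mirror each other in layer sizes, matching the symmetric design described above, although the weights themselves are not tied in this sketch.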

[Image: ae.png — autoencoder architecture diagram]

[Image: transformer_resideual_layer_norm_3.png — encoder-decoder stack with residual connections and layer normalization]


Is there a difference between autoencoders and encoder-decoders?

Provided by Alexander Ororbia

Here is how I would view these two terms (informally). Think of the encoder-decoder as a very general framework/architecture design. In this design, you have some function that maps an input space, whatever it may be, to a different/latent space (the “encoder”). The decoder is simply the complementary function that creates a map from the (encoder’s) latent space to another target space (whatever it is we want to decode from the latent space). Note that by simply mapping spaces and linking them through a shared latent space, you could do something like map a sequence of tokens in English (i.e., an English sentence) to a sequence of tokens in French (i.e., the translation of that English sentence into French). In some neural translation models, you map an English sequence to a fixed vector (say, the last state of the recurrent network you use to process the sentence iteratively, found upon reaching a punctuation mark), from which you then decode a French sequence.
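
To make that framing concrete, here is a rough sketch of such a translation setup, assuming PyTorch; the vocabulary sizes, dimensions, and random token batches are purely illustrative stand-ins for real English/French data.

import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128   # illustrative sizes

src_embed = nn.Embedding(SRC_VOCAB, EMB)
encoder   = nn.GRU(EMB, HID, batch_first=True)          # maps English tokens -> latent space
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
decoder   = nn.GRU(EMB, HID, batch_first=True)          # maps latent space -> French tokens
readout   = nn.Linear(HID, TGT_VOCAB)

english = torch.randint(0, SRC_VOCAB, (8, 12))          # batch of 8 source sentences, length 12
french  = torch.randint(0, TGT_VOCAB, (8, 15))          # batch of 8 target sentences, length 15

# Encode: the final hidden state is the fixed vector summarising the whole sentence.
_, latent = encoder(src_embed(english))

# Decode: initialise the decoder with that latent vector and score the target tokens.
decoder_out, _ = decoder(tgt_embed(french), latent)
logits = readout(decoder_out)                            # (8, 15, TGT_VOCAB) scores per position

Note that the encoder and decoder here share nothing except the latent vector, which is exactly the "shared latent space" linking the two spaces.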

An autoencoder (or auto-associator, as it was classically known) is a special case of an encoder-decoder architecture: first, the target space is the same as the input space (i.e., English inputs to English targets), and second, the target is to be equal to the input. So we would be mapping something like vectors to vectors (note that this could still be a sequence, as there are recurrent autoencoders, but in that case you are not predicting the future, simply reconstructing the present given a state/memory and the present). An autoencoder is really meant to do auto-association, so we are essentially trying to build a model that “recalls” the input. This allows the autoencoder to do things like pattern completion: if we give our autoencoder a partially corrupted input, it will be able to “retrieve” the correct pattern from memory.
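
The pattern-completion behaviour can be sketched by training on corrupted inputs while keeping the clean input as the target (a denoising autoencoder). This again assumes PyTorch; the corruption rate, sizes, and random data are arbitrary illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(64, 784)                            # stand-in batch of clean patterns
for _ in range(200):
    mask = (torch.rand_like(clean) > 0.3).float()      # zero out roughly 30% of each input
    corrupted = clean * mask
    loss = F.mse_loss(model(corrupted), clean)         # target is the clean input, not the corrupted one
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, feeding a partially corrupted pattern should "retrieve" the full one.
completed = model(clean[:1] * (torch.rand(1, 784) > 0.3).float())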

Also, generally, we build autoencoders because we are more interested in getting a representation rather than learning a predictive model (though one could argue we get pretty useful representations from predictive models as well…).

But the short story is simple: an autoencoder is really a special instance of an encoder-decoder. This framing is especially useful when we want to decouple the encoder and decoder to create something like a variational autoencoder, which also frees us from having to make the decoder symmetrical in design to the encoder (i.e., the encoder could be a 2-layer convolutional network while the decoder could be a 3-layer deconvolutional network). In a variational autoencoder, the idea of a latent space becomes clearer, because now we truly map the input (such as an image or document vector) to a latent variable, from which we reconstruct the original/same input (such as the image or document vector).
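
Here is a minimal variational autoencoder sketch, assuming PyTorch, with a deliberately asymmetric encoder (two transformations) and decoder (three) to show that the two halves are decoupled; fully connected layers stand in for the convolutional/deconvolutional example, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, dim=784, latent=16):
        super().__init__()
        # Encoder outputs both a mean and a log-variance for the latent variable.
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent))
        # Decoder is intentionally deeper than the encoder: no symmetry required.
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, 256), nn.ReLU(),
                                     nn.Linear(256, dim))

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
        return self.decoder(z), mu, logvar

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                                           # stand-in batch of inputs
for _ in range(100):
    reconstruction, mu, logvar = model(x)
    recon_loss = F.mse_loss(reconstruction, x, reduction='sum')   # reconstruct the same input
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # pull the latent toward N(0, I)
    loss = recon_loss + kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The explicit mean/variance output and the KL term are what make the mapping to a latent variable, rather than a fixed code, concrete.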

I also think a great deal of confusion comes from misuse of terminology. Nowadays, ML folk especially tend to mix and match words (some do so to make things sound cooler or find buzzwords that will attract readers/funders/fame/glory/etc.), but this might be partly due to the re-branding of artificial neural networks as “deep learning” ;-) [since, in the end, everyone wants the money to keep working]

Masked Autoencoder