A neural network that creates images from text captions.
Go to the website and play with the captions!
Let's start with some definitions:
Confused?
Refer to the figure below to understand how the different variables are used in different machine learning techniques.
These kinds of networks are used to learn the internal structure of an input and encode it in a hidden internal representation.
We have already learned how to train energy-based models; now let's look at the network below:
Here, instead of computing a minimization over the energy to find the latent code, we use an encoder that produces the hidden representation directly: $\boldsymbol{h} = \text{Enc}(\boldsymbol{x})$. The hidden representation is then converted into a reconstruction of the input: $\hat{\boldsymbol{x}} = \text{Dec}(\boldsymbol{h})$. Basically, the encoder replaces the per-sample optimization with a single forward pass. Note that here the energy is the reconstruction error between the input $\boldsymbol{x}$ and its reconstruction $\hat{\boldsymbol{x}}$. This is called an autoencoder: the encoder performs amortized inference, so we do not have to minimize the energy by explicit optimization for each input.
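The encoder-decoder structure can be sketched in a few lines of numpy. This is a minimal illustration, not the course's exact network: the dimensions, the `tanh` nonlinearity, and the random (untrained) weights are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8-dimensional input, 3-dimensional hidden code.
n, d = 8, 3

# Encoder and decoder parameters (random here; in practice learned by gradient descent).
W_h, b_h = rng.standard_normal((d, n)), np.zeros(d)
W_x, b_x = rng.standard_normal((n, d)), np.zeros(n)

def encode(x):
    """Amortised inference: one forward pass yields the code h = Enc(x)."""
    return np.tanh(W_h @ x + b_h)

def decode(h):
    """Map the hidden code back to input space: x_hat = Dec(h)."""
    return W_x @ h + b_x

def energy(x):
    """Reconstruction energy: squared error between x and its reconstruction."""
    x_hat = decode(encode(x))
    return 0.5 * np.sum((x - x_hat) ** 2)

x = rng.standard_normal(n)
h = encode(x)
print(h.shape)  # (3,)
print(energy(x))
```

Note that `energy` never runs an inner optimization loop over the latent: the encoder's forward pass stands in for it, which is what "amortizing" means here.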
Below are two examples of reconstruction energies. For real-valued input we can use the squared error

$$\ell(\boldsymbol{x}, \hat{\boldsymbol{x}}) = \frac{1}{2}\Vert\boldsymbol{x} - \hat{\boldsymbol{x}}\Vert^2,$$

and in the case of binary input we can simply use the binary cross-entropy

$$\ell(\boldsymbol{x}, \hat{\boldsymbol{x}}) = -\sum_{i} \left[x_i \log(\hat{x}_i) + (1 - x_i)\log(1 - \hat{x}_i)\right].$$

The training loss is the average, across all training samples, of the per-sample loss:

$$L = \frac{1}{m}\sum_{j=1}^{m} \ell\left(\boldsymbol{x}^{(j)}, \hat{\boldsymbol{x}}^{(j)}\right).$$
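The two reconstruction energies and their batch average can be written directly in numpy. The function names (`mse_energy`, `bce_energy`, `loss`) are invented for this sketch; the `eps` clipping is a standard numerical guard, not part of the formula.

```python
import numpy as np

def mse_energy(x, x_hat):
    """Squared-error reconstruction energy for real-valued inputs."""
    return 0.5 * np.sum((x - x_hat) ** 2)

def bce_energy(x, x_hat, eps=1e-12):
    """Binary cross-entropy reconstruction energy for binary inputs.
    x_hat must lie in (0, 1), e.g. the output of a sigmoid."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def loss(batch, batch_hat, per_sample_energy):
    """Average of the per-sample energy over all training samples."""
    return np.mean([per_sample_energy(x, xh) for x, xh in zip(batch, batch_hat)])

x = np.array([1.0, 0.0, 1.0])
print(mse_energy(x, x))                            # 0.0 — perfect reconstruction
print(bce_energy(x, np.array([0.9, 0.1, 0.9])))    # -3·log(0.9) ≈ 0.316
```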
The size of the hidden representation determines what the autoencoder can learn. If we choose a hidden dimension smaller than the input dimension (an under-complete autoencoder), the network is forced to compress the input and keep only its most important features. In some situations it can be useful to have a hidden representation larger than the input (an over-complete autoencoder), but such a network can trivially copy the input to the output unless it is constrained.
To prevent the model from collapsing, we have to employ techniques that constrain the regions of input space that can take zero or low energy values. These techniques can be some form of regularization, such as sparsity constraints, adding noise to the input, or sampling.
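One common instance of such regularization is a sparsity penalty on the hidden code. The sketch below adds an $L_1$ term to the reconstruction energy; the function name and the weight `lam` are assumptions for illustration, not values from the course.

```python
import numpy as np

def sparse_loss(x, x_hat, h, lam=1e-3):
    """Reconstruction energy plus an L1 sparsity penalty on the code h.
    The penalty drives most hidden units toward zero, which restricts
    the region of input space the autoencoder can assign low energy."""
    reconstruction = 0.5 * np.sum((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(h))
    return reconstruction + sparsity

x = np.array([1.0, 2.0])
h = np.array([0.5, -0.5, 0.0])
print(sparse_loss(x, x, h))  # 0.001: pure sparsity penalty, zero reconstruction error
```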
We add some augmentation/corruption, such as Gaussian noise, to an input sampled from the training manifold, and train the network to reconstruct the clean input. This is the idea behind the denoising autoencoder.
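The corruption step can be sketched as additive Gaussian noise. The noise scale `sigma` is an illustrative choice, not a value from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, sigma=0.5):
    """Corrupt a clean sample from the training manifold with Gaussian noise.
    A denoising autoencoder is trained to map corrupt(x) back to the clean x."""
    return x + sigma * rng.standard_normal(x.shape)

x = np.zeros(5)          # stand-in for a clean training sample
x_tilde = corrupt(x)     # noisy version fed to the encoder
print(x_tilde.shape)     # (5,)
```

During training, the reconstruction energy is measured against the *clean* input `x`, not the corrupted `x_tilde`, so the network learns to pull noisy points back onto the data manifold.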



