If you Google "generative model", you'll find several first-page ML blog posts telling you that a generative model is unsupervised learning, or that it uses unsupervised learning to learn the distribution of the data. This is not true. With the hype around VAEs and GANs, it is easy to assume that generative models = unsupervised learning. There exist generative models that can also be trained using a supervised approach. In this post, I aim to show some examples of generative models. We are going to see that generative models are more of a spectrum than a single category. This post is also meant to help me strengthen my own understanding.


Discriminative vs Generative

Discriminative models directly capture the conditional probability \(p(y|x)\). Given a feature \(x\) and a label \(y\), the model can then discriminate between different input instances. Discriminative models are pervasive in the machine learning community.

Generative models, on the other hand, capture the joint probability \(p(x,y)\). Hence, during prediction, a generative model has information about the data generating process that a discriminative model does not. Now one might wonder, what about deep generative models like VAE, GAN, or flow-based models? None of these models has a label \(y\) (though some do, depending on the task). They are still categorized as generative models because they can generate samples after training; these models capture \(p(x)\). In short, if after training you can generate new data by sampling from your model, then your model is a generative model.

Generative models themselves differ in many ways. There are many ways to model the data generating process, depending on the design choice: it can be supervised or unsupervised, it can be factorized differently depending on the architecture design, and so on. Below I list some examples, but this is by no means an exhaustive list of generative models. I am leaving the technical depth out of this post and only discussing the high-level overview.


Examples of Generative Models

Naive Bayes (NB)

NB captures \(p(x,y)\) by factorizing it into \(p(x|y) p(y)\). The naive part comes from the assumption that every feature \(x_i\) is conditionally independent of the others given a category \(y\), hence

\[p(x|y) p(y) = p(y) \prod_{i} p(x_{i}|y)\]

\(p(x|y)\) can be represented by a distribution. For example, in a text classification task1, one can use a multinomial distribution2 over distinct words given \(y\) (the parameters of \(p(x|y)\) then form a matrix with one row per label \(y\) and \(v\) columns, where \(v\) is the total vocabulary size). The classification rule is \(argmax_y p(x|y) p(y)\). For \(p(y)\), one usually uses the empirical categorical distribution of \(y\), whereas the parameters of \(p(x|y)\) can be estimated with maximum likelihood. In the text classification example, one could sample words from the multinomial distribution given a label \(y\). It is not clear, however, how many words an instance should contain, so that number \(k\) needs to be decided in advance. The sampling will give us garbled text since there is no notion of sequence (also, nobody trains NB to generate new samples). To make a prediction, we use the classification rule. There is a sequential version of NB, the hidden Markov model (HMM), but I am not discussing it in this post.
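To make this concrete, here is a minimal sketch of a multinomial NB in plain numpy on a made-up toy matrix of word counts. The add-one smoothing and the fixed sample length \(k\) are my own assumptions for illustration, not part of any particular reference implementation.

```python
# Minimal multinomial Naive Bayes sketch: fit p(y) and p(x|y), classify, and sample.
import numpy as np

X = np.array([[2, 0, 1, 0],      # rows: documents, columns: word counts over a 4-word vocab
              [0, 3, 0, 1],
              [1, 0, 2, 0]])
y = np.array([0, 1, 0])          # class labels (toy data)

n_classes = 2
log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))   # empirical p(y)

# maximum likelihood estimate of p(x_i|y), with add-one smoothing (an assumption)
counts = np.vstack([X[y == c].sum(axis=0) for c in range(n_classes)]) + 1.0
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))  # one row per label

def predict(x):
    # classification rule: argmax_y log p(y) + sum_i x_i * log p(word_i|y)
    return int(np.argmax(log_prior + x @ log_likelihood.T))

def sample(label, k=5, rng=np.random.default_rng(0)):
    # draw k words for the chosen class; k has to be fixed by hand, as noted above
    return rng.multinomial(k, np.exp(log_likelihood[label]))

print(predict(np.array([1, 0, 1, 0])))   # -> 0
print(sample(1))                          # word counts drawn from p(x|y=1)
```

As expected, the samples are just bags of word counts with no word order, which is exactly why the generated "text" is garbled.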

Latent Dirichlet Allocation (LDA)

LDA is widely used in the topic modeling task, where the objective is to automatically retrieve hidden topics for given documents3. LDA assumes that a document is composed of a mixture of topics, and these topics generate the series of words inside the document. The topic is a latent variable that one cannot observe during training. Given this assumption, LDA is by definition a hierarchical Bayesian model. Below is the graphical model visualization:

(Graphical model of the LDA data generating process; source: the original paper.)

There are three main parameters of this model4: \(\theta\), the document-topic distribution (an \(M \times K\) matrix), \(\beta\), the topic-word distribution (a \(K \times V\) matrix), and \(z_n\), the word assignment (a \(K\)-vector at the document-word level). The objective of LDA is to find the best values for these parameters given the data that we have. This is a Bayesian inference task where the goal is to calculate the posterior distribution over these parameters. We cannot calculate the posterior analytically because computing the denominator is intractable. Hence, to approximate the posterior, most advocate either Markov chain Monte Carlo (MCMC) or variational inference (VI). The original LDA paper uses VI, but there exist some MCMC approaches as well. For the VI approach, they use mean-field variational inference5, which breaks the interdependence between the latent variables, making the calculation easier. After the Bayesian inference is done, we get the approximated posterior distributions for \(\theta\), \(\beta\), and \(z_n\). We can use these distributions to generate new samples after training, following the data generating process above. The sampling, like in NB, will result in garbled text as there is no notion of sequence here. Given a document consisting of words, we can use LDA to get the most likely topics for the document by generating the mixture of topics first using \(\theta\) and then greedily assigning each word to a topic based on \(\beta\) and \(z_n\).
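As a quick illustration, here is a sketch using scikit-learn's LatentDirichletAllocation (which fits LDA with variational inference) on a tiny made-up corpus. The corpus, the number of topics, and the number of top words shown are arbitrary choices for the example.

```python
# Fit LDA on a toy corpus and inspect the document-topic and topic-word parameters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell as markets closed",
        "investors sold shares today"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # bag-of-words counts (M x V)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)                  # document-topic proportions (M x K)
beta = lda.components_                    # unnormalized topic-word weights (K x V)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(beta):
    top_words = [vocab[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}:", top_words)
print(theta.round(2))                     # mixture of topics per document
```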

Generative RNN for Text Classification

A generative RNN uses the same factorization \(p(x,y) = p(x|y) p(y)\) as in NB. The difference here is that since the model is an RNN, \(p(x|y)\) can be further factorized into \(\prod_{i} p(x_i| x_{<i}, y)\). During training, the label \(y\) is used as an input to the RNN, transformed into its own trainable embedding. The authors experimented with two models: sharing the parameters across all labels or using different parameters for each label. During prediction, they compute \(argmax_y p(x|y) p(y)\), using the empirical categorical distribution for \(p(y)\). Sampling from a generative RNN is pretty straightforward: we can first sample the label and then start the sequential RNN process that generates a new data instance one token at a time. To stop the generative process, one might include an EOS token or determine the sentence length in advance.
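Below is a minimal sketch, in PyTorch, of a label-conditioned LSTM language model that scores \(p(x|y)\) token by token and classifies with the rule above. This is not the authors' exact architecture; the layer sizes, the way the label embedding is fed at every step, and the names are all illustrative assumptions.

```python
# A label-conditioned language model: score p(x|y) and classify via argmax_y p(x|y)p(y).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionalLM(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.lab_emb = nn.Embedding(num_labels, emb_dim)
        self.rnn = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, tokens, label):
        # tokens: 1-D tensor of token ids, starting with a BOS symbol; label: scalar id
        # log p(x|y) = sum_i log p(x_i | x_<i, y)
        x = self.tok_emb(tokens[:-1]).unsqueeze(0)            # (1, T-1, E)
        y = self.lab_emb(label).view(1, 1, -1).expand_as(x)   # label fed at every step
        h, _ = self.rnn(torch.cat([x, y], dim=-1))
        logp = F.log_softmax(self.out(h), dim=-1)             # (1, T-1, V)
        return logp.gather(-1, tokens[1:].view(1, -1, 1)).sum()

def classify(model, tokens, log_prior):
    # classification rule: argmax_y log p(x|y) + log p(y), with empirical log p(y)
    scores = [model.log_prob(tokens, torch.tensor(y)) + log_prior[y]
              for y in range(len(log_prior))]
    return int(torch.stack(scores).argmax())
```

Sampling would run the same LSTM forward one token at a time, feeding each sampled token back in until an EOS token (or a length limit) is reached.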

Generative RNNG (Recurrent Neural Network Grammar)

In my opinion, RNNG is a cool model. RNNG's design allows it to model \(p(x,y)\) directly without any factorization. It is based on a top-down generation algorithm that relies on a stack data structure. We can imagine that at each timestep, RNNG generates an action that produces both the words \(x\) and a well-formed parse tree \(y\). I'm omitting a lot of details here, but there are 3 main actions in the generative version:

  • NT(s), which generates a non-terminal S
  • GEN(x), which generates a word
  • REDUCE, which completes a partially open subtree in the stack

For the discriminative version \(p(y|x)\), GEN(x) is replaced by a SHIFT action that consumes a given sentence from a buffer data structure. Sampling a parsed sentence from this model is easy: we can run the sequential process of the RNNG until it hits the stop token, yielding a well-formed parse tree along with a sentence. In the original paper, besides evaluating RNNG on a syntactic parsing task, the generative RNNG is also used to measure perplexity in a language model setting. To evaluate RNNG as a language model, they use an importance sampling approximation to calculate the marginal probability \(p(x) = \sum_y p(x, y)\). For the parsing task, from the importance samples above, they retrieve the MAP parse tree, that is, the tree \(\hat{y}\) that has the highest probability under the joint probability \(p(x,y)\).
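To make the action inventory concrete, here is a toy, non-neural sketch that replays a hand-written action sequence on a stack and recovers the bracketed parse tree. The sentence and the action sequence are made up for illustration; the real model learns a distribution over these actions.

```python
# Replay a generative RNNG-style action sequence and reconstruct the parse tree.
ACTIONS = ["NT(S)", "NT(NP)", "GEN(the)", "GEN(cat)", "REDUCE",
           "NT(VP)", "GEN(meows)", "REDUCE", "REDUCE"]

def run(actions):
    stack = []
    for a in actions:
        if a.startswith("NT("):
            stack.append(("OPEN", a[3:-1]))       # open a non-terminal on the stack
        elif a.startswith("GEN("):
            stack.append(a[4:-1])                  # generate a terminal word
        elif a == "REDUCE":                        # close the most recent open non-terminal
            children = []
            while not isinstance(stack[-1], tuple):
                children.append(stack.pop())
            _, label = stack.pop()
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
    return stack[0]

print(run(ACTIONS))   # -> (S (NP the cat) (VP meows))
```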

Variational Autoencoder (VAE)

When I first studied VAE, I viewed it from a deep learning perspective: an autoencoder which maps \(x\) to a latent space \(z\) (encoding) and then maps it back to the data space \(x\) (decoding). Aside from the reconstruction loss, VAE introduces a regularization term that pushes the encoder's output distribution to be close to a multivariate Gaussian, since VAE maps to a distribution rather than a single point. VAE has a Bayesian interpretation that explains why there are two losses and why the encoder and decoder come into existence. Given data points \(x = (x_1, \dots, x_n)\), we assume that there is a latent variable \(z_i\) for each \(x_i\) and that there is a joint distribution \(p(x,z) = p(x|z) p(z)\) which explains where the data is coming from. In VAE, \(p(z)\) is a multivariate Gaussian and \(p(x|z)\) is parameterized by a neural network. Now the objective here is to do Bayesian inference: we want to find the best \(z\) given the observed data, or in other words, find the posterior distribution:

\[p(z|x) = \frac{p(x|z)p(z)}{p(x)}\]

Calculating the exact \(p(x)\) is intractable because it requires us to marginalize over all possible \(z\). Hence, we approximate the posterior by introducing a variational distribution \(q(z|x)\). Ignoring some math derivation, we arrive at this objective function6:

\[\log p(x) \geq \mathbb{E}_{z \sim q}[\log p_{\theta}(x|z)] - KL(q_{\phi}(z|x)||p(z))\]

where the right-hand side is known as the ELBO: the first term is the expected reconstruction log-likelihood and the second term is the KL divergence7 between \(q_{\phi}(z|x)\) and \(p(z)\). Notice that in VAE, the variational parameters for every data point are produced by the same encoder network; this is known as amortized inference. Here it becomes clear that \(q_{\phi}(z|x)\) is the encoder network and \(p_{\theta}(x|z)\) is the decoder network. We can interpret the first term as the (negative) reconstruction error and the KL term as the regularization term. We have arrived from the Bayesian interpretation at the deep learning perspective. Once the model is trained, we can sample from the Gaussian prior \(p(z)\) and then feed the sample into our decoder network \(p_{\theta}(x|z)\).
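Here is a minimal VAE sketch in PyTorch showing the reparameterization trick and the negative ELBO as a loss. The layer sizes, the Bernoulli decoder, and the fake binary data are assumptions for illustration (roughly what one would use for binarized MNIST-like vectors), not a definitive implementation.

```python
# Minimal VAE: Gaussian encoder q(z|x), Bernoulli decoder p(x|z), negative ELBO loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

def neg_elbo(model, x):
    logits, mu, logvar = model(x)
    # expected reconstruction log-likelihood (Bernoulli decoder) ...
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # ... plus KL(q(z|x) || N(0, I)), which has a closed form for Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
x = torch.rand(8, 784).bernoulli()        # fake binary data for the sketch
loss = neg_elbo(model, x)
loss.backward()

# sampling after training: z ~ p(z), then decode
x_new = torch.sigmoid(model.dec(torch.randn(1, 20)))
```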

Generative Adversarial Network (GAN)

When they first came out, GANs became widely popular for generating good quality images. A GAN attempts to model \(p(x)\) by introducing two agents: a discriminator \(D\) and a generator \(G\). The generator learns to generate plausible data, and the discriminator learns to distinguish the generator's fake data from the real data. During training, the generator generates fake data and the discriminator has to predict which one is fake. As training progresses, the generator gets better at generating data and it becomes harder for the discriminator to tell which one is fake. At the end of training, \(p(x)\) is modeled by the generator \(G\), hence we can directly sample new data from this network. In this setting, GAN is trained in an unsupervised way because we can create the labels automatically for the discriminator.

Steps:

  • We introduce \(p(z)\), a distribution of noise which we can sample from (uniform, normal, etc).
  • We feed the sampled \(z\) into the generator \(G(z)\) and it will output a data point (an image in the original paper). Consequently, the generator implicitly defines a probability distribution \(p_g(x)\) over the data it generates.
  • We present data from \(p_g(x)\) and from \(p_r(x)\), the probability distribution over the real data, to the discriminator network.

Both \(D\) and \(G\) are playing a minimax game in which the loss function is as follows:

\[\min_G \max_D L(D, G)= \mathbb{E}_{x\sim p_{r}(x)}[\log D(x)] + \mathbb{E}_{x\sim p_{g}(x)}[\log(1 - D(x))]\]

where the right term is equal to \(\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]\).

With further math derivations, it can be shown that minimizing the above loss (for an optimal discriminator) amounts to minimizing the Jensen-Shannon divergence8 (JS divergence), which quantifies the similarity between the fake data distribution \(p_g(x)\) and the real data distribution \(p_r(x)\). \(D\) and \(G\) reach their optimal values when \(p_g(x) = p_r(x)\), at which point the JS divergence attains its minimum of 0.
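Below is a bare-bones GAN training loop in PyTorch on toy 1-D data. The network sizes, optimizers, and data distribution are assumptions for the sketch, and the generator update uses the common non-saturating variant (maximize \(\log D(G(z))\)) rather than the exact minimax objective above.

```python
# Toy GAN: alternate discriminator and generator updates, then sample from G.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # z -> fake x
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # x -> real prob
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    x_real = torch.randn(64, 1) * 0.5 + 2.0        # "real" data: samples from N(2, 0.5)
    x_fake = G(torch.randn(64, 8))                 # samples from p_g(x)

    # discriminator step: push D(x_real) toward 1 and D(x_fake) toward 0
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # generator step (non-saturating variant): push D(G(z)) toward 1
    g_loss = bce(D(x_fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# after training, sample new data directly from the generator
samples = G(torch.randn(5, 8))
```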


Closing

To conclude, generative models allow us to sample from the model after training because we design a data generating process assumption into the model. We have seen that generative models can be designed in many ways, for example: factorizing into \(p(x|y) p(y)\) with a label \(y\), designing a model that captures the joint \(p(x,y)\) directly, assuming a latent variable, or setting up a zero-sum game. There are plenty of other approaches to generative modeling that I do not mention in this post, such as autoregressive9 models and flow-based models (I might create a part 2 on these some other time).

Thank you for reading my post! Please contact me on twitter if you spot any mistake.


Footnote

  1. NB is not tied to a text based problem.
  2. One can use different distributions, such as Gaussian or categorical, for \(p(x|y)\). For the Gaussian and categorical cases, the distribution has different parameters for every feature \(x_{i}\). This means that we have a different distribution for each \(p(x_{i}|y)\), instead of the same distribution for every \(x_{i}\) as in the text classification example above.
  3. LDA is not tied to a text based problem.
  4. This post excludes other parameters in the LDA. In practice, other parameters in LDA (\(\alpha\), \(\eta\)) can also be approximated using either VI or MCMC, depending on the design choice. Refer to the original paper for more details.
  5. What still confuses me is that if one were to use mean-field VI for VAE instead of amortized inference, the mean-field VI would have a separate variational distribution \(q\) per data point, instead of per latent variable. I don't have an answer to this yet.
  6. This is the same objective that we get if we use variational inference in LDA. The difference is that in VAE \(q\) is parameterized by a neural network, whereas in LDA \(q\) is parameterized with the same distribution family as the true parameters. The learning is then achieved via variational EM.
  7. The KL divergence equals 0 when \(q_{\phi}(z|x) = p(z)\). And it is not symmetric.
  8. JS divergence is symmetric.
  9. They’re not the same as recurrent in RNN.