Disentangled Representations

\overbrace{\mathbb E_{p_\phi(\mathbf z \vert \mathbf x)}}^{\text{Samples}} [ \underbrace{-\log q_{\mathbf \theta} ( \mathbf x\vert \mathbf z)}_{\text{reconstruction loss}} ] + {\color{orange}\beta}\; \sum_i \underbrace{\text{KL} \left(p_\phi( \mathbf z_i\vert \mathbf x)\:\vert \vert\:prior(\mathbf z_i) \right)}_{\text{compactness prior loss}}

The goal is to disentangle a complicated feature space into a simpler latent representation
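As a concrete anchor, here is a minimal numpy sketch of the objective above, assuming a diagonal-Gaussian encoder, a N(0, I) prior, and a pixel-wise squared-error reconstruction term (all of these choices are illustrative, not fixed by the papers):

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """Beta-VAE objective for one sample: pixel-wise squared error as a
    stand-in for -log q_theta(x|z), plus beta times the closed-form KL
    between a diagonal-Gaussian encoder p_phi(z|x) = N(mu, diag(exp(logvar)))
    and a N(0, I) prior."""
    recon = np.sum((x - x_hat) ** 2)                            # reconstruction loss
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)  # compactness prior loss
    return recon + beta * kl

# when the posterior equals the prior, only the reconstruction term remains
x = np.zeros(8)
loss = beta_vae_loss(x, x + 0.1, np.zeros(2), np.zeros(2))
print(loss)  # 8 * 0.1**2 = 0.08 (up to float rounding)
```

Setting β = 1 recovers the plain VAE objective; β > 1 trades reconstruction quality for a more compact latent code.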


Code available at this repo


Beta-VAE (Higgins et al. 2017) - adds a hyperparameter β to weight the compactness prior term


Beta-VAE H (Burgess et al. 2018) - adds a hyperparameter C to control the compactness prior term




Factor-VAE (Kim & Mnih, 2018) - adds a total correlation loss term, estimated with a discriminator


Beta-Total-Correlation VAE (Chen et al. 2018) - same objective as Factor-VAE, but computed without a discriminator


TRIM (Singh et al. 2020) - uses attributions on transformations to learn simpler representations





\text{encoder}: p_\phi( \mathbf z\vert \mathbf x)


(Chen et al. 2016)

(Nonlinear) ICA

(Khemakhem et al. 2020)

\text{decoder}: q_{\mathbf \theta} ( \mathbf x\vert \mathbf z)
\underbrace{I(\mathbf x; \mathbf z)}_{\text{mutual info}} + \sum_i \underbrace{\text{KL} \left(p_\phi( \mathbf z_i)\:\vert\vert\:prior(\mathbf z_i) \right)}_{\text{factorial prior loss}} + \; \underbrace{\text{KL} \left( p_\phi(\mathbf z) \:\vert\vert\: \prod_i p_\phi( \mathbf z_i) \right)}_{\text{total correlation loss}}
\mathbf x
\mathbf{ \hat x}
\mathbf z
\mathbf \epsilon

encourages accurate reconstruction of the input

(note: this could be done with something smarter than a pixel-wise loss)

encourages points to be compactly placed in space;

this term can be further divided into 3 terms:

encourages latent variables to be independent

encourages mutual info between input and latent code to be high for a subset of the latent variables
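For intuition on the total-correlation term: when the aggregate posterior is (approximately) a zero-mean Gaussian, total correlation has a closed form, the sum of the marginal entropies minus the joint entropy. A small illustrative numpy sketch:

```python
import numpy as np

def total_correlation_gaussian(cov):
    """Total correlation KL(p(z) || prod_i p(z_i)) of a zero-mean Gaussian
    with covariance `cov`: sum of marginal entropies minus joint entropy,
    which reduces to 0.5 * (sum_i log var_i - log det cov)."""
    cov = np.asarray(cov, dtype=float)
    marginal = 0.5 * np.sum(np.log(np.diag(cov)))
    joint = 0.5 * np.log(np.linalg.det(cov))
    return marginal - joint

print(total_correlation_gaussian(np.eye(3)))                        # 0.0: independent latents
print(total_correlation_gaussian([[1.0, 0.9], [0.9, 1.0]]))         # > 0: entangled latents
```

Independent latent dimensions give zero total correlation; correlated ones are penalized, which is exactly the pressure toward a factorized (disentangled) code.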
\textbf{assumptions}\\ (1) \; \mathbf x\approx f(\mathbf z)\\ (2) \; \text{non-Gaussianity of } \mathbf z\\ (3) \; \text{independence: } P(\mathbf z) = \prod_i P(z_i)


maximize non-Gaussianity of z or minimize mutual information between its components
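A self-contained numpy sketch of this recipe, using a FastICA-style fixed-point iteration with a tanh contrast function (the mixing matrix, sample size, and tolerances are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# two independent, non-Gaussian (uniform) sources, linearly mixed: x = A s
s = rng.uniform(-1.0, 1.0, size=(2, 5000))
A = np.array([[2.0, 1.0], [1.0, 1.0]])
x = A @ s

# center and whiten so that cov(xw) = I
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
xw = E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ x

# each fixed-point step pushes w toward a direction of maximal
# non-Gaussianity; deflation keeps the recovered components decorrelated
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wx = w @ xw
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (xw * g).mean(axis=1) - g_prime.mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # deflate against found components
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < 1e-9
        w = w_new
        if converged:
            break
    W[i] = w

recovered = W @ xw  # matches s up to permutation, sign, and scale
```

Note that whitening alone (PCA) only decorrelates; the non-Gaussianity objective is what pins down the actual independent directions.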

\textcolor{orange}{\beta}\; \vert\sum_i \underbrace{\text{KL} \left(p_\phi( \mathbf z_i\vert \mathbf x)\:\vert\vert\:prior(\mathbf z_i) \right)}_{\text{compactness prior loss}} -C\vert
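A minimal numpy sketch of this controlled-capacity term, again assuming a diagonal-Gaussian encoder and N(0, I) prior (hyperparameter values are illustrative):

```python
import numpy as np

def capacity_kl_loss(mu, logvar, beta=100.0, C=5.0):
    """Burgess et al.-style term beta * |KL - C| for a diagonal-Gaussian
    encoder vs. a N(0, I) prior. In the paper C is annealed up from 0
    during training; the schedule is omitted here."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return beta * np.abs(kl - C)

# KL = 0 (posterior equals prior), so the penalty is beta * C
print(capacity_kl_loss(np.zeros(4), np.zeros(4)))  # 500.0
```

Rather than shrinking the KL toward zero, this pins it near a target capacity C, letting the code carry a controlled amount of information.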
preserves information between the latent space + input
encourages latent space to be decoupled


details + code


(Kingma & Welling, 2013)


(Pidhorskyi et al. 2020)

StyleGAN + StyleGAN2

(Karras et al. 2019)

disentangles by injecting the latent representation at different scales

+ TRIM loss

(Singh et al. 2020)

penalizes interpretations so that they are desirable (e.g. sparse, monotonic)


+ prediction loss

(Singh et al. 2020)

if we are given a trained predictor, we can minimize its error rather than simply reconstructing the input
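A minimal sketch of this swap, with a hypothetical fixed linear model standing in for the trained predictor:

```python
import numpy as np

def prediction_loss(x_hat, predictor, y):
    """Score reconstructions by a trained predictor's error on them,
    rather than by pixel-wise distance to the original input."""
    return np.mean((predictor(x_hat) - y) ** 2)

# hypothetical trained predictor: a fixed linear model
w = np.array([1.0, -2.0])
predictor = lambda x: x @ w

x = np.array([[1.0, 0.5], [0.0, 1.0]])
y = predictor(x)                   # targets computed from the original inputs
x_hat = x + np.array([2.0, 1.0])   # far off in pixel space, identical to the predictor
print(prediction_loss(x_hat, predictor, y))  # 0.0: the shift lies in the predictor's null space
```

The example makes the trade-off explicit: a reconstruction can be poor pixel-wise yet lossless with respect to the downstream prediction, which is all this objective asks for.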