When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces 1:1 dogs and cats better resembles the training species distribution than another model with 3:1 dogs and cats. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set regarding the distribution of attribute strengths as follows.
Single-attribute Divergence (SaD) measures the divergence between the probability density functions (PDFs) of a single attribute. Paired-attribute Divergence (PaD) measures the divergence between the joint PDFs of a pair of attributes. Both reveal which attributes the models struggle with. For measuring the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS), which measures the cosine similarity between image and text vectors with heterogeneous initial points.
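To ground the intuition from the dogs-and-cats example, the short sketch below compares each model's species distribution with the training distribution. The use of KL divergence here is our illustrative choice, not necessarily the exact divergence used by the metrics, and the variable names are ours.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

train  = np.array([0.50, 0.50])  # training set: 1:1 dogs to cats
model1 = np.array([0.50, 0.50])  # generates 1:1 dogs to cats
model2 = np.array([0.75, 0.25])  # generates 3:1 dogs to cats

print(entropy(train, model1))  # 0.0    -> matches the training distribution
print(entropy(train, model2))  # ~0.144 -> deviates from the training distribution
```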
Figure 1 illustrates the evaluation metrics for two models with distinct properties. While Model 1's generated images align closely with the training dataset, Model 2 exhibits a lack of diversity. Notably, in Figure 1a (gray box), Model 1 consistently outperforms Model 2 across all metrics. Yet these metrics fall short in explicability; for example, they do not highlight the overrepresentation of long hair and makeup in Model 2.
Addressing this gap, our paper proposes a methodology to quantify discrepancies between generated and training images, focusing on specific attributes. Figure 1b shows the concept of our alternative approach, which measures the distribution of attribute strengths compared to the training set: while Model 1 offers a balanced attribute distribution akin to the training dataset, Model 2 overemphasizes long hair and underrepresents beard.
Figure 2 illustrates an overview of our metrics. We measure attribute strengths for all images in both sets, and then calculate the divergence between the two sets for each attribute.
Single-attribute Divergence (SaD) measures how much a generative model deviates from the distribution of each attribute in the training data. For example, if a dataset contains dogs and cats but a generative model produces only dog images, it is not an ideal model because it never produces cats.
Paired-attribute Divergence (PaD) measures how much a generative model breaks the relationship between attributes in the training data, such as "babies do not have beards."
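A minimal sketch of both divergences is given below, assuming attribute strengths have already been measured for every image (e.g., with HCS, introduced next). Estimating the PDFs with histograms and comparing them with KL divergence are our illustrative choices, as are the helper names `sad` and `pad`.

```python
import numpy as np
from itertools import combinations
from scipy.stats import entropy  # KL divergence between two PMFs

def _kl_1d(x, y, bins=50):
    """KL divergence between two 1D samples via shared-bin histograms."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    eps = 1e-8  # smooth empty bins so the ratio is defined
    return entropy(p + eps, q + eps)

def _kl_2d(xy, uv, bins=20):
    """KL divergence between two 2D samples via shared-bin 2D histograms."""
    lo = np.minimum(xy.min(axis=0), uv.min(axis=0))
    hi = np.maximum(xy.max(axis=0), uv.max(axis=0))
    rng = [(lo[0], hi[0]), (lo[1], hi[1])]
    p, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=rng)
    q, _, _ = np.histogram2d(uv[:, 0], uv[:, 1], bins=bins, range=rng)
    eps = 1e-8
    return entropy((p + eps).ravel(), (q + eps).ravel())

def sad(train, gen):
    """Single-attribute Divergence: one divergence per attribute.
    train, gen: (num_images, num_attributes) attribute-strength matrices."""
    return {a: _kl_1d(train[:, a], gen[:, a]) for a in range(train.shape[1])}

def pad(train, gen):
    """Paired-attribute Divergence: one divergence per attribute pair,
    comparing the joint distribution of the two attributes' strengths."""
    return {(a, b): _kl_2d(train[:, [a, b]], gen[:, [a, b]])
            for a, b in combinations(range(train.shape[1]), 2)}
```

Averaging the per-attribute or per-pair values yields a single score, while the individual entries reveal exactly which attributes, or which attribute relationships, a model struggles with.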
To construct metrics that quantify the differences between two sets of images in an intelligible way, we introduce Heterogeneous CLIPScore (HCS), an advanced variant of CLIPScore. Unlike CLIPScore, HCS captures the similarity between two modalities—image and text—by setting distinct starting points for text and image vectors.
Figure 3 illustrates CLIPScore and Heterogeneous CLIPScore. (a) CLIPScore (CS) evaluates the similarity between \(V^{CS}_{img}\) and \(V^{CS}_{Text}\) from the coordinate origin, where the angle between the two vectors is bounded, resulting in a limited range of similarity values. (b) HCS gauges the similarity between \(V^{HCS}_{img}\) and \(V^{HCS}_{Text}\) using the defined means of images \(C_\mathcal{X}\) and texts \(C_\mathcal{A}\) as the origin, so the range of similarity is unrestricted. (c) HCS yields more flexible values than CS.
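Following the figure's description, here is a minimal sketch of HCS, assuming the open-source `clip` package with a ViT-B/32 backbone. Using the means of the given image batch and attribute-prompt set as stand-ins for \(C_\mathcal{X}\) and \(C_\mathcal{A}\) is our simplification, and the function name is ours.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def heterogeneous_clip_score(images, attribute_prompts):
    """HCS sketch: cosine similarity between centered image and text embeddings.
    images: preprocessed image batch; attribute_prompts: list of strings.
    Returns a (num_images, num_attributes) attribute-strength matrix."""
    img = model.encode_image(images.to(device)).float()
    txt = model.encode_text(clip.tokenize(attribute_prompts).to(device)).float()
    # Shift the origins to the modality means C_X (images) and C_A (texts),
    # as in Figure 3b, so similarities spread across the full range.
    img_c = img - img.mean(dim=0, keepdim=True)
    txt_c = txt - txt.mean(dim=0, keepdim=True)
    img_c = img_c / img_c.norm(dim=-1, keepdim=True)
    txt_c = txt_c / txt_c.norm(dim=-1, keepdim=True)
    return img_c @ txt_c.T  # cosine similarity from the new origins
```

The resulting attribute-strength matrix can feed directly into the `sad` and `pad` sketches above.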
Comparing the performance of generative models. We computed each generative model's performance on our metrics using their official pretrained checkpoints on FFHQ. We used 50,000 images for both the GT and generated sets, and USER attributes for this experiment.
Leveraging the sensitivity and discernment of our proposed metrics, we evaluate the performance of GANs and Diffusion Models (DMs) in the table above. The tendencies of SaD and PaD align with existing metrics. However, four notable points emerge:
LDM with 50 steps vs. LDM with 200 steps. With increased sampling timesteps, SaD of LDM gets worse because it generates too many fine objects such as earrings and necklaces. This might elucidate the prevalence of attributes like young, makeup, woman, and wavy hair, which tend to co-occur with such accessories. We posit that a dense sampling trajectory engenders more high-frequency objects.
Similar attribute-level discrepancies appear for attributes based on color (e.g., yellow fur, black fur) and on shape (e.g., pointy ears, long tail) within the LSUN Cat dataset.
SDv2.1 shows high divergence in attributes such as group¹ or plate². Particularly, SDv2.1 faces challenges in generating multiple people. This concurs with common assertions about SDv2.1, even though it reports a better FID score. In certain instances, users have noted SDv2.1's inferior performance compared to SDv1.5³. More details are provided in Appendix A.1.
¹ e.g., A group of people is standing around a large clock.
² e.g., A table is set with two plates of food and a candle.
³ For example, some users have reported SDv2.1's lower performance compared to SDv1.5 in a blog post available at AssemblyAI Blog.
For more detailed insights and context on these findings, we encourage reading the full research paper.