## Abstract

The goal of two-sample tests is to decide whether two probability distributions, denoted by P and Q, are equal. One way to construct flexible two-sample tests is to use binary classifiers. More specifically, pair n random samples drawn from P with a positive label, and pair n random samples drawn from Q with a negative label. If the null hypothesis “P = Q” is true, the held-out test accuracy of a binary classifier trained on these data should remain near chance level. Moreover, this test accuracy is an average of independent random variables, and thus approaches a Gaussian null distribution. Finally, the prediction uncertainty of the binary classifier can be used to interpret the particular differences between P and Q: we can analyze which samples were correctly or incorrectly labeled by the classifier, with the least or most confidence.
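The procedure described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the one-dimensional data, the logistic regression classifier fit by plain gradient descent, the 50/50 train/test split, and the hypothetical function names `fit_logistic` and `c2st`. The null distribution used for the p-value is the Gaussian approximation for an average of fair coin flips, with mean 1/2 and variance 1/(4 n_te), where n_te is the number of held-out points.

```python
import math
import random
from statistics import NormalDist


def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Fit p(y=1 | x) = sigmoid(w * x + b) on 1-D inputs by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def c2st(sample_p, sample_q, seed=0):
    """Return (held-out accuracy, one-sided p-value) for H0: P = Q."""
    # Label samples from P as 1 and samples from Q as 0, then shuffle.
    data = [(x, 1) for x in sample_p] + [(x, 0) for x in sample_q]
    random.Random(seed).shuffle(data)

    # Assumed split: first half for training, second half held out.
    half = len(data) // 2
    train, test = data[:half], data[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])

    # Held-out accuracy: fraction of test points labeled correctly.
    correct = sum((w * x + b > 0) == (y == 1) for x, y in test)
    acc = correct / len(test)

    # Under H0 the accuracy is approximately N(1/2, 1 / (4 * n_te)).
    null = NormalDist(mu=0.5, sigma=math.sqrt(0.25 / len(test)))
    p_value = 1.0 - null.cdf(acc)
    return acc, p_value


# Illustrative data (an assumption): P = N(0, 1) and Q = N(1, 1).
rng = random.Random(1)
sample_p = [rng.gauss(0.0, 1.0) for _ in range(500)]
sample_q = [rng.gauss(1.0, 1.0) for _ in range(500)]
acc, p_value = c2st(sample_p, sample_q)
print(f"accuracy={acc:.3f}, p-value={p_value:.3g}")
```

A small p-value suggests rejecting “P = Q”. The p-value is one-sided because only accuracies above chance level count as evidence against the null hypothesis. In practice, a more expressive classifier would likely replace the logistic regression used here.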

In this paper, we aim to revive interest in the use of binary classifiers for two-sample testing. To this end, we review their fundamentals and the previous literature on their use, compare their performance against alternative state-of-the-art two-sample tests, and propose them as a tool to evaluate generative adversarial network models applied to image synthesis.

As a by-product of our research, we propose the application of conditional generative adversarial networks, together with classifier two-sample tests, as an alternative approach to achieving state-of-the-art causal discovery.