In the previous post, I described an Unsupervised Domain Adaptation (UDA) method. In this post, I summarize another article about a UDA method: “Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation” by Lin Chen et al., presented at CVPR 2022.
The key innovation in this article is reusing the task-specific classifier as a discriminator, paired with a novel Nuclear-norm Wasserstein discrepancy (NWD). Let’s dive into why this matters and how it works.
Training datasets rarely represent the full variety of conditions under which a deep neural network is expected to work. A shift between the training and test distributions can significantly degrade the performance of a deep neural network. To address this problem, domain adaptation methods transfer knowledge from a labeled source domain (e.g., synthetic images) to an unlabeled target domain (e.g., real photos).

There are two main branches of domain adaptation.
The first group is moment matching methods, where the distributions of the source and target domains are aligned by minimizing the difference in their statistical moments (e.g., mean, variance).
Instead of matching raw source and target samples directly, these methods commonly match features extracted by a deep neural network.
The moment matching process involves computing a statistical moment (e.g., the first-order mean or the second-order variance/covariance) of the source and target features, and then using a loss function to reduce the difference between the moments of the two domains.
Three examples of moment matching methods are Maximum Mean Discrepancy (MMD), where the mean is the matched statistic; Correlation Alignment (CORAL), where second-order statistics are matched; and moment-distance methods. A small sketch of the first two is given below.
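Here is a minimal, illustrative sketch of the first two losses (my own code, not taken from any of these papers), assuming `source_feat` and `target_feat` are (batch, dim) feature tensors produced by a shared backbone:

```python
import torch

def linear_mmd(source_feat, target_feat):
    # First-order moment matching: squared distance between feature means
    # (the kernelized MMD used in practice is more general than this).
    return (source_feat.mean(dim=0) - target_feat.mean(dim=0)).pow(2).sum()

def coral_loss(source_feat, target_feat):
    # Second-order moment matching (CORAL-style): squared Frobenius distance
    # between the source and target feature covariance matrices.
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    d = source_feat.size(1)
    diff = covariance(source_feat) - covariance(target_feat)
    return diff.pow(2).sum() / (4 * d * d)
```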
In adversarial learning methods, a two-player min-max game is used to align the source and target distributions: a classifier is trained for the task-specific objective, a discriminator is trained to tell whether a sample was drawn from the source or the target distribution, and the feature extractor is trained to fool the discriminator.
Existing adversarial methods use either an extra discriminator or two task classifiers, but both designs can suffer from mode collapse or ambiguous predictions.
In this paper, the authors ask: can we simplify this by reusing the classifier itself as the discriminator?
To make this work, they propose a measure they call the Nuclear-norm Wasserstein Discrepancy (NWD) and use it as a regularizer. Therefore, in addition to the classification loss that is typical for a classification problem, an adversarial term is added to the objective.
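In my own notation (a sketch; the paper's exact symbols and sign conventions may differ), the overall objective combines the two terms as:

```latex
% L_cls: supervised classification loss on the labeled source data
% L_NWD: adversarial nuclear-norm Wasserstein discrepancy term
% \lambda: trade-off weight between the two terms
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{cls}} \;+\; \lambda\,\mathcal{L}_{\text{NWD}}
```

The NWD term is optimized adversarially: the classifier (acting as the critic) and the feature extractor pull it in opposite directions.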


To understand the norm the authors use, we need to take a deeper look at the intuition behind their formulation. They compute the self-correlation matrix of the predictive probabilities for both the source and the target dataset. They observe that, on the source data, the diagonal elements of the self-correlation matrix carry a strong signal while the sum of the off-diagonal elements is small. On the target data the opposite happens: the sum of the diagonal elements is small and off-diagonal elements start to appear, due to the shift between the two domains.
Consequently, the authors use a metric that regularizes this phenomenon to compensate for the domain shift. The self-correlation matrix is built from the prediction matrix, which contains the predictive probabilities as its rows (a rough reconstruction of the formulation is given after this paragraph). The sum of the diagonal elements of the self-correlation matrix equals the squared Frobenius norm of the prediction matrix, and the sum of the off-diagonal elements is denoted I_e.
Finally, the new domain discrepancy measure is defined as the difference between these two quantities.
A nice property of this discrepancy is that it still equals the squared Frobenius norm of the prediction matrix Z, up to constants that can be ignored. Therefore, the critic function used in adversarial learning can simply be the Frobenius norm of the prediction matrix.
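A rough reconstruction of the formulation, in my own notation (the paper's exact symbols may differ):

```latex
% Z \in \mathbb{R}^{B \times K}: prediction matrix for a batch of B samples,
% whose rows are softmax probabilities over K classes (each row sums to 1).
\begin{aligned}
R &= Z^{\top} Z && \text{(self-correlation matrix)}\\
\operatorname{tr}(R) &= \lVert Z \rVert_F^{2} && \text{(sum of diagonal elements)}\\
I_e &= \textstyle\sum_{i \neq j} R_{ij} = B - \lVert Z \rVert_F^{2} && \text{(off-diagonal sum, since rows of } Z \text{ sum to 1)}\\
\operatorname{tr}(R) - I_e &= 2\lVert Z \rVert_F^{2} - B && \text{(the discrepancy: a squared Frobenius norm up to constants)}
\end{aligned}
```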

Training adversarial networks is well known to be delicate and unstable. In some cases, the model fails to cover a broad range of the data and only a few dominant modes are captured. This problem is called mode dropping, and it is partially addressed by using Wasserstein distances in the optimization process.
The authors argue that using the Frobenius norm could also lead to mode dropping. Therefore, they suggest replacing the Frobenius norm with the nuclear norm, and they call the resulting measure the Nuclear-norm Wasserstein Discrepancy (NWD).
The NWD enables the adversarial UDA paradigm to satisfy the K-Lipschitz constraint (to avoid exploding gradients) without setting up additional weight clipping or a gradient penalty, unlike WGAN. The authors also claim that maximizing the nuclear norm increases the rank of the prediction matrix, which leads to better prediction diversity.
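As a small illustrative sketch (my own code, not the authors' implementation), the only change from the Frobenius-norm critic to the NWD critic is the norm applied to the batch prediction matrix:

```python
import torch

def frobenius_critic(Z):
    # Z: (batch, classes) softmax prediction matrix.
    # Equivalent, up to constants, to the diagonal-minus-off-diagonal discrepancy above.
    return torch.linalg.matrix_norm(Z, ord='fro')

def nuclear_critic(Z):
    # NWD critic: sum of singular values of the prediction matrix.
    # Encourages higher-rank, more diverse predictions and, per the paper's claim,
    # behaves as a K-Lipschitz critic without weight clipping or a gradient penalty.
    return torch.linalg.matrix_norm(Z, ord='nuc')
```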

The DALN network has a simple structure. A pre-trained ResNet is used as the feature-extracting backbone (followed by a bottleneck layer), and a fully connected layer with a softmax layer is used as the task-specific classifier.
The ResNet backbone mainly consists of convolutional layers.
There are two losses that add up: the classification loss and the adversarial loss, which is the regularizer based on the nuclear norm. DALN can be trained by playing the min-max game on this loss function, roughly as sketched below.
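Putting the pieces together, here is a hypothetical sketch of one training step (my own code, not the authors' implementation; the helper names, the gradient-reversal trick used to fold the min-max game into a single backward pass, and the sign/normalization conventions are assumptions):

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass, negated gradient in the backward pass,
    # so a single backward pass implements the min-max game.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def nuclear_norm(logits):
    # Nuclear norm of the (batch, classes) softmax prediction matrix.
    return torch.linalg.matrix_norm(F.softmax(logits, dim=1), ord='nuc')

def daln_step(backbone, classifier, optimizer, xs, ys, xt, lambda_nwd=1.0):
    optimizer.zero_grad()

    # Task-specific classification loss on the labeled source batch.
    feat_s = backbone(xs)
    cls_loss = F.cross_entropy(classifier(feat_s), ys)

    # Adversarial NWD term: the classifier is reused as the critic.
    # Through the gradient reversal, the classifier widens the source/target
    # nuclear-norm gap while the backbone learns to shrink it.
    feat_t = backbone(xt)
    gap = (nuclear_norm(classifier(GradReverse.apply(feat_s)))
           - nuclear_norm(classifier(GradReverse.apply(feat_t))))
    nwd_loss = -gap / xs.size(0)  # minimizing this lets the classifier maximize the gap

    loss = cls_loss + lambda_nwd * nwd_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

The gradient reversal layer is a common way to implement such min-max objectives with a single optimizer; the authors' released code may organize the update differently.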

Only one hyperparameter, lambda, is used in the DALN regularization term to balance the effect of the regularization.
The regularization term can also be integrated into other domain adaptation techniques: if a method already has two cost terms, namely a classification cost and a method-specific cost, the NWD term is simply added as a third term in the modified approach (see the sketch below).
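A trivial sketch of that composition (all names here are placeholders of my own; for MCC, `method_loss` would stand for the MCC cost, and `nwd_loss` would be computed as in the DALN sketch above):

```python
def regularized_loss(cls_loss, method_loss, nwd_loss, lambda_nwd=1.0):
    # Three-term objective: task loss + method-specific loss + NWD regularizer.
    return cls_loss + method_loss + lambda_nwd * nwd_loss
```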

Four datasets were considered in the experimental setup.
Office-Home is a dataset with more than 15 thousand images in 65 categories, spread over four very different domains. Office-31 is a dataset of approximately 4 thousand images in three domains, each containing 31 categories. ImageCLEF has three domains derived from three public datasets, Caltech (C), ImageNet (I), and Pascal (P), with 12 categories and 50 images per category. VisDA-2017 is a large-scale synthetic-to-real dataset that contains 280K images in 12 classes.

The most important results of the article come from the benchmark comparisons. On the Office-Home dataset, DALN is superior in only one case, while MCC with NWD is the winning model in most cases; overall, the average score improves by only about 1%. On VisDA-2017, MCC with NWD outperforms DALN by 2.9% accuracy and is the most accurate model on this dataset. On Office-31 and ImageCLEF-2014, MCC+NWD wins in most categories, with an average improvement of about 1%.
Article Strengths
I will state the biggest strengths and weaknesses of the article in my opinion. A domain discrepancy measure is presented and empirically demonstrated to be a good means of regularizing some domain adaptation problems. The structure of the proposed DALN model is simpler than that of its rival models, yet it appears to be effective. The nuclear norm is presented as a plug-and-play regularization term for other domain adaptation methods. The t-SNE visualization of the classes shows clearly why domain adaptation is an essential concept for handling the domain shift between source and target.
Possible Problems
The following possible problems are just my opinion, so you may come up with your own list of pros and cons.
The presented model (DALN) has structural limitations; e.g., Vision Transformers (ViT) could have been used in the backbone instead of convolutional layers. DALN is stated as the main focus of the article; however, MCC regularized with the nuclear norm achieves the best results and outperforms the DALN model by a small margin. It is therefore not clear why the focus of the article is on DALN instead of on presenting the regularizer itself. The theoretical statements are not sufficiently precise: some technical combinations don't make much sense, and some proofs are missing.
Moving from the Frobenius norm to the nuclear norm seems intuitively right, but it is not sufficiently justified. If we consider the nuclear norm as an L1 counterpart of the Frobenius norm in Euclidean space, then moving from L2 to L1 sometimes improves the solution of an ill-posed inverse problem, but this analogy is not developed further. Because the source domain is used as the basis for the self-correlation behavior, the model is prone to whatever problems exist in the source, such as a significant domain shift. Figure 4 in the article shows the confusion matrices of different methods on the target domain; it is not clear why the confusion matrix for MCC+NWD, which is the winning and most important model, is missing from this figure. The possible weaknesses of the model are not discussed at all. The method is built on strong assumptions about the similarity of the overall shapes of the source and target distributions, and it is not hard to imagine cases where this assumption fails and the method fails with it. An ablation analysis that would show the isolated effect of the NWD term is also missing.