
Decoupled Knowledge Distillation Explained

Abstract
This article introduces the Decoupled Knowledge Distillation (DKD) paper and tries to explain it.

Motivation

Currently, logit distillation is easy to use but not as effective as feature distillation. However, since the logits are the deepest and most abstract features, it is reasonable to expect that distilling them should achieve better results. Therefore, the authors argue that the potential of logit distillation has not been fully explored.

Method

Starting from the KL-divergence formula, the authors split KD into $TCKD$ (target-class KD) and $NCKD$ (non-target-class KD). The full derivation is given in the detail section below.

From the original KL formula, it can be shown that $KD=TCKD+(1-p_{t}^{T})\,NCKD$, where $p_{t}^{T}$ is the teacher's probability on the target class. If the teacher is reliable, $p_{t}^{T}$ is close to 1, so the weight $(1-p_{t}^{T})$ on NCKD is very small, which limits NCKD's contribution. Therefore, the authors propose to decouple the two weights: \[ \text{DKD}=\alpha\,\text{TCKD}+\beta\,\text{NCKD} \]
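A minimal PyTorch-style sketch of such a decoupled loss is shown below. This is not the authors' official implementation; the function name, argument names, and hyper-parameter values (alpha, beta, temperature) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def dkd_loss(logits_student, logits_teacher, target, alpha=1.0, beta=8.0, temperature=4.0):
    # Sketch of a decoupled KD loss: alpha * TCKD + beta * NCKD.
    # logits_*: (batch, C) raw logits; target: (batch,) ground-truth class indices.
    # Hyper-parameter values are illustrative, not the paper's settings.
    p_s = F.softmax(logits_student / temperature, dim=1)
    p_t = F.softmax(logits_teacher / temperature, dim=1)
    mask = F.one_hot(target, num_classes=p_s.size(1)).bool()

    # TCKD: KL between the binary (target vs. non-target) distributions.
    pt_s = p_s[mask].unsqueeze(1)
    pt_t = p_t[mask].unsqueeze(1)
    b_s = torch.cat([pt_s, 1.0 - pt_s], dim=1)
    b_t = torch.cat([pt_t, 1.0 - pt_t], dim=1)
    tckd = (b_t * (b_t.log() - b_s.log())).sum(dim=1)

    # NCKD: KL between the distributions renormalised over non-target classes.
    nt_s = p_s.masked_fill(mask, 0.0)
    nt_t = p_t.masked_fill(mask, 0.0)
    nt_s = nt_s / nt_s.sum(dim=1, keepdim=True)
    nt_t = nt_t / nt_t.sum(dim=1, keepdim=True)
    nckd = (nt_t * (nt_t.clamp_min(1e-12).log() - nt_s.clamp_min(1e-12).log())).sum(dim=1)

    return (alpha * tckd + beta * nckd).mean()
```

In this view, vanilla KD corresponds to fixing the NCKD weight at $(1-p_{t}^{T})$, while DKD simply treats $\alpha$ and $\beta$ as free hyper-parameters.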

Conclusion

$NCKD$ is useful. The authors designed experiments, detailed in the paper, which show that $NCKD$ is in fact critical. However, what $NCKD$ conveys is less clear: it is the so-called dark knowledge, which remains a vague notion that lacks a precise explanation. $TCKD$, on the other hand, is easy to interpret: it measures the gap between the prediction and the correct answer, and thus reflects the difficulty of the data. Experiments show that the harder the data, the more effective $TCKD$ is.

Extra

dark knowledge

So, what’s dark knowledge?

Dark knowledge is the information carried by the non-target classes and the relationships between them. For example, suppose a classifier outputs:

class prob
cat 0.05
dog 0.87
goose 0.08

There is no doubt that the sample is classified as a dog. But why is the goose's probability larger than the cat's? That is dark knowledge.

Usually, dark knowledge can be discarded. Sometimes, however, it should not be, because it captures exactly what the model has learned about how to classify. So when performing knowledge transfer or KD, you probably want to keep the teacher's dark knowledge.
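As a tiny illustration (plain Python, with logits I made up so that the softmax roughly reproduces the table above), raising the softmax temperature, as is standard in KD, makes these small non-target probabilities and their relative order much more visible to the student:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for [cat, dog, goose].
logits = [0.0, 2.86, 0.47]

print(softmax(logits, temperature=1.0))  # ~[0.05, 0.87, 0.08]: dog clearly wins
print(softmax(logits, temperature=4.0))  # ~[0.24, 0.49, 0.27]: goose > cat stands out
```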

Thanks for this post.

detail

Given a sample whose ground-truth label is the $t$-th of $C$ classes, let $p=[p_1, p_2, \dots, p_C]$ be the predicted probabilities, where $p_i=\mathrm{softmax}(z_i)$ and $z_i$ is the logit for class $i$. Therefore, $$p_i=\frac{\exp(z_i)}{\sum_{j=1}^{C}\exp(z_j)}$$

For the target class $t$, $p_t=\frac{\exp(z_t)}{\sum_{i=1}^{C}\exp(z_i)}$, and the total probability of all non-target classes is $\mathring{p_t}=\frac{\sum_{i=1, i\neq t}^{C}\exp(z_i)}{\sum_{i=1}^{C}\exp(z_i)}=1-p_t$.

Also define, for each non-target class $i\neq t$, the renormalised probability $\tilde{p_i}=\frac{p_i}{\sum_{k=1, k\neq t}^{C}p_k}$. Then, for every $i\neq t$, \[ p_i=\tilde{p_i}\times\mathring{p_t} \]
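These quantities are straightforward to read off from the softmax output. Here is a small NumPy sketch (the probabilities and target index are made up for illustration):

```python
import numpy as np

# Hypothetical softmax probabilities over C = 5 classes; the target class is t = 2.
p = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
t = 2

p_target = p[t]                            # p_t
p_nontarget = 1.0 - p[t]                   # \mathring{p_t}: total non-target mass
p_tilde = np.delete(p, t) / p_nontarget    # \tilde{p}: renormalised non-target distribution

# For every non-target class i, p_i = \tilde{p_i} * \mathring{p_t}.
assert np.allclose(np.delete(p, t), p_tilde * p_nontarget)
```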

So, let's refactor! \[ \begin{aligned} KD &= KL(p^{T}\,\|\,p^{S}) \\ &= p_t^T\log\frac{p_t^T}{p_t^S} + \sum_{i=1,i\neq t}^{C} p_i^T\log\frac{p_i^T}{p_i^S} \\ &= p_t^T\log\frac{p_t^T}{p_t^S} + \sum_{i=1,i\neq t}^{C} \tilde{p_i}^T\,\mathring{p_t}^T\left(\log\frac{\tilde{p_i}^T}{\tilde{p_i}^S}+\log\frac{\mathring{p_t}^T}{\mathring{p_t}^S}\right) \end{aligned} \] Because $\log\frac{\mathring{p_t}^T}{\mathring{p_t}^S}$ is a constant with respect to $i$ and $\sum_{i\neq t}\tilde{p_i}^T=1$, we can go on: \[ \begin{aligned} KD &= \left(p_t^T\log\frac{p_t^T}{p_t^S} + \mathring{p_t}^T\log\frac{\mathring{p_t}^T}{\mathring{p_t}^S}\right) + \mathring{p_t}^T\sum_{i=1,i\neq t}^{C}\tilde{p_i}^T\log\frac{\tilde{p_i}^T}{\tilde{p_i}^S} \\ &= KL(\text{binary: target vs. non-target}) + \mathring{p_t}^T\,KL(\text{non-target classes}) \\ &= TCKD + (1-p_t^T)\,NCKD \end{aligned} \]
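The decomposition is easy to verify numerically. Below is a small NumPy check with made-up teacher and student distributions; it only confirms the algebra above, not anything about training:

```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

# Made-up teacher/student probabilities over C = 4 classes; target class t = 0.
p_T = np.array([0.70, 0.15, 0.10, 0.05])
p_S = np.array([0.50, 0.20, 0.20, 0.10])
t = 0

kd = kl(p_T, p_S)                             # full KD loss: KL(p^T || p^S)

b_T = np.array([p_T[t], 1.0 - p_T[t]])        # teacher binary distribution
b_S = np.array([p_S[t], 1.0 - p_S[t]])        # student binary distribution
tckd = kl(b_T, b_S)

nt_T = np.delete(p_T, t) / (1.0 - p_T[t])     # teacher non-target distribution
nt_S = np.delete(p_S, t) / (1.0 - p_S[t])     # student non-target distribution
nckd = kl(nt_T, nt_S)

# KD = TCKD + (1 - p_t^T) * NCKD
assert np.isclose(kd, tckd + (1.0 - p_T[t]) * nckd)
```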

This completes the derivation.