Learning to Specialize with Knowledge Distillation for Visual Question Answering
Jonghwan Mun, Kimin Lee, Jinwoo Shin and Bohyung Han
Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks, defined by questions, within a unified framework. Learning specialized models for individual types of tasks is intuitively appealing but surprisingly difficult; it is not straightforward to outperform naive independent ensemble approaches. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning framework. Training examples are dynamically assigned to a subset of models so that each model specializes on the examples it receives. For each example, the assigned models are trained to predict the ground-truth answer, while the non-assigned models are trained to imitate their own base models, i.e., the models as they were before specialization. Our approach alleviates the problem of data deficiency, which is a critical limitation of existing multiple choice learning frameworks, and uses knowledge distillation to let each model learn its own specialized expertise without forgetting general knowledge. Our experiments show that the proposed algorithm outperforms naive ensemble methods and other baselines in VQA. Our framework is also effective for more general tasks, e.g., image classification with a large number of labels, which is known to be difficult under existing multiple choice learning schemes.
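The training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each example is assigned to the k models with the lowest task loss on that example (a common choice in multiple choice learning; the abstract does not state the assignment rule), that the task loss is cross-entropy, and that imitation is measured by the KL divergence from a frozen base model to its specialized copy. The names `specialization_loss`, `k`, and `beta` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # numerical guard for log(0)


def cross_entropy(probs, labels):
    """Per-example cross-entropy; probs is (N, C), labels is (N,)."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels] + EPS)


def kl_divergence(p, q):
    """Per-example KL(p || q) for row-wise distributions of shape (N, C)."""
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=1)


def specialization_loss(model_probs, base_probs, labels, k=1, beta=1.0):
    """Sketch of a specialization loss with distillation (assumed form).

    model_probs: (M, N, C) predictions of the M models being specialized.
    base_probs:  (M, N, C) predictions of each model's frozen base copy.
    labels:      (N,) integer ground-truth answers.
    k:           number of models assigned to each example (assumption).
    beta:        weight of the distillation term (assumption).
    Returns the mean loss over the batch and the (M, N) assignment mask.
    """
    m = model_probs.shape[0]
    # Task loss of every model on every example, shape (M, N).
    task = np.stack([cross_entropy(model_probs[i], labels) for i in range(m)])
    # Distillation loss of every model against its own base model, shape (M, N).
    distill = np.stack(
        [kl_divergence(base_probs[i], model_probs[i]) for i in range(m)]
    )
    # Assumption: assign each example to the k models with the lowest task loss.
    order = np.argsort(task, axis=0)
    assigned = np.zeros(task.shape, dtype=bool)
    np.put_along_axis(assigned, order[:k], True, axis=0)
    # Assigned models pay the task loss; the others pay the distillation loss.
    per_example = np.where(assigned, task, beta * distill).sum(axis=0)
    return per_example.mean(), assigned
```

In an actual training loop the predictions would come from differentiable models, so the same selection logic would be written in an automatic differentiation framework; the NumPy version only shows how the two loss terms are routed per example.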