Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

1 University of Massachusetts Amherst 2 MIT-IBM Watson AI Lab 3 The University of Hong Kong
CVPR 2023

Abstract

Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as gradients from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is modularized into groups of experts (a 'squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts is activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCAL-Context dataset with 5 vision tasks show the superiority of our approach.


Key motivation: a sparse and strong dependence between experts and tasks

Our key motivation is that experts should leverage commonalities across some tasks (cooperation), while focusing on the subset of tasks that require their specific features and not interfering with each other (specialization).

A comparison between Mod-Squad and MoE ViT.

A real visualization of the relation between tasks and experts:

A comparison between Mod-Squad and other MoE models. The y-axis represents the tasks and the x-axis represents the 15 experts. Our frequency map is much sharper and sparser than those of the other methods.


Mod-Squad Framework

A key design in our model is customizing MoE into the vision transformer so that each expert forms a minimal unit of the model that can be either shared between tasks or specialized for a subset of tasks.
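To make this concrete, below is a minimal sketch (not the released implementation) of a task-conditioned MoE layer in PyTorch; the per-task linear routers, the expert MLP shape, and all hyper-parameter values are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoE(nn.Module):
    # Sketch of a task-conditioned MoE layer: each task owns a small router
    # that picks its top-k experts, so an expert is a minimal unit that can
    # be shared across tasks or claimed by a single task.
    def __init__(self, dim, num_experts=16, top_k=4, num_tasks=13, mlp_ratio=4):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                          nn.Linear(dim * mlp_ratio, dim))
            for _ in range(num_experts)])
        # One lightweight linear router per task (an assumption of this sketch).
        self.routers = nn.ModuleList([nn.Linear(dim, num_experts)
                                      for _ in range(num_tasks)])

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); gates: (batch, tokens, num_experts)
        gates = F.softmax(self.routers[task_id](x), dim=-1)
        top_w, top_idx = gates.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize kept gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx = top_idx[..., k]                # chosen expert per token
            w = top_w[..., k].unsqueeze(-1)      # its gate weight
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)  # tokens routed to expert e
                if mask.any():
                    out = out + mask.float() * w * expert(x)  # dense but readable
        return out, gates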

The pipeline of our multi-task foundation model. Each transformer block in Mod-Squad consists of a MoE attention network (MoE attn.) and a MoE MLP network. The multi-task model Mod-Squad is trained with our proposed mutual information loss.
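As a rough sketch of how one block could compose the two sub-layers (building on the TaskMoE sketch above): the real MoE attention mixes expert attention heads, which is simplified here to a plain multi-head attention for brevity, so treat this only as an illustration of the block layout.

import torch.nn as nn

class ModSquadBlock(nn.Module):
    # Illustrative block: attention followed by a task-conditioned MoE MLP,
    # each with layer norm and a residual connection (pre-norm ViT layout).
    def __init__(self, dim, num_heads=8, num_experts=16, top_k=4, num_tasks=13):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe_mlp = TaskMoE(dim, num_experts, top_k, num_tasks)  # from the sketch above

    def forward(self, x, task_id):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        mlp_out, gates = self.moe_mlp(self.norm2(x), task_id)
        return x + mlp_out, gates  # gates can feed the mutual-information loss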

Mutual Information between Experts and Tasks

Maximizing the mutual information between tasks and experts develops a sharp and sparse dependence between them: each task activates only a few experts, and each expert specializes in only a few tasks.
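In standard form, I(T; E) = sum over t, e of P(t, e) log [ P(t, e) / (P(t) P(e)) ]. Below is a minimal sketch of one way such a loss could be computed, assuming P(e | t) is estimated by averaging the router's gate probabilities over a task's tokens in a batch and P(t) is uniform; the exact estimator in the paper may differ.

import torch

def mutual_information_loss(gate_probs, task_probs=None, eps=1e-8):
    # gate_probs: (num_tasks, num_experts); row t estimates P(e | t) from the
    # router's average gate probabilities on task t's tokens (rows sum to 1).
    # task_probs: optional (num_tasks,) prior P(t); uniform if omitted.
    num_tasks, _ = gate_probs.shape
    if task_probs is None:
        task_probs = gate_probs.new_full((num_tasks,), 1.0 / num_tasks)
    joint = task_probs.unsqueeze(1) * gate_probs   # P(t, e), shape (T, E)
    p_e = joint.sum(dim=0, keepdim=True)           # marginal P(e), (1, E)
    p_t = task_probs.unsqueeze(1)                  # marginal P(t), (T, 1)
    mi = (joint * (torch.log(joint + eps)
                   - torch.log(p_t + eps)
                   - torch.log(p_e + eps))).sum()
    return -mi  # minimize the negative to maximize I(T; E)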


Extracting Sub-Network for an Individual Task

Mod-Squad can be pruned into a small standalone model for each task, without additional training and with no performance drop.
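As an illustration only (the function and argument names here are hypothetical), extraction can be thought of as keeping, for the chosen task, the experts that the routers activate with non-negligible frequency and dropping everything else from the checkpoint:

def extract_subnetwork(expert_usage, threshold=0.01):
    # expert_usage: dict mapping layer name -> list of per-expert activation
    # frequencies for the target task (each list sums to ~1).
    # Returns the expert indices to keep per layer; the remaining experts and
    # the routers of other tasks can be removed, leaving a small standalone model.
    keep = {}
    for layer, freqs in expert_usage.items():
        keep[layer] = [i for i, f in enumerate(freqs) if f > threshold]
    return keep

# e.g. keep = extract_subnetwork({"block0.moe_mlp": [0.52, 0.31, 0.12, 0.05, 0.0]})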

Strong Multi-task Foundation Model

Mod-Squad learns the relations between tasks and is a strong MTL learner.

Citation


@article{chen2022modsquad,
  title={Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners},
  author={Zitian Chen and Yikang Shen and Mingyu Ding and Zhenfang Chen and Hengshuang Zhao and Erik Learned-Miller and Chuang Gan},
  journal={CVPR},
  year={2023}
}