Task-specific fine-tuning of pre-trained transformers has achieved performance breakthroughs on multiple NLP tasks. Yet, because both computation and parameter size grow linearly with the number of sub-tasks, such methods are increasingly difficult to deploy in the real world due to their prohibitive memory and computation overhead on computing devices. Previous work on fine-tuning focuses on reducing the growing parameter size through parameter sharing to save storage cost. However, compared to storage, computation is the more critical constraint for fine-tuned models in modern computing environments, and prior work falls short on computation reduction.
To enable efficient fine-tuning, we propose LeTS, a framework that leverages both computation and parameter sharing across multiple tasks. LeTS rests on two principles. First, LeTS decouples the computation dependencies of the traditional fine-tuning model via a novel neural architecture that reuses the intermediate results computed from the pre-trained model and the input; we further leverage differentiable neural architecture search to determine a task-specific computation-sharing scheme. Second, by treating the final weights as a weight difference added to the pre-trained weights, we propose a novel early-stage pruning approach that generates a sparsity mask at the beginning of fine-tuning. Combining these two principles, LeTS further reduces the computation demand by exploiting the sparsity of the weight difference. Extensive experiments show that, with 1.4% extra parameters per task, LeTS reduces computation by 49.5% on the GLUE benchmark with only 0.2% accuracy loss compared to full fine-tuning.
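The core arithmetic behind both principles can be sketched in a few lines. The following is an illustrative NumPy sketch under our own assumptions, not the authors' implementation: all variable names are hypothetical, and a simple magnitude threshold stands in for the paper's early-stage pruning. The point it demonstrates is that writing the fine-tuned weight as the pre-trained weight plus a sparse, masked weight difference lets the dense pre-trained computation be shared across tasks, leaving only a sparse task-specific term.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W_pre = rng.standard_normal((d, d))         # pre-trained weight, shared by all tasks
delta = rng.standard_normal((d, d)) * 0.01  # task-specific weight difference
# Illustrative mask: keep only the largest-magnitude 10% of delta's entries
# (a stand-in for the early-stage pruning mask).
mask = np.abs(delta) > np.quantile(np.abs(delta), 0.9)

x = rng.standard_normal(d)

# Traditional fine-tuning: a full dense matmul per task with W_pre + masked delta.
y_full = (W_pre + mask * delta) @ x

# Shared computation: W_pre @ x is computed once and reused across tasks;
# only the sparse masked-delta product is task-specific.
shared = W_pre @ x
y_shared = shared + (mask * delta) @ x

assert np.allclose(y_full, y_shared)
```

Because the mask zeroes out roughly 90% of the weight difference, the per-task term can in principle be evaluated with sparse kernels, which is where the computation saving beyond parameter sharing comes from.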