The competition will provide participants with a list of expert models that have already been trained on task-specific datasets. All of these models will be publicly available on the Hugging Face Model Hub under licenses that permit their use for research purposes. These models may be either fully fine-tuned models or models obtained via parameter-efficient fine-tuning methods such as LoRA. Models on this list will be required to satisfy the following criteria: (1) model size $\leq 8$B parameters, and (2) a license compatible with research use (e.g., MIT, Apache 2.0, etc.). The goal of this competition is to reuse the provided models to create a generalist model that performs well across a wide variety of skills such as reasoning, coding, math, chat, and tool use. The list will also include popular pre-trained base models such as LLaMA-7B, Mistral-7B, and Gemma-7B.
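For concreteness, the snippet below sketches how a provided expert, whether fully fine-tuned or a LoRA adapter, might be loaded from the Hub. This is a minimal sketch: the repo IDs are placeholders rather than the actual competition model list.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel  # only needed for LoRA-style experts

# Placeholder repo IDs, not the actual competition model list.
BASE_REPO = "mistralai/Mistral-7B-v0.1"   # an example <= 8B base model
ADAPTER_REPO = "org/expert-lora-adapter"  # hypothetical LoRA expert

tokenizer = AutoTokenizer.from_pretrained(BASE_REPO)
model = AutoModelForCausalLM.from_pretrained(BASE_REPO, torch_dtype="auto")

# For a parameter-efficient expert, the adapter is loaded on top of the
# base model; merge_and_unload() folds the LoRA weights into the base
# weights so the expert looks like an ordinary dense checkpoint.
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
model = model.merge_and_unload()
```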

Along with these expert models, we also plan to provide two types of datasets: (1) a list of re-calibration datasets that can be used to tune the hyperparameters of merging methods, perform additional training steps, learn a routing mechanism, or calibrate the final model, and (2) a set of validation tasks that participants can use to evaluate their final method locally. The datasets will be released as part of the participants' starter kit and are already hosted on the Hugging Face Hub under a permissive license. In addition, two sets of hidden tasks will be used to evaluate participants' submissions: (1) a set of leaderboard ranking test tasks, and (2) a set of final ranking test tasks. The leaderboard ranking tasks will have some overlap with the final ranking test tasks to provide an additional signal to participants.
The validation datasets are chosen to measure the time and space efficiency of the merging method; they are not meant to benchmark its performance.
We will not collect or release any new datasets for training or evaluation as part of this competition.
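As an illustration of use (1), the sketch below tunes a single merging hyperparameter against a re-calibration set. The repo IDs, the toy re-calibration texts, and the simple linear merge are all assumptions for illustration, not the competition's prescribed method or data.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical expert repo IDs; substitute entries from the provided list.
REPO_A, REPO_B = "org/expert-a", "org/expert-b"

tok = AutoTokenizer.from_pretrained(REPO_A)
model_a = AutoModelForCausalLM.from_pretrained(REPO_A)
model_b = AutoModelForCausalLM.from_pretrained(REPO_B)
merged = AutoModelForCausalLM.from_pretrained(REPO_A)  # container for merged weights

# Toy stand-in for re-calibration examples; in practice, load the provided
# dataset from the Hub, e.g. with datasets.load_dataset(...).
texts = ["def add(a, b):\n    return a + b", "The capital of France is Paris."]

@torch.no_grad()
def recalibration_loss(model):
    """Average causal-LM loss over the re-calibration examples."""
    total = 0.0
    for t in texts:
        batch = tok(t, return_tensors="pt")
        total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / len(texts)

def linear_merge(sd_a, sd_b, alpha):
    """Interpolate two compatible state dicts: alpha * A + (1 - alpha) * B."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

best_alpha, best_loss = None, float("inf")
for alpha in (0.25, 0.5, 0.75):  # tiny grid over the merging hyperparameter
    merged.load_state_dict(linear_merge(model_a.state_dict(),
                                        model_b.state_dict(), alpha))
    loss = recalibration_loss(merged)
    if loss < best_loss:
        best_alpha, best_loss = alpha, loss
print(f"selected interpolation weight: {best_alpha}")
```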

Validation Dataset List:

The main purpose of the validation datasets is to measure time and space efficiency; performance is measured on the hidden list of test datasets described above.
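One plausible way to profile a merging method locally against these efficiency criteria is sketched below; run_merge is a toy stand-in for a participant's method, and the official harness may measure efficiency differently.

```python
import time
import torch

def run_merge(state_dicts):
    """Toy stand-in for a merging method: element-wise average of experts."""
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

# Toy "expert" weights; in the competition these would be the provided experts.
experts = [{"w": torch.randn(1024, 1024)} for _ in range(3)]

start = time.perf_counter()
merged = run_merge(experts)
elapsed = time.perf_counter() - start

merged_bytes = sum(t.numel() * t.element_size() for t in merged.values())
print(f"merge wall-clock time: {elapsed:.3f} s")
print(f"merged weights size: {merged_bytes / 1e6:.1f} MB")
```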