AdaMix is a parameter-efficient fine-tuning (PEFT) method for large language models that outperforms both full fine-tuning and existing PEFT approaches like LoRA and adapters. By using a mixture of adaptation modules with stochastic routing and merging, AdaMix trains only 0.1–0.2% of parameters while maintaining the same computational cost as baseline PEFT methods. This innovation dramatically reduces storage needs and boosts performance across NLU and NLG tasks, making it one of the most effective fine-tuning techniques to date.

How to Improve AI Models While Training Only 0.1% of Parameters

2025/10/01 15:00

:::info Authors:

(1) Yaqing Wang, Purdue University (wang5075@purdue.edu);

(2) Sahaj Agarwal, Microsoft (sahagar@microsoft.com);

(3) Subhabrata Mukherjee, Microsoft Research (submukhe@microsoft.com);

(4) Xiaodong Liu, Microsoft Research (xiaodl@microsoft.com);

(5) Jing Gao, Purdue University (jinggao@purdue.edu);

(6) Ahmed Hassan Awadallah, Microsoft Research (hassanam@microsoft.com);

(7) Jianfeng Gao, Microsoft Research (jfgao@microsoft.com).

:::

Abstract and 1. Introduction

  2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

  3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency regularization

    3.3 Adaptation module merging and 3.4 Adaptation module sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

  4. Experiments

    4.1 Experimental Setup

    4.2 Key Results

    4.3 Ablation Study

  5. Related Work

  6. Conclusions

  7. Limitations

  8. Acknowledgment and References

Appendix

A. Few-shot NLU Datasets

B. Ablation Study

C. Detailed Results on NLU Tasks

D. Hyper-parameter

Abstract

Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters and storing a large copy of the PLM weights for every task, resulting in increased costs for storing, sharing, and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, in which small trainable components are injected into the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules, given the underlying PEFT method of choice, introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby (Houlsby et al., 2019) or a mixture of low-rank decomposition matrices like LoRA (Hu et al., 2021) to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the computational cost and the number of tunable parameters of the underlying PEFT method. By tuning only 0.1–0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks. Code and models are made available at https://aka.ms/AdaMix.

1 Introduction

Standard fine-tuning of large pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2019) for downstream tasks requires updating all model parameters. Given the ever-increasing size of PLMs (e.g., 175 billion parameters for GPT-3 (Brown et al., 2020) and 530 billion parameters for MT-NLG (Smith et al., 2022)), even the fine-tuning step becomes expensive, as it requires storing a full copy of model weights for every task. To address these challenges, recent works have developed parameter-efficient fine-tuning (PEFT) techniques. These approaches typically underperform standard full model fine-tuning but significantly reduce the number of trainable parameters. PEFT methods come in many varieties, including prefix-tuning (Li and Liang, 2021) and prompt-tuning (Lester et al., 2021), which condition frozen language models via natural language task descriptions; low-dimensional projections using adapters (Houlsby et al., 2019; Pfeiffer et al., 2020, 2021); and, more recently, low-rank approximation (Hu et al., 2021). Figure 1 shows the performance of some popular PEFT methods with varying numbers of tunable parameters. We observe a significant performance gap with respect to full model tuning, where all PLM parameters are updated.

Figure 1: Performance of different parameter-efficient fine-tuning methods on the GLUE development set with a RoBERTa-large encoder, following a setup similar to (Houlsby et al., 2019) for fair comparison. We report the performance of Pfeiffer (Pfeiffer et al., 2021), Houlsby (Houlsby et al., 2019), and LoRA (Hu et al., 2021) with their default number of fine-tuned parameters as well as with the number of fine-tuned parameters used in AdaMix with a mixture of adaptations. The red dash shows the performance of full model fine-tuning.

In this paper, we present AdaMix, a mixture-of-adaptations approach, and show that it outperforms SOTA PEFT methods as well as full model fine-tuning while tuning only 0.1–0.2% of PLM parameters.

In contrast to traditional PEFT methods that use a single adaptation module in every Transformer layer, AdaMix uses several adaptation modules that learn multiple views of the given task. In order to design this mixture of adaptations, we take inspiration from sparsely-activated mixture-of-experts (MoE) models. In traditional dense models (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020)), all model weights are activated for every input example. MoE models induce sparsity by activating only a subset of the model weights for each incoming input.

Consider adapters (Houlsby et al., 2019), one of the most popular PEFT techniques, to illustrate our method. A feedforward layer (FFN) is introduced to down-project the hidden representation to a low dimension d (also called the bottleneck dimension), followed by an up-projection FFN to match the dimensionality of the next layer. Instead of using a single adapter, we introduce multiple project-up and project-down FFNs in each Transformer layer. We route input examples to one of the project-up and one of the project-down FFNs, resulting in the same computational cost (FLOPs) as using a single adapter. For methods like LoRA (Hu et al., 2021), which decompose the gradient of the pre-trained weights into low-rank matrices (A and B), we introduce multiple low-rank decompositions and route the input examples to them in the same way as for adapters.
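To make the routing concrete, below is a minimal PyTorch sketch of an adapter-based mixture block. It is an illustration under our own assumptions, not the authors' released implementation: the class and argument names (MixtureOfAdapters, num_modules, bottleneck_dim) are hypothetical, and inference-time behavior is only stubbed out with a comment, since merging is discussed later.

```python
import random
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Sketch of an AdaMix-style adapter block: M project-down and M project-up
    FFNs; each training forward pass stochastically picks one of each, so the
    FLOPs match a single adapter."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, num_modules: int = 4):
        super().__init__()
        self.down_projs = nn.ModuleList(
            nn.Linear(hidden_dim, bottleneck_dim) for _ in range(num_modules)
        )
        self.up_projs = nn.ModuleList(
            nn.Linear(bottleneck_dim, hidden_dim) for _ in range(num_modules)
        )
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: sample the down- and up-projection independently.
            down = random.choice(self.down_projs)
            up = random.choice(self.up_projs)
        else:
            # At inference, AdaMix merges the modules into one (see the merging
            # sketch further below); here the first module stands in for it.
            down, up = self.down_projs[0], self.up_projs[0]
        # Residual connection around the bottleneck, as in standard adapters.
        return hidden_states + up(self.act(down(hidden_states)))
```

Only these small modules are trained; the surrounding Transformer weights stay frozen, as in the underlying PEFT method.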

We discuss different routing mechanisms and show that stochastic routing yields good performance while eliminating the need to introduce any additional parameters for module selection. To alleviate the training instability that may arise from the randomness of selecting different adaptation modules in different training steps, we leverage consistency regularization and the sharing of adaptation modules during stochastic routing.
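The exact regularizer is given in Section 3.2; as a sketch, assume it takes the form of a symmetric KL term between two stochastically routed forward passes over the same batch. The function names, the model interface, and the weighting factor lam below are all illustrative.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two forward passes of the same batch,
    each routed through different randomly selected adaptation modules."""
    p = F.log_softmax(logits_a, dim=-1)
    q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(p, q, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(q, p, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

def training_step(model, batch, labels, lam: float = 1.0) -> torch.Tensor:
    # Two passes over the same batch; stochastic routing picks different modules.
    logits_a = model(batch)
    logits_b = model(batch)
    task_loss = F.cross_entropy(logits_a, labels)
    return task_loss + lam * consistency_loss(logits_a, logits_b)
```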

The introduction of multiple adaptation modules results in an increased number of adaptation parameters. This does not increase computational cost but does increase storage cost. To address this, we develop a merging mechanism that combines the weights from different adaptation modules into a single module in each Transformer layer. This allows us to keep the number of stored adaptation parameters the same as that of a single adaptation module. Our merging mechanism is inspired by model weight averaging in model soups (Wortsman et al., 2022) and MultiBERTs (Sellam et al., 2022). Weight averaging of models with different random initializations has been shown to improve model performance in recent works (Matena and Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020), which show the optimized models to lie in the same basin of the error landscape. While the above works are geared toward fine-tuning independent models, we extend this idea to parameter-efficient fine-tuning with randomly initialized adaptation modules and a frozen language model.
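A minimal sketch of what merging by weight averaging could look like for one layer's adaptation modules, assuming a simple uniform average; the helper name and the use of deepcopy are our own, and the exact merging scheme is specified in Section 3.3.

```python
import copy
import torch

@torch.no_grad()
def merge_adaptation_modules(modules):
    """Collapse a list of adaptation modules (e.g., the M project-down FFNs of
    one layer) into a single module by averaging their weights, so the stored
    adapter is no larger than a single module."""
    merged = copy.deepcopy(modules[0])
    for name, param in merged.named_parameters():
        stacked = torch.stack([dict(m.named_parameters())[name] for m in modules])
        param.copy_(stacked.mean(dim=0))
    return merged
```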

Overall, our work makes the following contributions:

(a) We develop a new method, AdaMix, a mixture of adaptations for parameter-efficient fine-tuning (PEFT) of large language models. Given any PEFT method of choice, such as adapters or low-rank decompositions, AdaMix improves downstream task performance over the underlying PEFT method.

(b) AdaMix is trained with stochastic routing and adaptation module merging to retain the same computational cost (e.g., FLOPs, number of tunable adaptation parameters) and benefits of the underlying PEFT method. To better understand how AdaMix works, we demonstrate its strong connections to Bayesian Neural Networks and model ensembling.

(c) By tuning only 0.1–0.2% of a pre-trained language model's parameters, AdaMix is the first PEFT method to outperform full model fine-tuning for all NLU tasks on GLUE, and it outperforms other competing methods on NLG and few-shot NLU tasks.

Practical benefits of PEFT methods. The most significant benefit of PEFT methods comes from the reduction in memory and storage usage. For a Transformer, VRAM consumption can be significantly reduced since we do not need to keep track of optimizer states for the frozen parameters. PEFT methods also allow multiple tasks to share the same copy of the full (frozen) PLM. Hence, the storage cost of introducing a new task can be reduced by up to 444x (from 355 MB to 0.8 MB with a RoBERTa-large encoder in our setting).
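The storage figure quoted above follows directly from the per-task checkpoint sizes; the snippet below simply restates that arithmetic, with the RoBERTa-large numbers taken from the text and the variable names our own.

```python
# Per-task storage with and without AdaMix (numbers quoted in the text above).
full_checkpoint_mb = 355.0    # full RoBERTa-large copy stored per task
adamix_checkpoint_mb = 0.8    # only the adaptation weights stored per task
reduction = full_checkpoint_mb / adamix_checkpoint_mb
print(f"Storage reduction per new task: ~{reduction:.0f}x")  # ~444x
```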

We present background on Mixture-of-Experts (MoE) and adapters in Section 2.

2 Background

2.1 Mixture-of-Experts


Figure 2: Mixture-of-Adaptations (AdaMix) with adapters (Houlsby et al., 2019) as the underlying PEFT mechanism. For illustration, we show M = 4 adaptation modules consisting of feedforward-up (FFN_U) and feedforward-down (FFN_D) projection matrices. The block shown for one Transformer layer is repeated across all layers. AdaMix stochastically routes instances from an input batch via randomly selected adaptation modules, resulting in the same FLOPs as a single module, and is trained with consistency regularization and parameter sharing. Adaptation merging (Figure 4) collapses the multiple modules so that each layer stores the parameters of a single module.

Figure 3: Conventional adapter design in the standard Transformer architecture.

2.2 Adapters

The predominant methodology for task adaptation is to tune all of the trainable parameters of the PLM for every task. This raises significant resource challenges during both training and deployment. A recent study (Aghajanyan et al., 2021) shows that PLMs have a low intrinsic dimension that can match the performance of the full parameter space.

To adapt PLMs for downstream tasks with a small number of parameters, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach for lightweight tuning.

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
