Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inference. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential risks of compression in terms of safety and trustworthiness have been largely neglected. This study conducts the first thorough evaluation of three (3) leading LLMs using five (5) SoTA compression techniques across eight (8) trustworthiness dimensions, encompassing a range of compression rates.
Our experiments highlight the intricate interplay between compression and trustworthiness, revealing some interesting patterns.
Understanding the trustworthiness of compressed models requires a comprehensive evaluation. In this paper, we are interested in three questions: (1) What is the recommended compression method in the joint view of multi-dimensional trustworthiness and standard performance? (2) What is the optimal compression rate for trading off trustworthiness and efficiency? (3) At extreme compression rates (e.g., 3-bit quantization), how do compressed models perform according to our metrics?
Type | Method | Compression Rate | Weight Update | Calibration |
---|---|---|---|---|
Pruning | Magnitude | 50% (2:4) | ✗ | weight |
Pruning | SparseGPT | 50% (2:4) | ✓ | weight w/ 128 samples |
Pruning | Wanda | 50% (2:4) | ✗ | weight & act. w/ 128 samples |
Quant. | GPTQ | 3,4,8-bit | ✓ | weight w/ 128 samples |
Quant. | AWQ | 3,4,8-bit | ✗ | act. w/ 128 samples |
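To make the 2:4 sparsity pattern in the table concrete, below is a minimal PyTorch sketch of magnitude pruning under that pattern (keep the 2 largest-magnitude weights in every contiguous group of 4, i.e., 50% structured sparsity). It is purely illustrative and not the exact implementation used in our experiments.

```python
import torch

def magnitude_prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4 (50% structured sparsity)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the input dimension to be a multiple of 4"
    w = weight.reshape(out_features, in_features // 4, 4)
    # Keep the 2 largest-magnitude entries within each group of 4.
    keep_idx = w.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(w)
    mask.scatter_(-1, keep_idx, 1.0)
    return (w * mask).reshape(out_features, in_features)

# Example: prune one linear layer's weight matrix.
layer = torch.nn.Linear(4096, 4096, bias=False)
layer.weight.data = magnitude_prune_2_4(layer.weight.data)
print((layer.weight == 0).float().mean())  # ~0.5 sparsity
```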
To answer the above questions, we curate a set of state-of-the-art training-free pruning and quantization methods that can lead to true acceleration on hardware. We compress popular LLMs (LLAMA2 Chat, LLAMA2, and Vicuna) with 13 billion parameters (13b), with model checkpoints published on Hugging Face. Code for preparing the models is available in the decoding-comp-trust/comp-trust repo. Our assessment is based on two benchmarks: MMLU (for benign performance) and DecodingTrust, covering 8 trustworthiness dimensions. The code checkpoint of DecodingTrust with some customization is available here.
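For illustration, a single 4-bit GPTQ run with 128 calibration samples could look like the sketch below. This is not our exact script (see the decoding-comp-trust/comp-trust repo for that); it assumes the AutoGPTQ library's quickstart-style API and a user-provided list of calibration texts `calib_texts`.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig  # assumed: AutoGPTQ library

base = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)

# Weight-only 4-bit quantization; group_size=128 is a common default, not prescribed here.
quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(base, quant_config)

# calib_texts: 128 text snippets (e.g., sampled from a pretraining corpus); assumed to exist.
examples = [tokenizer(t, return_tensors="pt") for t in calib_texts[:128]]
model.quantize(examples)                      # runs GPTQ layer by layer on the calibration data
model.save_quantized("llama-2-13b-chat-gptq-4bit")
```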
A leaderboard consisting of compressed and dense models is as follows. We include 8 trust dimensions: Stereotype, Privacy, Toxicity, Fairness, Adversarial Robustness (AdvGLUE++), Out-Of-Distribution (OOD) Robustness, Robustness to Adversarial Demonstrations (AdvDemo), and Ethics. For a full comparison with other dense models, please refer to the DecodingTrust Leaderboard. Note that the results on the standard DecodingTrust Leaderboard will differ from the results here, which are produced with our code checkpoint.
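As a purely illustrative example of how such leaderboard rows can be aggregated (the file name, column names, and unweighted averaging are our own assumptions; the actual per-dimension scores come from our DecodingTrust runs), one might average the 8 dimensions per model:

```python
import pandas as pd

# Hypothetical CSV with one row per (model, method, rate) and one column per trust dimension (0-100).
dims = ["Stereotype", "Privacy", "Toxicity", "Fairness",
        "AdvGLUE++", "OOD", "AdvDemo", "Ethics"]
df = pd.read_csv("trust_scores.csv")
df["Trust Avg"] = df[dims].mean(axis=1)   # simple unweighted average across the 8 dimensions
leaderboard = df.sort_values("Trust Avg", ascending=False)
print(leaderboard[["model", "method", "rate", "Trust Avg"]].to_string(index=False))
```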
Scaling up the parameters of an LLM is believed to be a general strategy for enhancing various generation abilities, including reasoning, math, language understanding, etc. Existing findings in support of scaling encourage people to train larger and larger models (Kaplan et al., 2020). Serving models on consumer-grade GPUs, however, demands more efficient and often smaller models. As a popular choice for deployment, 7b LLMs are sized to fit on numerous consumer-grade GPUs.
The above figure presents the relative score difference w.r.t. 13b models. Every model is compressed at a 50% rate, which leads to a model size similar to that of the 7b model. Darker blue/red colors indicate larger improvements/drops w.r.t. the 13b dense models. Gray dots/lines per cell indicate significantly lower/higher refusal rates (over 10%), which bias the measurement of a model's actual opinions/knowledge. Quantization appears to be the most effective solution, with minimal loss in both trustworthiness and benign performance.
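The heatmap values in this figure are simple signed differences against the dense baseline; a hedged numpy sketch of that computation follows (array names and shapes are our own assumptions, not taken from the evaluation code).

```python
import numpy as np

def relative_difference(compressed: np.ndarray, dense_13b: np.ndarray) -> np.ndarray:
    """Signed score change vs. the 13b dense baseline (positive = improvement, negative = drop).

    compressed: (n_models, n_dimensions) scores of the 50%-compressed models.
    dense_13b:  (n_dimensions,) scores of the corresponding 13b dense model.
    """
    return compressed - dense_13b[None, :]

def high_refusal_mask(refusal: np.ndarray, baseline_refusal: np.ndarray,
                      threshold: float = 10.0) -> np.ndarray:
    """Flag cells whose refusal rate deviates from the dense baseline by more than 10 points."""
    return np.abs(refusal - baseline_refusal[None, :]) > threshold
```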
The main takeaways are:
The above figure shows that compressing LLAMA2 13b Chat into the low-bit region (below 8 bits, as shown on the x-axis) yields models that are less consistent with the dense model, although the effect can be positive along some dimensions. Black/red lines indicate the performance of the 13b and 7b dense models, respectively. Standard deviations are reported for the lower-bit settings. Gray areas indicate drops of over 5 points. Dashed lines represent +/- 5 points w.r.t. the scores of the 13b model. The main takeaways are:
We summarize the guidance for compressing a trustworthy LLM as follows.
In conclusion, this study offers novel insights into the trustworthiness of compressed Large Language Models (LLMs), highlighting the complex interplay between model efficiency and various dimensions of trustworthiness. Our comprehensive evaluation of state-of-the-art compression techniques unveils the unique impact of model compression on trustworthiness facets, emphasizing the potential of quantization to enhance specific dimensions at a minimal cost. These findings provide a nuanced understanding of the trade-offs between efficiency and trustworthiness involved in LLM compression. We envision our findings will pave the way for the development of efficient yet trustworthy AI language models.
@article{hong2024comptrust,
title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
journal={arXiv:2403.15447},
year={2024}
}