Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

1The University of Texas at Austin, 2Drexel University, 3MIT, 4University of Illinois Urbana-Champaign, 5Duke University, 6Lawrence Livermore National Laboratory, 7Center for AI Safety, 8University of California, Berkeley, 9University of Chicago
Presented at ICML 2024
*Indicates Equal Contribution

We assess the trustworthiness of compressed LLMs across 3 leading base models, 5 SoTA compression methods, and 8 trust dimensions.

Overview

Compressing high-capability Large Language Models (LLMs) has emerged as a favored strategy for resource-efficient inference. While state-of-the-art (SoTA) compression methods boast impressive advancements in preserving benign task performance, the potential risks of compression in terms of safety and trustworthiness have been largely neglected. This study conducts the first thorough evaluation of three (3) leading LLMs using five (5) SoTA compression techniques across eight (8) trustworthiness dimensions, encompassing a range of compression rates.

Our experiments highlight the intricate interplay between compression and trustworthiness, revealing some interesting patterns.

  • We find that quantization is a more effective approach than pruning in achieving efficiency and trustworthiness simultaneously. For instance, a 4-bit quantized model retains the trustworthiness of its original counterpart, but model pruning significantly degrades trustworthiness, even at 50% sparsity.
  • Moreover, employing quantization within a moderate bit range could unexpectedly improve certain trustworthiness dimensions such as ethics and fairness.
  • Conversely, extreme quantization to very low bit levels (3 bits) tends to significantly reduce trustworthiness. This increased risk cannot be uncovered by looking at benign performance alone, which in turn mandates comprehensive trustworthiness evaluation in practice.
  • These findings culminate in practical recommendations for simultaneously achieving high utility, efficiency, and trustworthiness in LLMs.

Benchmark

Understanding the trustworthiness of compressed models requires a comprehensive evaluation to gain insights. In this paper, we are interested in three questions: (1) What is the recommended compression method in the joint view of multi-dimensional trustworthiness and standard performance? (2) What is the optimal compression rate for trading off trustworthiness and efficiency? (3) At extreme compression rates (e.g., 3-bit quantization), how do the compressed models perform on our metrics?

Type     Method     Compression Rate   Weight Update   Calibration
Pruning  Magnitude  50% (2:4)          ✗               weight
Pruning  SparseGPT  50% (2:4)          ✓               weight w/ 128 samples
Pruning  Wanda      50% (2:4)          ✗               weight & act. w/ 128 samples
Quant.   GPTQ       3,4,8-bit          ✓               weight w/ 128 samples
Quant.   AWQ        3,4,8-bit          ✓               act. w/ 128 samples
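To make the 50% (2:4) sparsity pattern in the table concrete, below is a minimal sketch of the simplest criterion, magnitude pruning, applied to a single weight matrix: within every group of four weights along the input dimension, the two smallest-magnitude entries are zeroed. This is an illustrative sketch under simplified assumptions, not the code used in our experiments; SparseGPT and Wanda additionally rely on calibration data, as noted in the table.

```python
import torch

def magnitude_prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """2:4 magnitude pruning: zero the 2 smallest-magnitude weights in every
    group of 4 along the input dimension (50% structured sparsity)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the input dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the two smallest-magnitude entries in each group of four.
    drop_idx = groups.abs().topk(k=2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

# Example: prune one linear layer's weight in place.
layer = torch.nn.Linear(4096, 4096, bias=False)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune_2_4(layer.weight))
```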

To answer the above questions, we curate a set of state-of-the-art training-free pruning and quantization methods that can lead to true acceleration on hardware. We compress popular LLMs (LLAMA2 Chat, LLAMA2, and Vicuna) with 13 billion parameters (13b), using model checkpoints published on Hugging Face. Code for preparing the models is available at the decoding-comp-trust/comp-trust repo. Our assessment is based on two benchmarks: MMLU (for benign performance) and DecodingTrust, covering 8 trustworthiness dimensions. Our code checkpoint of DecodingTrust, with some customization, is available here.
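For illustration, the snippet below sketches one way to prepare a 4-bit GPTQ model through the Hugging Face transformers integration (which requires the optimum and auto-gptq packages). It is a minimal sketch, not the exact script in our repo: the checkpoint id, C4 calibration set, and group size of 128 are common defaults assumed here for the example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ calibrated on C4; group_size=128 is a common default.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, group_size=128)

# Quantization runs while the model is loaded.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)

model.save_pretrained("llama2-13b-chat-gptq-4bit")
tokenizer.save_pretrained("llama2-13b-chat-gptq-4bit")
```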

Leaderboard

A leaderboard of compressed and dense models is provided below. We include 8 trust dimensions: Stereotype, Privacy, Toxicity, Fairness, Adversarial Robustness (AdvGLUE++), Out-Of-Distribution (OOD) Robustness, Robustness to Adversarial Demonstrations (AdvDemo), and Ethics. For a full comparison with other dense models, please refer to the DecodingTrust Leaderboard. Note that results on the standard DecodingTrust Leaderboard will differ from the results here, which are based on our code checkpoint.

Revisiting Paths to 7B-sized LLMs: Training Smaller, or Compressing Larger?

Scaling up the parameters of an LLM is widely believed to be a general strategy for enhancing various generation abilities, including reasoning, math, and language understanding. Such findings encourage training ever-larger models (Kaplan et al., 2020). However, serving models on consumer-grade GPUs demands more efficient, and often smaller, models. As a popular choice for deployment, 7b LLMs fit on many consumer-grade GPUs.

7b results

The above figure presents the relative score difference w.r.t. the 13b models. Every model is compressed at a 50% rate, which yields a model size comparable to a 7b model. Darker blue/red colors indicate larger improvements/drops w.r.t. the 13b dense models. Gray dots/lines per cell indicate significantly lower/higher refusal rates (over 10%), which bias the measured opinions/knowledge of a model. Quantization appears to be the most effective solution, with minimal loss in both trustworthiness and benign performance. A minimal sketch of how these relative scores are computed follows the takeaways below.

The main takeaways are:

  • 7b models outperform their 13b counterparts in 3-4 trust dimensions by over 5 points, among which Fairness is consistently favored for all models.
  • Quantizing 13b models into 8-bit precision (7b-sized) incurs negligible drops (less than 3 points) across all metrics.
  • Pruning suffers from serious losses of over 5 points on at least three dimensions. Except for MMLU and OOD, results on most dimensions differ across models.
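The heatmap values above are simply differences between a compressed (or 7b) model's scores and those of the 13b dense baseline, with cells additionally flagged when refusal rates diverge by more than 10%. The sketch below reproduces this bookkeeping on hypothetical numbers; the scores, dimensions, and refusal values are placeholders rather than our actual results.

```python
# Hypothetical scores (0-100) and refusal rates (0-1) per trust dimension.
dense_13b = {"Fairness": {"score": 55.0, "refusal": 0.02},
             "Toxicity": {"score": 80.0, "refusal": 0.05}}
compressed = {"Fairness": {"score": 62.0, "refusal": 0.03},
              "Toxicity": {"score": 71.0, "refusal": 0.20}}

for dim, base in dense_13b.items():
    comp = compressed[dim]
    delta = comp["score"] - base["score"]                     # heatmap cell value
    flagged = abs(comp["refusal"] - base["refusal"]) > 0.10   # gray dot/line marker
    print(f"{dim:10s} delta={delta:+5.1f}  refusal-flag={flagged}")
```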

From Moderate to High Compression Rates: The (Unexpected) Gains and Losses


barplot_quant_LLAMA2_13b_Chat

The above figure shows that compressing LLAMA2 13b Chat into the low-bit region (below 8 bits, as represented on the x-axis) makes the model less consistent with the dense model, though the effect can be positive along some dimensions. Black/red lines indicate the performance of the 13b and 7b dense models, respectively. Standard deviations are reported with fewer bits. Grey areas indicate drops of over 5 points. Dashed lines represent +/- 5 points w.r.t. the scores of the 13b model. A minimal sketch of the corresponding bit sweep follows the takeaways. The main takeaways are:

  • The optimal compression rate for trustworthiness is 4-bit for LLAMA2 Chat 13b with less than 5 points loss on all dimensions.
  • 4-bit quantization brings joint enhancement of efficiency and trustworthiness (fairness and ethics) for LLAMA2 Chat.
  • At 3-bit precision, although AWQ shows good benign performance (MMLU), both AWQ and GPTQ significantly increase trustworthiness risks across multiple dimensions, with GPTQ degrading by over 50 points in the worst case.
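For completeness, the low-bit models discussed above can be produced by sweeping the bit width in the same GPTQ setup sketched in the Benchmark section. Again, this is a minimal sketch rather than our exact pipeline, and it assumes the installed auto-gptq backend supports the requested bit widths (including 3-bit).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)

for bits in (8, 4, 3):  # moderate to extreme compression rates
    cfg = GPTQConfig(bits=bits, dataset="c4", tokenizer=tokenizer)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", quantization_config=cfg
    )
    model.save_pretrained(f"llama2-13b-chat-gptq-{bits}bit")
```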

Bag of Tricks for Trustworthy Compression

We summarize the guidance for compressing a trustworthy LLM as follows.

  1. In terms of efficiency, both quantization and pruning can work, but quantization is more reliable for obtaining LLMs with similar trustworthiness as the source model at the same compression rate.
  2. Choose a trustworthy dense model to start with, since quantized models will achieve similar trustworthiness as the dense source model.
  3. If the model weights (or pruning choices) are calibrated with a random set of data, the heavily compressed model should be fully evaluated to avoid potential risks before deployment.

Conclusion

In conclusion, this study offers novel insights into the trustworthiness of compressed Large Language Models (LLMs), highlighting the complex interplay between model efficiency and various dimensions of trustworthiness. Our comprehensive evaluation of state-of-the-art compression techniques unveils the unique impact of model compression on trustworthiness facets, emphasizing the potential of quantization to enhance specific dimensions at minimal cost. These findings provide a nuanced understanding of the trade-offs between efficiency and trustworthiness in LLM compression. We envision our findings will pave the way for the development of efficient yet trustworthy AI language models.

BibTeX

@article{hong2024comptrust,
    title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
    author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng 
      and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James 
      and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya 
      and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
    journal={arXiv:2403.15447},
    year={2024}
}