Skip to content

Trustworthiness in LLMs

Overview

Trustworthy LLMs are important to build applications in high-stake domains like health and finance. While LLMs like ChatGPT are very capable of producing human readable responses they don't guarantee trustworthy responses across dimensions like truthfulness, safety, and privacy, among others.

Research Study

Sun et al. (2024) recently proposed a comprehensive study of trustworthiness in LLMs, discussing challenges, benchmarks, evaluation, analysis of approaches, and future directions.

Key Challenge

One of the greater challenges of taking current LLMs into production is trustworthiness. Their survey proposes a set of principles for trustworthy LLMs that span 8 dimensions, including a benchmark across 6 dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics).

Benchmark Framework

The authors proposed the following benchmark to evaluate the trustworthiness of LLMs on six aspects:

A Benchmark of Trustworthy Large Language Models

Below are the definitions of the eight identified dimensions of trustworthy LLMs.

Dimensions of Trustworthy LLMs

The framework evaluates LLMs across multiple trustworthiness dimensions to ensure comprehensive assessment.

Research Findings

This work also presents a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Below are the main findings from the evaluation:

Model Performance Comparison

  • Proprietary vs Open-Source: While proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, there are a few open-source models that are closing the gap.

  • Advanced Capabilities: Models like GPT-4 and Llama 2 can reliably reject stereotypical statements and show enhanced resilience to adversarial attacks.

  • Open-Source Performance: Open-source models like Llama 2 perform closely to proprietary ones on trustworthiness without using any type of special moderation tool. It's also stated in the paper that some models, such as Llama 2, are overly calibrated towards trustworthiness which at times compromises their utility on several tasks and mistakenly treats benign prompts as harmful inputs to the model.

Key Insights by Dimension

Over the different trustworthiness dimensions investigated in the paper, here are the reported key insights:

1. Truthfulness

  • LLMs often struggle with truthfulness due to training data noise, misinformation, or outdated information
  • LLMs with access to external knowledge sources show improved performance in truthfulness

2. Safety

  • Open-source LLMs generally lag behind proprietary models in safety aspects like jailbreak, toxicity, and misuse
  • There is a challenge in balancing safety measures without being overly cautious

3. Fairness

  • Most LLMs perform unsatisfactorily in recognizing stereotypes
  • Even advanced models like GPT-4 have only about 65% accuracy in this area

4. Robustness

  • There is significant variability in the robustness of LLMs
  • Performance varies especially in open-ended and out-of-distribution tasks

5. Privacy

  • LLMs are aware of privacy norms, but their understanding and handling of private information vary widely
  • As an example, some models have shown information leakage when tested on the Enron Email Dataset

6. Machine Ethics

  • LLMs demonstrate a basic understanding of moral principles
  • However, they fall short in complex ethical scenarios

Trustworthiness Leaderboard

The authors have also published a leaderboard for comparing LLM trustworthiness. For example, the table below shows how the different models measure on the truthfulness dimension. As mentioned on their website, "More trustworthy LLMs are expected to have a higher value of the metrics with ↑ and a lower value with ↓".

Trustworthiness Leaderboard for LLMs

The leaderboard provides comparative metrics across different trustworthiness dimensions for easy model evaluation.

Implementation

Code Repository

You can also find a GitHub repository with a complete evaluation kit for testing the trustworthiness of LLMs across the different dimensions.

Code: https://github.com/HowieHwong/TrustLLM

References

Image Source / Paper: TrustLLM: Trustworthiness in Large Language Models (10 Jan 2024)

Key Takeaways

  1. Trustworthiness is multi-dimensional and requires comprehensive evaluation
  2. Proprietary models generally outperform open-source alternatives in trustworthiness
  3. Safety and fairness remain challenging areas for most LLMs
  4. Balancing trustworthiness and utility is crucial for practical applications
  5. Open-source models are improving and closing the gap with proprietary solutions