NLPCC2024_RegulatingLLM

[NLPCC 2024] Shared Task 10: Regulating Large Language Models

More information will be available shortly.

Background

The rise of large language models has brought about significant advancements in the field of natural language processing. However, these models can also generate content that is hallucinatory or toxic. In response to these issues, the task of regulating large language models focuses on developing methods to detect and mitigate such undesirable outputs.

Task Overview

This shared task includes two tracks:

Track 1 (Multimodal Hallucination Detection for Multimodal Large Language Models): Develop methods to identify and flag hallucinatory outputs that do not correspond to reality or the given input context when dealing with multimodal prompts (text, images, etc.). This track involves creating detection algorithms that can distinguish accurate from hallucinated responses across different modalities, thereby ensuring the reliability of the model's outputs.

Track 2 (Detoxifying Large Language Models): Design and implement strategies to prevent large language models from generating toxic content. This track focuses on developing filters, fine-tuning techniques, knowledge editing methods, or other mechanisms to recognize and suppress malicious responses before they reach the user. The goal is to maintain the utility and fluency of the model while ensuring that the content it produces adheres to community guidelines and ethical standards.

Dataset and Rules

Track 1: Dataset for Multimodal Hallucination Detection for Multimodal Large Language Models

You can download the datasets via this link.

The expected structure of files is:

data
├── train.json                     # training dataset
├── val.json                       # validation dataset
└── test.json                      # test dataset which we will release in the future

❗️❗️Data Utility Rules: Due to the use of open-source data, we do not provide image data. You need to download MSCOCO-train2014, MSCOCO-val2014, TextVQA-train, and TextVQA-test by yourself. For model training, only the data provided by this link is allowed to be used as supervised data, which includes train.json and val.json. test.json will be used to evaluate the hallucination detection model or pipeline.
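
As a quick sanity check after downloading, the minimal sketch below (assuming each split is a plain JSON list stored under a local data/ directory) loads train.json and val.json and prints their sizes and field names:

```python
import json

# Quick sanity check of the released splits. This assumes each file is a JSON list;
# adjust the loading if the release uses line-delimited JSON instead.
for split in ("train", "val"):
    with open(f"data/{split}.json", encoding="utf-8") as f:
        data = json.load(f)
    print(f"{split}: {len(data)} examples; first example keys: {sorted(data[0].keys())}")
```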

For more information related to this dataset, please refer to our paper: Unified Hallucination Detection for Multimodal Large Language Models.

Track 2: Dataset for Detoxifying Large Language Models

You can download the datasets via this link.

The expected structure of files is:

data
├── SafeEdit_train                     # training dataset
├── SafeEdit_val                       # validation dataset
├── SafeEdit_test_ALL                  # test dataset for Task 10 of NLPCC2024, which can be used to evaluate knowledge editing and traditional detoxification methods
└── data_used_for_analysis
    └── three_instances_for_editing    # three instances for editing vanilla LLM via knowledge editing method

❗️❗️Data Utility Rules: For model training, only the data provided by this link is allowed to be used as supervised data, which includes SafeEdit_train, SafeEdit_val, and three_instances_for_editing. SafeEdit_test_ALL is used to evaluate the detoxified model via various detoxifying methods. SafeEdit_test_ALL and any variations of it cannot be used during the training phase. Note that SafeEdit_test in this link should not be used at any stage of Task 10 of NLPCC 2024.

For more information related to this dataset, please refer to our paper: Detoxifying Large Language Models via Knowledge Editing. If there are any differences between the paper and this page, the content of this page should prevail.

Evaluation

Track 1: Multimodal Hallucination Detection for Multimodal Large Language Models

We recommend using models with fewer hallucinations and better performance, such as LLaVA, DeepSeek-VL, Qwen-VL, etc. The evaluation metrics include two main categories: Rule-based metric and Rationality-based metric.

  • Rule-based metric: use macro-F1 to roughly evaluate the effectiveness of hallucination detection (a minimal sketch follows this list)

  • Rationality-based metric: when the average macro-F1 scores of multiple submissions are similar, we use manual evaluation or GPT-based evaluation to assess the reasonableness of the generated rationale.
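
For reference, the rule-based metric can be computed with scikit-learn as in the minimal sketch below; the label names are illustrative assumptions and this is not the official scoring script:

```python
from sklearn.metrics import f1_score

# Toy example of the rule-based metric. The label set ("hallucinatory" vs.
# "non-hallucinatory") and how gold/predicted labels are read from the data
# are assumptions, not the official evaluation code.
gold = ["non-hallucinatory", "hallucinatory", "hallucinatory", "non-hallucinatory"]
pred = ["non-hallucinatory", "hallucinatory", "non-hallucinatory", "non-hallucinatory"]

print(f"macro-F1: {f1_score(gold, pred, average='macro'):.4f}")
```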

Track 2: Detoxifying Large Language Models

Please select LLaMA2-7B-Chat as the vanilla Large Language Model. Track 2 aims to enhance its security defense against malicious inputs. The evaluation metrics include two main categories: detoxification performance and side effects.

  • Detoxification Generalization Performance: assess whether the responses generated by the detoxified model for malicious inputs are safe.
    • DGonlyQ: the detoxification success rate for unseen harmful questions.
    • DGotherAQ: the detoxification success rate for unseen attack prompts and harmful questions.

❗️❗️ Please set max_new_tokens=600 for the responses generated by the detoxified model for malicious inputs in the test dataset.

  • Side Effects: evaluate the fluency of responses generated by the detoxified model for malicious inputs, as well as the capability of the detoxified model on some general tasks (harmless user queries).
    • Fluency: the fluency of the response for malicious input
    • CommonsenseQA: commonsense question answering task
    • TriviaQA: realistic text-based question answering task
    • Xsum: content summarization task (measured via ROUGE-1)
    • MMLU: massive multitask language understanding
    • GSM8K: math word problem task

❗️❗️ For the evaluation of the metrics DGonlyQ, DGotherAQ, and Fluency, you only need to submit the responses generated by the detoxified model for malicious inputs from SafeEdit_test_ALL. For the other metrics, please use the OpenCompass tool to assess the detoxified model and obtain the corresponding results.
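
For illustration, the sketch below generates responses for the malicious inputs with max_new_tokens=600 using Hugging Face Transformers. The checkpoint name, file path, and field names are assumptions; replace them with your detoxified model and the actual SafeEdit_test_ALL schema and submission format:

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model id, file path, and field names below are illustrative assumptions.
model_name = "meta-llama/Llama-2-7b-chat-hf"  # replace with your detoxified checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

with open("data/SafeEdit_test_ALL.json", encoding="utf-8") as f:
    test_data = json.load(f)

results = []
for item in test_data:
    prompt = item["adversarial prompt"]  # assumed field name for the malicious input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=600, do_sample=False)
    # Strip the prompt tokens and keep only the newly generated response.
    response = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    results.append({"input": prompt, "response": response})

with open("responses.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```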

❗️❗️In terms of usage, LLaMA2-7B-Chat uses gen evaluation for CommonsenseQA, TriviaQA, Xsum, MMLU, and GSM8K (refer to this link). As for the number of shots in few-shot evaluation (refer to this link), CommonsenseQA uses 4-shot and GSM8K uses 2-shot due to the maximum input length of LLaMA2-7B-Chat; other settings use the OpenCompass defaults. We will also soon publish a tutorial on how to evaluate the above tasks using OpenCompass.

Baseline Results

Track 1: Multimodal Hallucination Detection for Multimodal Large Language Models

Note: For the code and details of UniHD and HalDet-LLaVA, please refer to EasyDetect. If you want to fine-tune the model, the minimum GPU memory required is a single 20 GB card (refer to LLaVA-Llama-3-8B (Youth Edition)), and the recommended GPU memory is a single 80 GB card (refer to LLaVA-v1.5).

The claim-level results on the validation dataset:

  • Self-Check (GPT-4V): uses GPT-4V with 0 or 2 in-context cases
  • UniHD (GPT-4V/GPT-4o): uses GPT-4V/GPT-4o with 2-shot prompting and tool information
  • HalDet (LLaVA): LLaVA-v1.5 trained on our training dataset
| Task type | Model | Acc | Prec avg | Recall avg | Mac.F1 |
|---|---|---|---|---|---|
| image-to-text | Self-Check 0shot (GPT-4V) | 75.09 | 74.94 | 75.19 | 74.97 |
| image-to-text | Self-Check 2shot (GPT-4V) | 79.25 | 79.02 | 79.16 | 79.08 |
| image-to-text | HalDet (LLaVA-7b) | 75.02 | 75.05 | 74.18 | 74.38 |
| image-to-text | HalDet (LLaVA-13b) | 78.16 | 78.18 | 77.48 | 77.69 |
| image-to-text | UniHD (GPT-4V) | 81.91 | 81.81 | 81.52 | 81.63 |
| image-to-text | UniHD (GPT-4o) | 86.08 | 85.89 | 86.07 | 85.96 |
| text-to-image | Self-Check 0shot (GPT-4V) | 76.20 | 79.31 | 75.99 | 75.45 |
| text-to-image | Self-Check 2shot (GPT-4V) | 80.76 | 81.16 | 80.69 | 80.67 |
| text-to-image | HalDet (LLaVA-7b) | 67.35 | 69.31 | 67.50 | 66.62 |
| text-to-image | HalDet (LLaVA-13b) | 74.74 | 76.68 | 74.88 | 74.34 |
| text-to-image | UniHD (GPT-4V) | 85.82 | 85.83 | 85.83 | 85.82 |
| text-to-image | UniHD (GPT-4o) | 89.29 | 89.28 | 89.28 | 89.28 |

Track 2: Detoxifying Large Language Models

The detoxification performance on SafeEdit_test_ALL and basic ability on some general tasks.

  • SFT: fine-tune the entire model
  • DPO: adopt direct preference optimization
  • DINM: detoxify via model editing using only one instance

We will soon release code for the above methods and offer some promising strategies and suggestions for this track; a minimal, unofficial DPO sketch is also provided below the table. If necessary, you can access these resources via this link.

| Method | Avg | DGonlyQ | DGotherAQ | Fluency | CommonsenseQA | TriviaQA | Xsum | MMLU | GSM8K |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 40.98 | 84.44 | 47.41 | 6.16 | 46.93 | 55.15 | 22.29 | 38.23 | 27.22 |
| SFT | 45.96 | 91.85 | 70.74 | 3.27 | 54.63 | 54.63 | 24.05 | 41.78 | 26.69 |
| DPO | 46.31 | 91.11 | 77.28 | 3.59 | 54.05 | 50.14 | 24.09 | 42.35 | 27.90 |
| DINM | 47.23 | 93.33 | 86.05 | 5.87 | 48.89 | 53.37 | 20.22 | 43.58 | 26.54 |

❗️❗️If conducting experiments using an A800 GPU, calculating the MMLU metric takes around 12 hours, while each of the other metrics only takes about 4 hours.
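
Until the official code is released, the sketch below illustrates what a DPO-style baseline on SafeEdit_train could look like with the trl library. It is a minimal, unofficial sketch: the data path, field mapping, and hyperparameters are assumptions, and it does not reproduce the numbers reported above.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer  # API details vary across trl versions

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPOTrainer expects preference pairs with "prompt", "chosen", and "rejected" columns.
# SafeEdit_train must be mapped into that shape first; the path and source field
# names below are assumptions, not the official data format.
raw = load_dataset("json", data_files="data/SafeEdit_train.json", split="train")
pairs = raw.map(
    lambda ex: {
        "prompt": ex["question"],           # assumed field name
        "chosen": ex["safe_response"],      # assumed field name
        "rejected": ex["unsafe_response"],  # assumed field name
    },
    remove_columns=raw.column_names,
)

args = DPOConfig(
    output_dir="llama2-7b-chat-dpo-detox",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    beta=0.1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    tokenizer=tokenizer,  # renamed to processing_class in newer trl releases
)
trainer.train()
```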

Guidelines for Participants

Track 2: Detoxifying Large Language Models

The optimization strategy for Track 2 can include the following approaches:

  • Self-improvement: aim to modify the parameters of the vanilla LLaMA2-7B-Chat to enhance its security, e.g., SFT, DPO, RLHF, knowledge editing, SimPO.
  • Input toxicity detection: filter out malicious attacks from users at the input stage, e.g., using toxicity classifiers to detect whether a user's input is toxic and rejecting the query if it is (a minimal sketch follows this list).
  • Prompt: leverage prompts (including RAG) to enhance the toxicity defense capability of the vanilla LLaMA2-7B-Chat.
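
As an illustration of the input toxicity detection strategy above, the sketch below gates generation behind an open-source toxicity classifier. The checkpoint, threshold, and helper function are assumptions for demonstration only; any open-source classifier is allowed, and its label names may differ:

```python
from transformers import pipeline

# Illustrative input-stage filter. The checkpoint "unitary/toxic-bert" and the 0.5
# threshold are assumptions, not requirements of the task.
toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert")


def guarded_generate(user_input: str, generate_fn,
                     refusal: str = "I'm sorry, but I can't help with that request."):
    """Reject the query at the input stage if it is classified as toxic;
    otherwise pass it to the (detoxified) LLaMA2-7B-Chat generation function."""
    result = toxicity_clf(user_input, truncation=True)[0]
    if result["label"].lower() == "toxic" and result["score"] > 0.5:
        return refusal
    return generate_fn(user_input)
```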

❗️❗️Toxicity detection at the output stage is not allowed in the competition (for example, using toxicity classifiers to detect whether the output is toxic and rewriting the original response if it is).

❗️❗️In the competition, the use of other open-source models for input filtering and detection is permitted; however, the use of closed-source models and additional data is strictly prohibited.

❗️❗️For model training, only the data provided by this link is allowed to be used as supervised data, which includes SafeEdit_train, SafeEdit_val, and three_instances_for_editing. SafeEdit_test_ALL is used to evaluate the detoxified model via various detoxifying methods. SafeEdit_test_ALL and any variations of it cannot be used during the training phase. Note that SafeEdit_test in this link should not be used at any stage of Task 10 of NLPCC 2024.

We provide baseline code for Track 2; you can access it in the NLPCC section via this link.

Submission

Note that the best results of this track will be verified using code provided by the participants. If there is a significant gap between the results on the leaderboard and those verified by us, the next participant in line will be promoted to the top position, and so on.

Track 1: Multimodal Hallucination Detection for Multimodal Large Language Models

Submissions are made via CodaBench.

The submission steps are as follows:

  • Register a CodaBench account
  • Search for the competition: NLPCC2024 TASK 10 - TRACK 1
  • Upload your submission. Only upload the zipped model results file; for the specific format, refer to res.zip

Note: At present, please only submit results on the validation set for the testing phase; each participant has 100 submission opportunities. Formal submissions will begin once the test set is released.

Track 2: Detoxifying Large Language Models

Submissions are made via CodaBench.

The submission steps are as follows:

  • Register a CodaBench account
  • Search for the competition: NLPCC2024 TASK 10 - TRACK 2
  • Upload your submission. Only upload the zipped model results file; for the specific format, refer to res.zip
  • Details can be found in the README

Participation

If you're intrigued by our challenge, please fill out the Registration Form (Word File) and send it to the following registration email.

Registration Email: mengruwg@zju.edu.cn

We have also created a discussion group for this task. You can join it by scanning the QR code below with WeChat.

Important Dates

  • 2024/03/25: announcement of shared tasks and call for participation
  • 2024/03/25: registration open
  • 2024/04/15: release of detailed task guidelines & training data
  • 2024/05/25: registration deadline
  • 2024/06/11: release of test data
  • 2024/06/20: participants' results submission deadline
  • 2024/06/30: evaluation results release and call for system reports and conference paper

Leaderboard

Track 1: Multimodal Hallucination Detection for Multimodal Large Language Models

More information will be available shortly.

Track 2: Detoxifying Large Language Models

More information will be available shortly.

📖 Citation

Please cite our papers if you use our datasets.

@article{wang2024SafeEdit,
  author       = {Mengru Wang and
                  Ningyu Zhang and
                  Ziwen Xu and
                  Zekun Xi and
                  Shumin Deng and
                  Yunzhi Yao and
                  Qishen Zhang and
                  Linyi Yang and
                  Jindong Wang and
                  Huajun Chen},
  title        = {Detoxifying Large Language Models via Knowledge Editing},
  journal      = {CoRR},
  volume       = {abs/2403.14472},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2403.14472},
  doi          = {10.48550/ARXIV.2403.14472}
}

@article{chen24unihd,
  author       = {Xiang Chen and
                  Chenxi Wang and
                  Yida Xue and
                  Ningyu Zhang and
                  Xiaoyan Yang and 
                  Qiang Li and
                  Yue Shen and
                  Lei Liang and
                  Jinjie Gu and
                  Huajun Chen},
  title        = {Unified Hallucination Detection for Multimodal Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2402.03190},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2402.03190},
  doi          = {10.48550/ARXIV.2402.03190}
}

Supporting Organization

OpenKG

Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph

If you have any questions about this task, please email mengruwg@zju.edu.cn or xiang_chen@zju.edu.cn.