SigmaWe/EvalBase

An evaluation framework for text-to-text generation tasks

EvalBase allows easy testing of new text-to-text generation metrics. A computer program or machine learning model that can generate text from text is called a system. Such text-to-text generation tasks include summarization, translation, and question answering; correspondingly, the systems for these tasks are called summarizers, translators, and question answerers.

There are usually two approaches to evaluating text-to-text generation: reference-based and reference-free. For each sample in a test set, a reference-based approach first has a human perform the same task to produce a reference and then compares the machine/system-generated output against that reference, while a reference-free approach directly scores the generated text with respect to the input text.

Usage

For each dataset that EvalBase supports, a main() function is provided. To use EvalBase, simply add EvalBase to your $PYTHONPATH environment variable, then call the main() function with the proper argument, which is a dictionary of configurations. See run_exp.py for the available configurations and how to call the main() function; a minimal sketch also follows below. For how to add new metrics, see the section on adding new metrics below.
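
The sketch below illustrates this calling convention, assuming the SummEval entry point; except for NLG_metrics and result_path_root, which are mentioned elsewhere in this README, the configuration keys and values shown are illustrative assumptions, so consult run_exp.py for the authoritative configuration.

# Minimal sketch (not the definitive configuration): calling a dataset's main()
# with a configuration dictionary. Assumes EvalBase is on $PYTHONPATH.
import evaluate
import summeval

_bertscore = evaluate.load("bertscore")

config = {
    # Metric name (str) -> metric function; see "Adding your new metrics" below.
    "NLG_metrics": {
        "bertscore": lambda outputs, refs: _bertscore.compute(
            predictions=outputs, references=refs, lang="en"
        ),
    },
    # Where result files are written (defaults to the results folder).
    "result_path_root": "./results/",
}

summeval.main(config)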

Getting and preparing dataset files

All dataset files and their processing scripts must be under the dataloader folder.

  • SummEval: Run summeval_build.sh
  • Realsumm: Run realsumm_build.sh
  • Newsroom:
    • Download newsroom-human-eval.csv, the human evaluation results, which include documents and system summaries but no reference summaries:
      wget https://github.com/lil-lab/newsroom/raw/master/humaneval/newsroom-human-eval.csv
    • Get test.jsonl, the test split of Newsroom, which contains the reference summaries. There is no automatic script: you will have to fill out a web form here and then follow the link sent to your email to download it. test.jsonl is in the downloaded tarball.
  • TAC: We assume that you have fully, recursively extracted the two files below (a helper sketch follows this list).
    • GuidedSumm2010_eval.tgz: downloadable from the web; contains the human evaluation results and system summaries.
    • TAC2010_Summarization_Documents.tgz: emailed by NIST; contains the documents for which summaries are generated and rated. Both files require you to apply to NIST for access.
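
Because the TAC archives may contain further tarballs inside them, a small helper like the following sketch can handle the recursive extraction; it is only a convenience script, written under the assumption that the nested files are ordinary .tgz/.tar.gz archives, and is not part of EvalBase itself.

# Sketch: recursively extract a tarball and any tarballs found inside it.
# Assumes the NIST archives sit in the current directory; adjust paths as needed.
import pathlib
import tarfile

def extract_recursive(archive: pathlib.Path, dest: pathlib.Path) -> None:
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    # Unpack any tarballs that were inside the archive we just extracted.
    for nested in list(dest.rglob("*.tgz")) + list(dest.rglob("*.tar.gz")):
        extract_recursive(nested, nested.with_suffix(""))

for name in ("GuidedSumm2010_eval.tgz", "TAC2010_Summarization_Documents.tgz"):
    extract_recursive(pathlib.Path(name), pathlib.Path(name).with_suffix(""))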

Running the evaluations

Once you have the files above, you can run the evaluations:

python3 run_exp.py

Where are the results?

By default, they are all under the results folder. Alternatively, you can modify result_path_root in the configuration dictionary passed to the main() function for a dataset.

Adding your new metrics into the test

Just add your metric as an entry in the NLG_metrics dictionary (keys are metric names as strings; values are metric functions -- see below) inside the configuration dictionary.

Each metric function must follow the I/O convention of the BERTScore function in HuggingFace's evaluate library (a minimal sketch follows the list below):

  • Two required positional arguments:
    • the output texts produced by a text-to-text generation system, as a List[str]
    • the input texts (in ref-free mode) or the reference texts (in ref-based mode), as a List[str].
  • Return must be a dictionary of type dict[str, List[float]], e.g., {'precision': [0.5, 0.6], 'recall': [0.7, 0.8], 'f1': [0.9, 0.75]}.
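
For illustration, a metric function wrapping BERTScore could look like the sketch below; only the argument order and the return type are prescribed by EvalBase, and the names used here (my_bertscore) are assumptions for the example.

# Sketch of a metric function with the required interface:
# (system_outputs, inputs_or_references) -> dict[str, List[float]].
from typing import Dict, List
import evaluate

_bertscore = evaluate.load("bertscore")

def my_bertscore(outputs: List[str], references: List[str]) -> Dict[str, List[float]]:
    scores = _bertscore.compute(predictions=outputs, references=references, lang="en")
    # Keep only the per-sample score lists; drop metadata such as the hashcode.
    return {key: scores[key] for key in ("precision", "recall", "f1")}

# Register it under a string name in the NLG_metrics dictionary of the configuration:
NLG_metrics = {"my_bertscore": my_bertscore}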

File structures/functions

  • eval_utils.py: scoring summaries using automated metrics and computing their correlations with human scores
    • eval_summary_level: the main function that does summary-level evaluation. It loads a dataset_df (see specifications below).
    • eval_system_level: To be finished by those taking CS 579X
  • Dataloader and evaluation scripts for each dataset:
    • newsroom.py: Run experiments on Newsroom dataset
    • realsumm.py: Run experiments on realsumm dataset
    • summeval.py: Run experiments on SummEval dataset
    • tac2010.py: Run experiments on tac2010 dataset

Key pandas DataFrames in eval.py

We use the same local variable names for key DataFrames across functions in eval.py for consistency. Please see the docstrings/comments inside eval.py for more details on these variables.

  1. dataset_df: A summary human evaluation dataset represented as a pandas.DataFrame (a toy example follows this list). The following columns must be present:
  • ArticleText -- the source text
  • System -- the name of the system/summarizer
  • SystemSummary -- a system summary generated by the System in the same row
  • ReferenceSummary -- the corresponding reference summary provided by the dataset
  • Various columns of human evaluation aspects, as defined in env.human_metrics
  2. batch_result_df: Each row corresponds to a sample, and each column corresponds to a MultiIndex (approach, model, score_name). A score_name is a variant of a model; for example, ROUGE-1 is a score_name for the model ROUGE.
  3. corr_df: The correlation coefficient DataFrame.
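
For orientation, a toy dataset_df with the required columns could be constructed as follows; the coherence column is only an example stand-in for whatever human-aspect columns env.human_metrics defines.

# Toy illustration of the dataset_df layout described above.
import pandas as pd

dataset_df = pd.DataFrame(
    {
        "ArticleText": ["The quick brown fox jumps over the lazy dog."],
        "System": ["lead-3"],
        "SystemSummary": ["A fox jumps over a dog."],
        "ReferenceSummary": ["A quick fox jumps over a lazy dog."],
        "coherence": [4.0],  # example human evaluation aspect
    }
)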

About

base code for summary evaluation experiments
