ashleyzhou972/CAFA_assessment_tool

Precision-Recall Assessment of Protein Function Prediction

Introduction

The Critical Assessment of Function Annotation (CAFA) is a community-wide challenge designed to provide a large-scale assessment of computational methods for predicting protein function.

More information can be found at http://biofunctionprediction.org/cafa/ and in the CAFA2 paper (Jiang et al., 2016).

This toolset provides an assessment for CAFA submissions based on precision and recall.

For bug reports, comments or questions, please email nzhou[AT]iastate.edu.

Dependencies

$ sudo apt install python-biopython python-yaml python-matplotlib python-seaborn

Main Functions

We provide two main functions to assist in the evaluation of GO-term prediction within the scope of CAFA: the main assessment function and the plot function.

  • assess_main.py
    • The only input needed is the configuration file config.yaml; the following four parameters are specified in its first section, assess.
    • First parameter file: the prediction file, formatted according to the CAFA3 format.
    • Second parameter obo: path to the Gene Ontology obo file. The latest version can be downloaded here. Note that the obo file used here should not be older than the one used for the prediction. The obo files used in both CAFA2 and CAFA3 are provided in the ./precrec/ folder.
    • Third parameter benchmark: path to the benchmark folder. The folder must follow a specific layout with two sub-directories, groundtruth and lists. Please refer to the auxiliary function benchmark_folder.py for creating this folder, as well as for the general creation of benchmarks. Benchmarks from CAFA2 and CAFA3 are provided in this repository under ./precrec/benchmark.
    • Fourth parameter results: folder where the results are saved. A pr_rc folder will be created within the results folder.
    • Note that only the first section, assess, of the configuration file is used here; the rest of the configuration file can be ignored for this function.
  • plot.py
    • The only input needed is the configuration file config.yaml; the following parameters are specified in its second section, plot.
    • First parameter results: the results folder written by assess_main.py.
    • Second parameter title: title of the plot. Optional.
    • Third parameter smooth: whether the precision-recall curves should be smoothed. Enter 'Y' or 'N'.
    • Fourth parameter(s) fileN: name of a result file to be plotted. Up to 12 files can be added; they are all drawn on the same plot.
    • Example: if the prediction file is ZZZ_1_9606.txt, the result file in the results folder will be ZZZ_1_9606_results.txt. Enter only ZZZ_1_9606 in the fileN parameter for plotting.
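
As a rough sketch of how the two sections fit together, the Python snippet below writes the configuration out as a dictionary and checks the constraints listed above. The key names follow the parameter names given above (file, obo, benchmark, results, title, smooth, fileN), but the exact nesting of config.yaml, every path except ./precrec/benchmark, and the helper names plot_name, result_file_name and check_plot_section are assumptions made for illustration. The config.yaml shipped with the repository remains the authoritative layout.

```python
import os

# Hypothetical layout mirroring the two sections described above.
# Paths are placeholders, not files guaranteed to exist in the repository.
config = {
    "assess": {
        "file": "./predictions/ZZZ_1_9606.txt",
        "obo": "./path/to/gene_ontology.obo",
        "benchmark": "./precrec/benchmark",
        "results": "./results",
    },
    "plot": {
        "results": "./results",
        "title": "Example plot",
        "smooth": "N",
        "file1": "ZZZ_1_9606",
    },
}

MAX_PLOT_FILES = 12  # limit stated for the fileN parameters


def plot_name(prediction_file):
    """Name to enter as fileN: the file name without directory or extension."""
    return os.path.splitext(os.path.basename(prediction_file))[0]


def result_file_name(prediction_file):
    """Result file name expected in the results folder for a prediction file."""
    return plot_name(prediction_file) + "_results.txt"


def check_plot_section(plot):
    """Return a list of problems found in the plot section (empty if none)."""
    problems = []
    if plot.get("smooth") not in ("Y", "N"):
        problems.append("smooth must be 'Y' or 'N'")
    files = [key for key in plot if key.startswith("file")]
    if len(files) > MAX_PLOT_FILES:
        problems.append("at most 12 fileN entries are supported")
    return problems
```

Calling plot_name on the prediction file path gives the value to enter for fileN, and check_plot_section returns an empty list when the plot section respects the two constraints checked here.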

Auxiliary Functions

CAFA3 released its protein targets in September 2016. Each protein target has a unique CAFA3 ID. To run the assessment function above, each protein must be represented by its CAFA3 ID. However, the benchmark proteins generated by the benchmark creation tool are identified by UniProt accessions. We therefore provide functions to convert between UniProt accessions and CAFA3 IDs. We also provide a function that converts the benchmark files generated by the benchmark creation tool into a benchmark folder that can be fed into this program.
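
Before passing a benchmark folder to assess_main.py, it can be worth confirming that it contains the two sub-directories named in the description of the benchmark parameter, groundtruth and lists. The helper below is a minimal sketch of such a check; missing_benchmark_subdirs is a hypothetical name, not part of this repository, and the check says nothing about the files expected inside each sub-directory.

```python
import os

# Sub-directory names taken from the description of the benchmark parameter.
REQUIRED_SUBDIRS = ("groundtruth", "lists")


def missing_benchmark_subdirs(benchmark_dir):
    """Return the required sub-directories that are absent from benchmark_dir."""
    return [
        name
        for name in REQUIRED_SUBDIRS
        if not os.path.isdir(os.path.join(benchmark_dir, name))
    ]
```

An empty list means both sub-directories are present.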

  • benchmark_folder.py

    • Run python benchmark_folder.py -h for the syntax when using this function on its own.
    • If you are using our benchmark creation tool, the benchmark_pipeline.sh file is a good example of how to generate a benchmark folder for assess_main.py from the raw benchmarks.
    • Fill in your own folder names and gaf file names in the blanks left in benchmark_pipeline.sh.
  • ./ID_conversion/ID_conversion.py

    • This Python script contains two functions: one converts UniProt accessions to CAFA3 IDs, and the other converts CAFA3 IDs back to UniProt accessions.
    • First function: uniprotac_to_cafaid(taxon, uniprotacs).
    • Second function: cafaid_to_uniprot(taxon, cafaids).
    • Refer to the comments in the script ./ID_conversion/ID_conversion.py and to the third example below for usage.
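
The internals of ID_conversion.py are not reproduced here. The sketch below only illustrates the general idea of a two-way lookup between identifier systems. The table contents are made-up placeholders rather than real UniProt accessions or CAFA3 IDs, and convert and invert are hypothetical helpers that do not share the signatures of uniprotac_to_cafaid or cafaid_to_uniprot.

```python
# Made-up placeholder pairs for a single taxon; real mappings come from
# the CAFA3 target release, not from this table.
ACCESSION_TO_CAFA = {
    "ACC_A": "CAFA_ID_1",
    "ACC_B": "CAFA_ID_2",
}


def convert(ids, table):
    """Map each ID through table; IDs without an entry map to None."""
    return [table.get(identifier) for identifier in ids]


def invert(table):
    """Build the reverse lookup, assuming the mapping is one-to-one."""
    return {value: key for key, value in table.items()}
```

Converting in the other direction is then a matter of passing invert(ACCESSION_TO_CAFA) as the table.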

Examples

  • ./assess_main.py config.yaml
  • ./plot.py config.yaml
  • ./ID_conversion/ID_conversion.py ./ID_conversion/example_uniprot_accession_8355.txt 8355 ./ID_conversion/example_output.txt
  • ./benchmark_pipeline.sh

References

Zhou, N., Jiang, Y., Bergquist, T.R. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 20, 244 (2019) doi:10.1186/s13059-019-1835-8

Jiang, Y., Oron, T., Clark, W. et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 17, 184 (2016) doi:10.1186/s13059-016-1037-6

http://biofunctionprediction.org/cafa/