Keyphrase Extraction: The Open Problem of Algorithm Comparability and a Path toward its Resolution
Abstract
Keyphrases provide a compact representation of a document's content. Being more descriptive than single keywords, keyphrases enable efficient text mining. Keyphrase extraction remains a difficult problem due to the limited performance of existing algorithms. Developing new approaches requires a robust framework for evaluating proposed methods and comparing them with one another. However, almost no attention has been given to the validity of comparisons in the domain, and there is a lack of studies that comprehensively summarize the evaluation issues to be considered when comparing keyphrase extraction algorithms. This research aims to bridge this gap and to systematize the existing differences in the calculation of the F1-score, the most common evaluation measure in the domain. We demonstrate the extent to which the macro-average F1-score can vary when calculated for the same datasets and algorithms but in different manners. We propose the 'Q-10' evaluation pipeline, comprising ten questions about the evaluation process that should be considered for a correct comparison of results across different algorithms. Additionally, we collected quality evaluation scores for unsupervised methods using various calculations of the F1-score, and we compared the results of unsupervised approaches with those based on RNNs and transformers. The study involved 10 unsupervised algorithms and 6 well-known datasets.
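As a minimal illustration of how the F1-score can vary for the same predictions depending on the averaging scheme (the per-document counts below are invented, not taken from the paper's experiments):

```python
# Sketch: F1 for keyphrase extraction averaged per document ("macro")
# vs. computed over pooled counts ("micro"). The counts are hypothetical.

def f1(tp, n_pred, n_gold):
    """F1 from true positives, number of predicted and gold keyphrases."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# (true positives, predicted keyphrases, gold keyphrases) per document
docs = [(3, 10, 5), (1, 10, 20)]

# Macro-average: compute F1 per document, then take the mean.
macro = sum(f1(tp, np, ng) for tp, np, ng in docs) / len(docs)

# Micro-average: pool the counts over all documents first.
micro = f1(sum(d[0] for d in docs),
           sum(d[1] for d in docs),
           sum(d[2] for d in docs))

print(round(macro, 3), round(micro, 3))  # → 0.233 0.178
```

Both values are "the F1-score" on the same outputs, yet they differ noticeably; this is one of the calculation choices that makes cross-paper comparisons unreliable when left unreported.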
Keywords
Keyphrase extraction, evaluation problem, F1-score, keyphrase extraction algorithms comparison