For each of the individual frameworks, there are established ways of evaluating the quality of parser outputs in terms of graph similarity to gold-standard target representations (Dridan & Oepen, 2011; Cai & Knight, 2013; Oepen et al., 2014; Hershcovich et al., 2017).  The framework-specific evaluation metrics used to date are broadly similar, although there are some subtle differences too.  In a nutshell, meaning representation parsing is commonly evaluated in terms of a graph similarity F1 score at the level of individual node–edge–node triples, i.e. ‘atomic’ dependencies.  Variations among extant metrics relate to, among other things, how node correspondences across two graphs are established, whether edge labels can optionally be ignored in triple comparison, and how top nodes (and possibly additional node properties) are scored; see the formal background.
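To make the triple-level F1 idea concrete, here is a minimal sketch, assuming node correspondences between the two graphs have already been established (represented below by shared integer node identifiers); the toy graphs and edge labels are invented for illustration, not drawn from the task data:

```python
def triple_f1(gold, system):
    """Precision, recall, and F1 over atomic node-edge-node triples."""
    gold, system = set(gold), set(system)
    matched = len(gold & system)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Each triple is (source node, edge label, target node).
gold = {(0, "ARG0", 1), (0, "ARG1", 2), (2, "BV", 3)}
system = {(0, "ARG0", 1), (0, "ARG2", 2), (2, "BV", 3)}

p, r, f = triple_f1(gold, system)  # two of three triples match exactly
```

In this example, one mislabeled edge costs both precision and recall, so all three scores come out to 2/3.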

Evaluation Metrics

For the shared task, we will implement a (straightforward) generalization of existing, framework-specific metrics that (a) is applicable across different flavors of semantic graphs, (b) provides labeled and unlabeled variants, (c) does not require matching node anchoring, but (d) takes advantage of node ordering where available.  Labeled per-dependency scores will be the official metric for the task, but we will also provide additional cross-framework evaluation perspectives (e.g. considering larger sub-graphs, in the spirit of the complete predications metric of Oepen et al., 2015).  Finally, we will also score parser outputs with the ‘classic’ framework-specific metrics, for direct comparison to published prior results.
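The distinction between the labeled and unlabeled variants can be sketched as follows: the unlabeled score simply discards edge labels before comparing triples. This is an illustrative sketch only (node identifiers and graphs are invented), not the official scorer:

```python
def f1(gold, system):
    """F1 over two sets of comparable items."""
    matched = len(gold & system)
    p = matched / len(system) if system else 0.0
    r = matched / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def score(gold, system, labeled=True):
    """Triple F1; in the unlabeled variant, edge labels are ignored."""
    strip = (lambda t: t) if labeled else (lambda t: (t[0], t[2]))
    return f1({strip(t) for t in gold}, {strip(t) for t in system})

gold = {(0, "ARG0", 1), (0, "ARG1", 2)}
system = {(0, "ARG0", 1), (0, "ARG2", 2)}  # one mislabeled edge

labeled = score(gold, system, labeled=True)     # penalizes the wrong label
unlabeled = score(gold, system, labeled=False)  # rewards correct attachment
```

Here the labeled score is 0.5 (one of two triples matches), while the unlabeled score is 1.0, since both edges connect the right pair of nodes.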

Software Support

Baseline Results

XHTML 1.0 | Last updated: 2019-03-06 (22:03)