Background

For each of the individual frameworks, there are established ways of evaluating the quality of parser outputs in terms of graph similarity to gold-standard target representations (Dridan & Oepen, 2011; Cai & Knight, 2013; Oepen et al., 2014; Hershcovich et al., 2017). There is broad similarity between the framework-specific evaluation metrics used to date, although there are some subtle differences too. Meaning representation parsing is commonly evaluated in terms of a graph similarity F₁ score at the level of individual node–edge–node and node–property–value triples. Variations in extant metrics relate to among others, how node correspondences across two graphs are established, whether edge labels can optionally be ignored in triple comparison, and how top nodes (and other node properties, including anchoring) are evaluated; see the formal background page.

In a nutshell, semantic graphs in all frameworks can be broken down into ‘atomic’ component pieces, i.e. tuples capturing (a) top nodes, (b) node labels, (c) node properties, (d) node anchoring, (e) unlabeled edges, (f) edge labels, and (g) edge attributes. Not all tuple types apply to all frameworks, however.

		EDS	PTG	UCCA	AMR	DRG
A	Top Nodes	✓	✓	✓	✓	✓
B	Node Labels	✓	(✓)	✗	✓	(✓)
C	Node Properties	✓	✓	✗	✓	✗
D	Node Anchoring	✓	(✓)	(✓)	✗	✗
E	Directed Edges	✓	✓	✓	✓	✓
F	Edge Labels	✓	✓	✓	✓	(✓)
G	Edge Attributes	✗	(✓)	(✓)	✗	✗

To evaluate any of these tuple types, a correspondence relation must be established between nodes (and edges) from the gold-standard vs. the system graphs. This relation presuposes a notion of node (and edge) identities, which is where the various flavours and frameworks differ. In ‘classic’ bi-lexical (semantic) dependencies—e.g. the DM and PSD graphs from earlier shared tasks; our Flavor (0)—the nodes are surface lexical units (‘tokens’); their identities are uniquely determined as the character range of the corresponding sub-strings (rather than by token indices, which would not be robust to tokenization mis-matches)

In the Flavor (1) graphs (EDS, PTG, and UCCA), multiple distinct nodes can have overlapping or even identical anchors; in EDS, for example, the semantics of an adverb like today is decomposed into four nodes, all anchored to the same substring: implicit_q x: time_n(x) ∧ _today_a_1(x) ∧ temp_loc(e, x). The standard EDS and UCCA evaluation metrics determine node identities through anchors (and transitively the union of child anchors, in the case of UCCA) and allow many-to-many correspondences across the gold-standard and system graphs (Oepen et al., 2014; Hershcovich et al., 2017).

Finally, as Flavor (2) frameworks, nodes in AMR graphs and DRG are unanchored. Thus, node-to-node correspondences need to be established (as one-to-one equivalence classes of node identifiers), to maximize the set of shared tuples between each pair of graphs. Abstractly, this is an instance of the NP-hard maximum common edge subgraph isomorphism problem (where node-local tuples can be modeled as ‘pseudo-edges’ with globally unique target nodes). The SMATCH scorer for AMR approximates a solution through a hill-climbing search for high-scoring correspondences, with a fixed number of random restarts (Cai & Knight, 2013).

Evaluation Metrics

For the shared task, we have implemented a generalization of existing, framework-specific metrics, along the lines above. Our goal is for the unified MRP metric to (a) be applicable across different flavors of semantic graphs, (b) enable labeled and unlabeled variants, as much as possible, (c) not require corresponding node anchoring, but (d) minimize concerns about non-deterministic approximations, and (e) take advantage of anchoring information when available. Labeled per-node and per-edge F₁ score averages will be the official metric for the task.

The basic principle is that all information presented in the MRP graph representations is scored with equal weight, i.e. all applicable tuple types for each framework. However, types (E) and (F) are not considered separately (i.e. edges and their labels), in part because it does not appear desirable to try and give credit for edges with incompatible labels (e.g. an ARG1 with an ARG3), in part because it makes the search for node-to-node correspondences somewhat more tractable. There is no special status (or ‘primacy’) to anchoring in this scheme: unlike the original SDP, EDM, and UCCA metrics, the MRP scorer searches for a correspondence relation between the gold-standard and system graphs that maximizes tuple overlap. Thus, the MRP approach is abstractly similar to SMATCH, but using a search algorithm that considers the full range of different tuple types and finds an exact solution in the majority of cases (while avoiding a few known issues related to over-counting, normalization, and top nodes).

Anchoring (for all frameworks but AMR) in this scheme is treated on a par with node labels and properties, labeled edges, and edge attributes. Likewise, node properties like tense, mood etc. (in EDS) or frame and sempos (in PTG) are scored with equal weight as the node labels, as such properties further specify the semantic predication or disambiguate its predicate sense. For AMR evaluation, there is an exception to the above principle that all information in MRP graphs be scored equally: The MRP representations of AMR graphs preserve the tree-like topology used in AMR annotations, using ‘inverted’ edges with labels like ARG0-of. To make explicit which edges actually are inverted, the MRP graphs provide an additional normal property, which is present only an inverted edges and provides the corresponding ‘base’ label (e.g. ARG0). AMR graphs are standardly evaluated in normalized form, i.e. with inverted edges restored to their ‘base’ directionality and label.

Time-permitting, we hope to also consider additional cross-framework evaluation perspectives (e.g. larger sub-graphs in the spirit of the complete predications metric of Oepen et al., 2015 or ‘semantic n-grams’ in the SemBleu proposal of Song & Gildea, 2019). Finally, we will also score parser outputs in the ‘classic’ framework-specific metrics, for direct comparison to published prior results.

Software Support

MRP scoring is implemented in the mtool software (the Swiss Army Knife of Meaning Representation), which is hosted in a public Microsoft GitHub repository to stimulate community engagement. mtool implements a refinement of the maximum common edge subgraph (MCES) algorithm by McGregor (1982), scheduling candidate node-to-node correspondences based on pre-computed per-node rewards and upper bounds on adjacent edge correspondences. In addition to the cross-framework MRP metric, the tool also provides reference implementations of the SDP, EDM, SMATCH, and UCCA metrics, in the case of SDP and UCCA generalized to support character-based anchoring (rather than using token indices).

Value comparison in MRP evaluation will be robust to ‘uninteresting’ variation, i.e. different encoding of essentially the same information. Specifically, literal values will always be compared as case-insensitive strings, such that for example 42 (an integer) and "42" (a string) are considered equivalent, as are "Pierre" and "pierre"; this applies to node and edge labels, node properties, and edge attributes. Anchor values will be normalized for comparison into sets of non-whitespace character positions: assuming the underlying input string contains whitespace at character position 6, the following are considered equivalent:

[{"from": 0, "to": 13}]
[{"from": 0, "to": 6}, {"from": 7, "to": 13}]

Furthermore, character positions corresponding to basic punctuation marks in the left or right periphery of a normalized anchor will be discarded for comparison:

. ? ! ; , : “ " ” ‘ ' ’ ( ) [ ] { }