# Background

For each of the individual frameworks, there are established ways of evaluating the
quality of parser outputs in terms of graph similarity to gold-standard target
representations
(Dridan & Oepen, 2011;
Cai & Knight, 2013;
Oepen et al., 2014;
Hershcovich et al., 2017).
There is broad similarity between the framework-specific evaluation
metrics used to date, although there are some subtle differences too.
Meaning representation parsing is commonly evaluated in
terms of a graph similarity F_{1} score at the level of individual
node–edge–node and node–property–value triples.
Variations in extant metrics relate to among others, how node
correspondences across two graphs are established, whether edge labels
can optionally be ignored in triple comparison, and how top nodes
(and other node properties, including anchoring) are evaluated; see the
formal background page.

In a nutshell, semantic graphs in all frameworks can be broken down into ‘atomic’ component pieces, i.e. tuples capturing (a) top nodes, (b) node labels, (c) node properties, (d) node anchoring, (e) unlabeled edges, (f) edge labels, and (g) edge attributes. Not all tuple types apply to all frameworks, however.

EDS | PTG | UCCA | AMR | DRG | ||
---|---|---|---|---|---|---|

A | Top Nodes |
✓ | ✓ | ✓ | ✓ | ✓ |

B | Node Labels |
✓ | (✓) | ✗ | ✓ | (✓) |

C | Node Properties |
✓ | ✓ | ✗ | ✓ | ✗ |

D | Node Anchoring |
✓ | (✓) | (✓) | ✗ | ✗ |

E | Directed Edges |
✓ | ✓ | ✓ | ✓ | ✓ |

F | Edge Labels |
✓ | ✓ | ✓ | ✓ | (✓) |

G | Edge Attributes |
✗ | (✓) | (✓) | ✗ | ✗ |

To evaluate any of these tuple types, a *correspondence* relation must
be established between nodes (and edges) from the gold-standard vs. the system
graphs.
This relation presuposes a notion of node (and edge) *identities*, which
is where the various flavours and frameworks differ.
In ‘classic’ bi-lexical (semantic) dependencies—e.g. the DM and PSD graphs
from
earlier shared tasks; our
Flavor (0)—the nodes are
surface lexical units (‘tokens’); their identities are uniquely determined as
the character range of the corresponding sub-strings (rather than by token
indices, which would not be robust to tokenization mis-matches)

In the Flavor (1) graphs
(EDS, PTG, and UCCA), multiple distinct nodes can have overlapping or even identical
anchors; in EDS, for example, the semantics of an adverb like *today* is
decomposed into four nodes, all anchored to the same substring:
`implicit_q` *x*:
`time_n`(*x*) ∧ `_today_a_1`(*x*)
∧ `temp_loc`(*e*, *x*).
The standard EDS and UCCA evaluation metrics determine node identities through
anchors (and transitively the union of child anchors, in the case of UCCA) and
allow many-to-many correspondences across the gold-standard and system graphs
(Oepen et al., 2014;
Hershcovich et al., 2017).

Finally, as Flavor (2) frameworks, nodes in AMR graphs and DRG are unanchored. Thus, node-to-node correspondences need to be established (as one-to-one equivalence classes of node identifiers), to maximize the set of shared tuples between each pair of graphs. Abstractly, this is an instance of the NP-hard maximum common edge subgraph isomorphism problem (where node-local tuples can be modeled as ‘pseudo-edges’ with globally unique target nodes). The SMATCH scorer for AMR approximates a solution through a hill-climbing search for high-scoring correspondences, with a fixed number of random restarts (Cai & Knight, 2013).

# Evaluation Metrics

For the shared task, we have
implemented
a generalization of existing, framework-specific metrics, along the lines
above.
Our goal is for the unified MRP metric to (a) be applicable across different
flavors of semantic graphs, (b) enable labeled and unlabeled variants, as much
as possible, (c)
not require corresponding node anchoring, but (d) minimize concerns about
non-deterministic approximations, and (e) take advantage of anchoring
information when available.
Labeled per-node and per-edge F_{1} score averages will be the
official metric for the task.

The basic principle is that all information presented in the
MRP graph
representations is scored with equal weight, i.e. all applicable tuple
types for each framework.
However, types (E) and (F) are not considered separately (i.e. edges and
their labels), in part because it does not appear desirable to try and give
credit for edges with incompatible labels (e.g. an `ARG1`

with an
`ARG3`

), in part because it makes the search for node-to-node
correspondences somewhat more tractable.
There is no special status (or ‘primacy’) to anchoring in this scheme:
unlike the original SDP, EDM, and UCCA metrics, the MRP scorer searches for
a correspondence relation between the gold-standard and system graphs that
maximizes tuple overlap.
Thus, the MRP approach is abstractly similar to SMATCH, but using a search
algorithm that considers the full range of different tuple types and finds an
exact solution in the majority of cases (while avoiding a few known issues
related to
over-counting,
normalization, and
top nodes).

Anchoring (for all frameworks but AMR) in this scheme is treated on a par with
node labels and properties, labeled edges, and edge attributes.
Likewise, node properties like `tense`

, `mood`

etc.
(in EDS) or `frame`

and `sempos`

(in PTG) are scored with
equal weight as the node labels, as such properties further specify the
semantic predication or disambiguate its predicate sense.
For AMR evaluation, there is an exception to the above principle that all
information in MRP graphs be scored equally:
The MRP representations of AMR graphs preserve the tree-like topology used in
AMR annotations, using ‘inverted’ edges with labels like
`ARG0-of`

.
To make explicit which edges actually are inverted, the MRP graphs provide an
additional `normal`

property, which is present only an inverted
edges and provides the corresponding ‘base’ label (e.g. `ARG0`

).
AMR graphs are standardly evaluated in normalized form, i.e. with inverted
edges restored to their ‘base’ directionality and label.

Time-permitting, we hope to also consider additional cross-framework evaluation
perspectives (e.g. larger sub-graphs in the spirit of the
*complete predications* metric of
Oepen
et al., 2015 or ‘semantic n-grams’ in the SemBleu
proposal of
Song &
Gildea, 2019).
Finally, we will also score parser outputs in the ‘classic’ framework-specific
metrics, for direct comparison to published prior results.

# Software Support

MRP scoring is implemented in the
`mtool`

software (the Swiss Army Knife of Meaning Representation), which is hosted
in a public
Microsoft GitHub
repository to stimulate community engagement.
`mtool`

implements a refinement of the maximum common edge subgraph
(MCES) algorithm by
McGregor (1982), scheduling candidate node-to-node
correspondences based on pre-computed per-node rewards and upper bounds on
adjacent edge correspondences.
In addition to the cross-framework MRP metric, the tool also provides
reference implementations of the SDP, EDM, SMATCH, and UCCA metrics, in the
case of SDP and UCCA generalized to support character-based anchoring (rather
than using token indices).

Value comparison in MRP evaluation will be robust to ‘uninteresting’ variation,
i.e. different encoding of essentially the same information.
Specifically, literal values will always be compared as *case-insensitive*
strings, such that for example `42`

(an integer) and `"42"`

(a string) are considered equivalent, as are `"Pierre"`

and
`"pierre"`

;
this applies to node and edge labels, node properties, and edge attributes.
Anchor values will be normalized for comparison into sets of non-whitespace character
positions: assuming the underlying input string contains whitespace at character
position `6`, the following are considered equivalent:

`[{"from": 0, "to": 13}]`

`[{"from": 0, "to": 6}, {"from": 7, "to": 13}]`

Furthermore, character positions corresponding to basic punctuation marks in the left or right periphery of a normalized anchor will be discarded for comparison:

`.`

`?`

`!`

`;`

`,`

`:`

`“`

`"`

`”`

`‘`

`'`

`’`

`(`

`)`

`[`

`]`

`{`

`}`