Background
For each of the individual frameworks, there are established ways of evaluating the quality of parser outputs in terms of graph similarity to gold-standard target representations (Dridan & Oepen, 2011; Cai & Knight, 2013; Oepen et al., 2014; Hershcovich et al., 2017). There is broad similarity between the framework-specific evaluation metrics used to date, although there are some subtle differences too. Meaning representation parsing is commonly evaluated in terms of a graph similarity F1 score at the level of individual node–edge–node and node–property–value triples. Variations among extant metrics relate to, among others, how node correspondences across two graphs are established, whether edge labels can optionally be ignored in triple comparison, and how top nodes (and other node properties, including anchoring) are evaluated; see the formal background page.
In a nutshell, semantic graphs in all frameworks can be broken down into ‘atomic’ component pieces, i.e. tuples capturing (a) top nodes, (b) node labels, (c) node properties, (d) node anchoring, (e) unlabeled edges, (f) edge labels, and (g) edge attributes. Not all tuple types apply to all frameworks, however.
| | Tuple Type | EDS | PTG | UCCA | AMR | DRG |
|---|---|---|---|---|---|---|
| A | Top Nodes | ✓ | ✓ | ✓ | ✓ | ✓ |
| B | Node Labels | ✓ | (✓) | ✗ | ✓ | (✓) |
| C | Node Properties | ✓ | ✓ | ✗ | ✓ | ✗ |
| D | Node Anchoring | ✓ | (✓) | (✓) | ✗ | ✗ |
| E | Directed Edges | ✓ | ✓ | ✓ | ✓ | ✓ |
| F | Edge Labels | ✓ | ✓ | ✓ | ✓ | (✓) |
| G | Edge Attributes | ✗ | (✓) | (✓) | ✗ | ✗ |
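To make the tuple inventory concrete, the following sketch decomposes a small graph into such tuples. This is an illustration only, over a simplified dictionary encoding loosely modeled on the MRP serialization (with node properties and edge attributes as plain dictionaries); it is not the mtool implementation. Node references are kept together in one position of each tuple so that a scorer can later rewrite them under a candidate node-to-node correspondence.

```python
# Illustrative sketch only (not the mtool implementation): break a simplified
# MRP-style graph into the atomic tuple types (A)-(G) listed above.
def decompose(graph):
    tuples = set()
    for top in graph.get("tops", []):
        tuples.add(("top", (top,), None))                       # (A) top nodes
    for node in graph.get("nodes", []):
        i = node["id"]
        if "label" in node:
            tuples.add(("label", (i,), node["label"]))          # (B) node labels
        for key, value in node.get("properties", {}).items():
            tuples.add(("property", (i,), (key, value)))        # (C) node properties
        for anchor in node.get("anchors", []):
            tuples.add(("anchor", (i,), (anchor["from"], anchor["to"])))  # (D) anchoring
    for edge in graph.get("edges", []):
        pair = (edge["source"], edge["target"])
        tuples.add(("edge", pair, edge.get("label")))           # (E)+(F) labeled edges
        for key, value in edge.get("attributes", {}).items():
            tuples.add(("attribute", pair, (key, value)))       # (G) edge attributes
    return tuples
```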
To evaluate any of these tuple types, a correspondence relation must be established between nodes (and edges) from the gold-standard vs. the system graphs. This relation presupposes a notion of node (and edge) identities, which is where the various flavors and frameworks differ. In ‘classic’ bi-lexical (semantic) dependencies—e.g. the DM and PSD graphs from earlier shared tasks; our Flavor (0)—the nodes are surface lexical units (‘tokens’); their identities are uniquely determined as the character range of the corresponding sub-strings (rather than by token indices, which would not be robust to tokenization mismatches).
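As a small, hypothetical illustration of why character ranges are a more robust notion of identity than token indices:

```python
# Hypothetical example: character-based node identities are stable across
# different tokenizations of the same input string, while token indices are not.
sentence = "Pierre Vinken, 61 years old"

tokens_a = ["Pierre", "Vinken", ",", "61", "years", "old"]   # comma split off
tokens_b = ["Pierre", "Vinken,", "61", "years", "old"]       # comma attached

# Under tokenization A the token '61' has index 3, under B it has index 2;
# its character range (15, 17) is the same either way.
assert sentence[15:17] == "61"
assert tokens_a[3] == "61" and tokens_b[2] == "61"
```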
In the Flavor (1) graphs (EDS, PTG, and UCCA), multiple distinct nodes can have overlapping or even identical anchors; in EDS, for example, the semantics of an adverb like today is decomposed into four nodes, all anchored to the same substring: implicit_q x: time_n(x) ∧ _today_a_1(x) ∧ temp_loc(e, x). The standard EDS and UCCA evaluation metrics determine node identities through anchors (and transitively the union of child anchors, in the case of UCCA) and allow many-to-many correspondences across the gold-standard and system graphs (Oepen et al., 2014; Hershcovich et al., 2017).
Finally, as Flavor (2) frameworks, AMR and DRG graphs have unanchored nodes. Thus, node-to-node correspondences need to be established (as one-to-one equivalence classes of node identifiers) to maximize the set of shared tuples between each pair of graphs. Abstractly, this is an instance of the NP-hard maximum common edge subgraph isomorphism problem (where node-local tuples can be modeled as ‘pseudo-edges’ with globally unique target nodes). The SMATCH scorer for AMR approximates a solution through a hill-climbing search for high-scoring correspondences, with a fixed number of random restarts (Cai & Knight, 2013).
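To give a flavor of this kind of correspondence search, the following is a much-simplified sketch of a SMATCH-style hill-climbing strategy over the tuple encoding sketched above; the actual SMATCH scorer and the MRP search differ in many details, and for simplicity the sketch assumes the two graphs have equally many nodes.

```python
import random

def rename(tuples, mapping):
    """Rewrite the node identifiers in a set of tuples under a candidate mapping."""
    return {(kind, tuple(mapping[n] for n in nodes), payload)
            for kind, nodes, payload in tuples}

def hill_climb(system_nodes, gold_nodes, system_tuples, gold_tuples,
               restarts=5, seed=0):
    """Greedy search (with random restarts) for a one-to-one correspondence from
    system to gold node identifiers that maximizes the number of shared tuples."""
    rng = random.Random(seed)
    best_mapping, best_matches = None, -1
    for _ in range(restarts):
        targets = list(gold_nodes)
        rng.shuffle(targets)
        mapping = dict(zip(system_nodes, targets))       # random initial mapping
        current = len(rename(system_tuples, mapping) & gold_tuples)
        improved = True
        while improved:                                  # climb until no swap helps
            improved = False
            for a in system_nodes:
                for b in system_nodes:
                    if a == b:
                        continue
                    mapping[a], mapping[b] = mapping[b], mapping[a]
                    swapped = len(rename(system_tuples, mapping) & gold_tuples)
                    if swapped > current:
                        current, improved = swapped, True
                    else:                                # undo an unhelpful swap
                        mapping[a], mapping[b] = mapping[b], mapping[a]
        if current > best_matches:
            best_mapping, best_matches = dict(mapping), current
    return best_mapping, best_matches
```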
Evaluation Metrics
For the shared task, we have implemented a generalization of existing, framework-specific metrics, along the lines above. Our goal is for the unified MRP metric to (a) be applicable across different flavors of semantic graphs, (b) enable labeled and unlabeled variants, as much as possible, (c) not require corresponding node anchoring, but (d) minimize concerns about non-deterministic approximations, and (e) take advantage of anchoring information when available. Labeled per-node and per-edge F1 score averages will be the official metric for the task.
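The underlying arithmetic is ordinary tuple-level precision, recall, and F1; a minimal sketch with hypothetical counts:

```python
def prf(matched, gold_total, system_total):
    """Tuple-level precision, recall, and F1 over matched tuple counts."""
    precision = matched / system_total if system_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 of 10 system tuples match against 12 gold tuples.
print(prf(8, 12, 10))   # (0.8, 0.666..., 0.727...)
```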
The basic principle is that all information present in the MRP graph representations is scored with equal weight, i.e. all applicable tuple types for each framework. However, types (E) and (F) are not considered separately (i.e. edges and their labels), in part because it does not appear desirable to give credit for edges with incompatible labels (e.g. an ARG1 matched against an ARG3), and in part because it makes the search for node-to-node correspondences somewhat more tractable.
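In terms of the illustrative tuple encoding sketched earlier, combining (E) and (F) simply means that the label is part of the edge tuple, so a mislabeled edge contributes nothing:

```python
# The edge label is part of the tuple, so an ARG1 edge predicted as ARG3
# receives no partial credit.
gold = {("edge", (0, 1), "ARG1")}
system = {("edge", (0, 1), "ARG3")}
assert not gold & system
```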
There is no special status (or ‘primacy’) to anchoring in this scheme: unlike the original SDP, EDM, and UCCA metrics, the MRP scorer searches for a correspondence relation between the gold-standard and system graphs that maximizes tuple overlap. Thus, the MRP approach is abstractly similar to SMATCH, but uses a search algorithm that considers the full range of different tuple types and finds an exact solution in the majority of cases (while avoiding a few known issues related to over-counting, normalization, and top nodes). Anchoring (for all frameworks but AMR) in this scheme is treated on a par with node labels and properties, labeled edges, and edge attributes.
Likewise, node properties like tense, mood, etc. (in EDS) or frame and sempos (in PTG) are scored with the same weight as node labels, as such properties further specify the semantic predication or disambiguate its predicate sense.
For AMR evaluation, there is an exception to the above principle that all information in MRP graphs be scored equally: the MRP representations of AMR graphs preserve the tree-like topology used in AMR annotations, using ‘inverted’ edges with labels like ARG0-of. To make explicit which edges actually are inverted, the MRP graphs provide an additional normal property, which is present only on inverted edges and provides the corresponding ‘base’ label (e.g. ARG0). AMR graphs are standardly evaluated in normalized form, i.e. with inverted edges restored to their ‘base’ directionality and label.
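As an illustration (not the mtool code), the normalization amounts to something like the following, assuming the normal value is available as a field on the edge as described above:

```python
def normalize_edge(edge):
    """Restore an 'inverted' AMR edge to its base directionality and label,
    e.g. (boy) -ARG0-of-> (want) becomes (want) -ARG0-> (boy); all other
    edges are returned unchanged."""
    if "normal" in edge:
        return {"source": edge["target"],      # flip the direction
                "target": edge["source"],
                "label": edge["normal"]}       # and use the base label
    return edge

# Hypothetical inverted edge between node 0 (boy) and node 1 (want).
inverted = {"source": 0, "target": 1, "label": "ARG0-of", "normal": "ARG0"}
print(normalize_edge(inverted))   # {'source': 1, 'target': 0, 'label': 'ARG0'}
```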
Time-permitting, we hope to also consider additional cross-framework evaluation perspectives (e.g. larger sub-graphs in the spirit of the complete predications metric of Oepen et al., 2015 or ‘semantic n-grams’ in the SemBleu proposal of Song & Gildea, 2019). Finally, we will also score parser outputs in the ‘classic’ framework-specific metrics, for direct comparison to published prior results.
Software Support
MRP scoring is implemented in the mtool software (the Swiss Army Knife of Meaning Representation), which is hosted in a public Microsoft GitHub repository to stimulate community engagement. mtool implements a refinement of the maximum common edge subgraph (MCES) algorithm by McGregor (1982), scheduling candidate node-to-node correspondences based on pre-computed per-node rewards and upper bounds on adjacent edge correspondences. In addition to the cross-framework MRP metric, the tool also provides reference implementations of the SDP, EDM, SMATCH, and UCCA metrics, in the case of SDP and UCCA generalized to support character-based anchoring (rather than using token indices).
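The scheduling idea can be illustrated roughly as follows (an approximation of the idea, not the mtool code): for every candidate gold–system node pair, an exact node-local reward and an optimistic bound on matchable adjacent edges are cheap to pre-compute, and candidate correspondences can then be explored in decreasing order of that sum, which helps prune the search early.

```python
from collections import Counter

def node_reward(g_node, s_node):
    """Node-local tuples (label, properties, anchors) that would match exactly
    if this gold and system node were put in correspondence."""
    reward = int("label" in g_node and g_node.get("label") == s_node.get("label"))
    reward += len(set(g_node.get("properties", {}).items())
                  & set(s_node.get("properties", {}).items()))
    g_anchors = {(a["from"], a["to"]) for a in g_node.get("anchors", [])}
    s_anchors = {(a["from"], a["to"]) for a in s_node.get("anchors", [])}
    return reward + len(g_anchors & s_anchors)

def edge_upper_bound(g_edges, s_edges, g_id, s_id):
    """Optimistic bound on adjacent edge matches for a candidate pair: per edge
    label, no more than min(gold count, system count) edges can ever match."""
    g_labels = Counter(e["label"] for e in g_edges if g_id in (e["source"], e["target"]))
    s_labels = Counter(e["label"] for e in s_edges if s_id in (e["source"], e["target"]))
    return sum(min(count, s_labels[label]) for label, count in g_labels.items())
```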
Value comparison in MRP evaluation will be robust to ‘uninteresting’ variation, i.e. different encodings of essentially the same information. Specifically, literal values will always be compared as case-insensitive strings, such that for example 42 (an integer) and "42" (a string) are considered equivalent, as are "Pierre" and "pierre"; this applies to node and edge labels, node properties, and edge attributes.
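A minimal sketch of this comparison (mtool's actual implementation may differ in detail):

```python
def normalize_value(value):
    """Compare literal values as case-insensitive strings."""
    return str(value).lower()

assert normalize_value(42) == normalize_value("42")
assert normalize_value("Pierre") == normalize_value("pierre")
```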
Anchor values will be normalized for comparison into sets of non-whitespace character
positions: assuming the underlying input string contains whitespace at character
position 6, the following are considered equivalent:
[{"from": 0, "to": 13}]
[{"from": 0, "to": 6}, {"from": 7, "to": 13}]
Furthermore, character positions corresponding to basic punctuation marks in the left or right periphery of a normalized anchor will be discarded for comparison; the punctuation marks considered are: . ? ! ; , : “ " ” ‘ ' ’ ( ) [ ] { }
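Putting the two normalization steps together, a minimal sketch (an illustration, not the mtool code; for concreteness it uses the hypothetical input string "Pierre Vinken", which has whitespace at character position 6 as in the example above):

```python
# Basic punctuation marks stripped from the periphery of a normalized anchor.
PUNCTUATION = set(".?!;,:“”‘’()[]{}") | {'"', "'"}

def normalize_anchor(anchors, text):
    """Turn a list of {"from": ..., "to": ...} spans into a set of non-whitespace
    character positions, dropping peripheral punctuation."""
    positions = sorted(i for a in anchors for i in range(a["from"], a["to"])
                       if not text[i].isspace())
    while positions and text[positions[0]] in PUNCTUATION:
        positions.pop(0)                                   # left periphery
    while positions and text[positions[-1]] in PUNCTUATION:
        positions.pop()                                    # right periphery
    return set(positions)

# The two anchor encodings from the example above come out as equivalent.
text = "Pierre Vinken"
assert (normalize_anchor([{"from": 0, "to": 13}], text)
        == normalize_anchor([{"from": 0, "to": 6}, {"from": 7, "to": 13}], text))
```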