Training Data
The table below summarizes the training data that will be provided for the task. The task operates as what is sometimes called a closed track, i.e. participants are constrained in which additional data and pre-trained models may legitimately be used in system development; see below. While some of the semantic graph frameworks in the task continue to evolve and periodically make available revised and extended data, we anticipate that these selections will provide stable reference points for empirical comparison for at least a couple of years following the task.
| | DM | PSD | EDS | UCCA | AMR |
|---|---|---|---|---|---|
| Text Type | newspaper | newspaper | newspaper | mixed | mixed |
| Sentences | 35,656 | 35,656 | 35,656 | 6,572 | 56,240 |
| Tokens | 802,717 | 802,717 | 802,717 | 138,268 | 1,000,217 |
The DM and PSD data sets are annotations over the exact same selection of texts, which for the previous SemEval tasks were aligned at the sentence and token levels. As DM was originally derived from EDS, the EDS graphs cover the same texts. The training data for these frameworks draws on a homogeneous source: the venerable WSJ text first annotated in the Penn Treebank (PTB), specifically Sections 00–20. As a common point of reference, the task organizers released a sample of 100 WSJ sentences annotated in all five frameworks in early April 2019.
UCCA training annotations are over web review text from the English Web Treebank and English Wikipedia articles on celebrities. While in principle UCCA structures are not confined to a single sentence (about 0.18% of edges cross sentence boundaries), passages are split into individual sentences, discarding the relations between them, to create a standard setting across the frameworks.
AMR annotations are drawn from a wide variety of texts, with the majority of sentences coming from on-line discussion forums. The training corpus also contains newswire, folktales, fiction, and Wikipedia articles.
Because some of the semantic graph banks involved in the shared task were originally released by the Linguistic Data Consortium (LDC), we will rely on the LDC to distribute the training data to participants under no-cost evaluation licenses. Registration for the task will be a prerequisite for data access. Upon completion of the competition, we will package all task data (including system submissions and evaluation results) for general release by the LDC, and will make the copyright-free subsets available for public, open-source download.
Companion Data
At a technical level, training (and evaluation) data will be distributed in two formats: (a) as sequences of ‘raw’ sentence strings and (b) in pre-tokenized, PoS-tagged, and lemmatized form. For the latter, we provide premium-quality English morpho-syntactic analyses to participants, obtained by training a state-of-the-art dependency parser (the ‘post-futuristic’ development version of UDPipe; Straka, 2018) on the union of available syntactic training data for English, using jack-knifing (where required) to avoid overlap between the morpho-syntactic training data and the texts underlying the semantic graph banks of the task. In the context of MRP 2019, these parser outputs are referred to as morpho-syntactic companion trees. Whether merely as a source of fairly decent PTB-style tokenization, or as a vantage point for approaches to meaning representation parsing that start from explicit syntactic structure, this optional resource will hopefully offer community value in its own right. The underlying parsing models and software will become publicly available upon completion of the shared task. Additionally, versions of the companion package starting from mid-June 2019 include automatically generated reference anchorings (commonly called ‘alignments’ in AMR parsing) for the AMR graphs in the training data, obtained with the JAMR and ISI tools of Flanigan et al. (2016) and Pourdamghani et al. (2014), respectively.
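To illustrate the jack-knifing scheme (this is a minimal sketch of the general idea, not the actual UDPipe training setup), consider the following Python fragment, where `train` and `parse` are hypothetical placeholders for a real parser API:

```python
# Minimal jack-knifing sketch: each sentence is annotated by a model that
# never saw that sentence during training. `train` and `parse` are
# hypothetical placeholders, not the UDPipe API.

def jackknife(sentences, train, parse, folds=10):
    analyses = [None] * len(sentences)
    for k in range(folds):
        held_out = [i for i in range(len(sentences)) if i % folds == k]
        model = train([sentences[i] for i in range(len(sentences))
                       if i % folds != k])        # train on the other folds
        for i in held_out:                        # annotate the held-out fold
            analyses[i] = parse(model, sentences[i])
    return analyses
```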
For reasons of comparability and fairness, the MRP 2019 shared task constrains which additional data or pre-trained models (e.g. corpora, word embeddings, lexica, or other annotations) can legitimately be used besides the resources distributed by the task organizers. The overall principle is that all participants should be able to draw on the same range of data. However, the organizers expect to keep such constraints to the minimum required and invite participants to suggest relevant data or models. To make precise which resources can be used in system development in addition to the data provided by the task organizers, there is an official ‘white-list’ of legitimate resources. The organizers welcome suggestions for additional data to white-list; in case you anticipate wanting to use resources that are not currently on the MRP white-list, please contact the organizers no later than June 3, 2019. The list will be closed and frozen after that date.
Evaluation Data
For all five frameworks, there are established in-domain evaluation sets, which will also serve as test data in the shared task. Additionally, there are common out-of-domain evaluation sets for DM, PSD, EDS, and UCCA (where training data is relatively homogeneous); furthermore, the task organizers will prepare a new (smallish) test set with gold-standard annotations in all frameworks. The instructions for prospective participants provide further information on the nature and scope of evaluation data for MRP 2019.
The evaluation data will be published in the same file format as the training and companion data, viz. the JSON-based uniform MRP interchange format. The target graphs (i.e. the `nodes`, `edges`, and `tops` fields) will of course not be available until completion of the evaluation period, but high-quality tokenization, PoS tags, lemmatization, and syntactic dependency trees will be provided for the evaluation data in the same manner as through the morpho-syntactic companion trees for the training data.
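For concreteness, an evaluation-time item might look roughly as follows (all values are made up for illustration; the individual properties are defined in the following section). The point is that the target fields `tops`, `nodes`, and `edges` are simply absent:

```python
import json

# Sketch of an evaluation-time graph object: metadata and `input` only;
# the target fields (tops, nodes, edges) are withheld. All values are
# illustrative, not taken from the actual evaluation data.
item = {
    "id": "sample-42",        # hypothetical identifier
    "flavor": 0,
    "framework": "dm",
    "version": 0.9,           # hypothetical version number
    "time": "2019-06-01",
    "input": "Pierre Vinken, 61 years old, will join the board as a "
             "nonexecutive director Nov. 29.",
}
print(json.dumps(item))
```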
Uniform Graph Interchange Format
Besides differences in anchoring, the frameworks also vary in how they label nodes and edges, and in the degree to which they allow multiple edges between two nodes, multiple outgoing edges with the same label, or multiple instances of the same property on a node. Node labels for Flavor (0) graphs typically are lemmas, optionally combined with a (morpho-syntactic) part of speech and a (syntactico-semantic) sense or frame identifier. Node labels for the other graph flavors tend to be more abstract, i.e. are interpreted as concept or relation identifiers (where for the vast majority, of course, there too is a systematic relationship to lemmas, lexical categories, and (sub-)senses). Graph nodes in UCCA are formally unlabeled, and anchoring is used to relate leaf nodes of these graphs to input sub-strings. Edge labels, by contrast, in all cases come from a fixed and relatively small inventory of (semantic) argument names, though there is stark variation in label granularity (ranging between about a dozen in UCCA and around 90 and 100 in PSD and AMR, respectively). For the shared task, we have for the first time repackaged the five graph banks into a uniform and normalized abstract representation with a common serialization format.
The common interchange format for semantic graphs implements the abstract model of Kuhlmann & Oepen (2016) as a JSON-based serialization for graphs across frameworks. This format describes general directed graphs, with structured node and edge labels, and optional anchoring and ordering of nodes. JSON is easily manipulated in all programming languages and offers parser developers the option of ‘in situ’ augmentation of the graph representations from the task with system-specific additional information, e.g. by adding private properties to the JSON objects. The MRP serialization is based on the JSON Lines format, where a stream of objects is serialized with line breaks as the separator character.
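Since each graph occupies exactly one line, reading an MRP file reduces to parsing one JSON object per line; a minimal sketch in Python, assuming a file with the hypothetical name `training.mrp`:

```python
import json

# Read a JSON Lines stream of MRP graphs: one JSON object per line.
with open("training.mrp", encoding="utf-8") as stream:
    graphs = [json.loads(line) for line in stream if line.strip()]

for graph in graphs:
    print(graph["id"], graph["framework"], len(graph.get("nodes", [])))
```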
Each MRP graph is represented as a JSON object with top-level properties `tops`, `nodes`, and `edges`; these are discussed in more detail below. Additionally, the `input` property on all graphs presents the ‘raw’ surface string corresponding to the graph; thus, parser inputs for the task are effectively assumed to be sentence-segmented but not pre-tokenized. Additional information about each graph is provided as the properties `id` (a string), `flavor` (an integer in the range 0–2), `framework` (a string), `version` (a decimal number), and `time` (a string encoding when the graph was serialized).
The `nodes` and `edges` values on graphs are each list-valued, but the order among list elements is only meaningful for the `nodes` of Flavor (0) graphs. Node objects have an obligatory `id` property (an integer) and optional properties called `label`, `properties` and `values`, as well as `anchors`. The `label` (a string) has a distinguished status in evaluation; the `properties` and `values` are both list-valued, such that elements of the two lists correspond by position. Together, the two lists present a framework-specific, non-recursive attribute–value matrix (where duplicate properties are in principle allowed). The `anchors` list, if present, contains pairs of `from`–`to` sub-string indices into the `input` string of the graph.
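To make the parallel-list convention concrete, the following sketch shows a hypothetical node object (all label and property names are illustrative) and the reconstruction of its attribute–value matrix:

```python
# Hypothetical node object: `properties` and `values` correspond by position.
node = {
    "id": 0,
    "label": "_dog_n_1",                # illustrative node label
    "properties": ["pos", "frame"],     # illustrative property names
    "values": ["NN", "n:x"],
    "anchors": [{"from": 4, "to": 7}],  # sub-string indices into `input`
}

# Zip the parallel lists into an attribute-value matrix; a list of pairs
# (rather than a dict) preserves duplicate properties, which are in
# principle allowed.
avm = list(zip(node.get("properties", []), node.get("values", [])))
print(avm)   # [('pos', 'NN'), ('frame', 'n:x')]
```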
Finally, the edge objects in the top-level `edges` list all have two integer-valued properties, `source` and `target`, which encode the nodes at which the edge starts and ends, respectively. All edges in the MRP collection further have a (string-valued) `label` property, although formally this is considered optional. Parallel to graph nodes, edges can carry framework-specific `attributes` and `values` lists; in MRP 2019, only the UCCA framework makes use of edge attributes.
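Putting the pieces together, a complete toy graph in the MRP serialization might look as follows (a sketch: all labels, anchors, and metadata values are illustrative, not drawn from the actual graph banks):

```python
import json

# A complete toy graph in the MRP interchange format (illustrative values).
graph = {
    "id": "toy-0",
    "flavor": 0,
    "framework": "dm",
    "version": 0.9,
    "time": "2019-06-01",
    "input": "The dog barked.",
    "tops": [1],
    "nodes": [
        {"id": 0, "label": "dog",
         "anchors": [{"from": 4, "to": 7}]},
        {"id": 1, "label": "bark",
         "anchors": [{"from": 8, "to": 14}]},
    ],
    "edges": [
        {"source": 1, "target": 0, "label": "ARG1"},
    ],
}
print(json.dumps(graph))   # one line, ready for a JSON Lines stream
```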