Public Release of Open MRP
In November 2021, we are (finally) releasing those parts of the MRP 2019 and 2020 data sets that can be shared publicly, which include the training, validation, and evaluation splits for all EDS, DRG, and UCCA graphs, as well as the Czech PTG graphs. Please watch this page for further download instructions.
English Data
The table below summarizes the English training, validation, and evaluation data provided for the cross-framework track of the shared task. The task operates under what is at times called a closed training regime, i.e. participants are constrained in which additional data and pre-trained models are legitimate to use in system development; see below. While some of the semantic graph frameworks in the task continue to evolve and periodically release revised and extended data, we anticipate that these selections will provide stable reference points for empirical comparison for at least a few years after the task.
| | EDS | PTG | UCCA | AMR | DRG |
| --- | --- | --- | --- | --- | --- |
| Training Data | | | | | |
| Text Type | newspaper | newspaper | mixed | mixed | mixed |
| Sentences | 37,192 | 42,024 | 6,872 | 57,885 | 6,605 |
| Tokens | 725,165 | 861,719 | 145,536 | 915,791 | 36,394 |
| Validation Data | | | | | |
| Text Type | mixed | mixed | mixed | mixed | mixed |
| Sentences | 3,302 | 1,664 | 1,585 | 3,560 | 885 |
| Tokens | 55,360 | 33,994 | 22,085 | 57,542 | 4,473 |
| Evaluation Data | | | | | |
| Text Type | mixed | newspaper | wikipedia | mixed | mixed |
| Sentences | 4,040 | 2,507 | 600 | 2,457 | 898 |
| Tokens | 58,406 | 49,228 | 15,405 | 42,852 | 4,913 |
The training data for EDS and PTG draws from a homogeneous source: the venerable WSJ text first annotated in the Penn Treebank (PTB), specifically Sections 00–20. As a common point of reference, a small sample of WSJ sentences annotated in all five frameworks is available for public download.
UCCA training annotations cover web review text from the English Web Treebank and English Wikipedia articles on celebrities. While UCCA structures are in principle not confined to a single sentence (about 0.18% of edges cross sentence boundaries), passages are split into individual sentences, discarding the relations between them, to create a uniform setting across the frameworks.
AMR annotations are drawn from a wide variety of texts, with the majority of sentences coming from on-line discussion forums. The training corpus also contains newswire, folktales, fiction, and Wikipedia articles.
The texts annotated in the DRG framework are sourced from a wide range of genres, including Tatoeba, News-Commentary, Recognizing Textual Entailment, Sherlock Holmes stories, and the Bible.
Because some of the semantic graph banks involved in the shared task were originally released by the Linguistic Data Consortium (LDC), we rely on the LDC to distribute the training data to participants under no-cost evaluation licenses. Registration for the task will be a prerequisite for data access. Upon completion of the competition, we will package all task data (including system submissions and evaluation results) for general release by the LDC, as well as make available those subsets that are copyright-free for public, open-source download.
Additional Languages
Transcending its 2019 predecessor shared task, MRP 2020 introduces an additional track on cross-lingual meaning representation parsing. This track provides training and evaluation data in one additional language for four of the five frameworks represented in the English-only cross-framework track (but regrettably not EDS), albeit with a different language for each framework (owing to the scarcity of gold-standard semantic annotations across languages). Cross-lingual training data will be made available to task participants toward the end of May 2020.
| | PTG | UCCA | AMR | DRG |
| --- | --- | --- | --- | --- |
| Language | Czech | German | Chinese | German |
| Training Data | | | | |
| Text Type | newspaper | mixed | mixed | mixed |
| Sentences | 43,955 | 4,125 | 18,365 | 1,157 |
| Tokens | 637,084 | 81,915 | 428,055 | 7,479 |
| Evaluation Data | | | | |
| Sentences | 5,476 | 444 | 1,713 | 444 |
| Tokens | 79,464 | 8,714 | 39,228 | 1,970 |
Companion Data
At a technical level, training (and evaluation) data is distributed in two formats: (a) as sequences of ‘raw’ sentence strings and (b) in pre-tokenized, PoS-tagged, and lemmatized form. For the latter, we provide high-quality morpho-syntactic analyses to participants, obtained by training a state-of-the-art dependency parser (the most recent development version of UDPipe; Straka, 2018) on the union of available syntactic training data for each language and using jack-knifing (where required) to avoid overlap of the morpho-syntactic training data with the texts underlying the semantic graph banks of the task; see the sketch below. In the context of MRP 2020, these parser outputs are referred to as morpho-syntactic companion trees. Whether merely as a source of high-quality OntoNotes-style tokenization (the convention also used in Universal Dependencies) or as a vantage point for approaches to meaning representation parsing that start from explicit syntactic structure, we hope this optional resource will offer value to the community in its own right. The underlying parsing models and software will become publicly available upon completion of the shared task.

Additionally, the companion package will include automatically generated reference anchorings (commonly called ‘alignments’ in AMR parsing) for the English AMR graphs in the training data (obtained from the JAMR and ISI tools of Flanigan et al., 2016, and Pourdamghani et al., 2014), as well as companion anchorings for the English and German DRG annotations.
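As a sketch of the jack-knifing scheme mentioned above (this is not the organizers' actual pipeline; the `train_parser` and `parse` callables are hypothetical stand-ins for UDPipe training and inference), each fold of the corpus is annotated by a model trained on the remaining folds, so no sentence is ever analyzed by a model that saw it during training:

```python
def jackknife(sentences, k, train_parser, parse):
    """Annotate each fold with a model trained on the other k-1 folds."""
    folds = [sentences[i::k] for i in range(k)]      # round-robin split
    annotated = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_parser(train)                  # model never sees fold i
        annotated.extend(parse(model, held_out))     # annotate held-out fold
    return annotated
```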
For reasons of comparability and fairness, the MRP 2020 shared task constrains which additional data or pre-trained models (e.g. corpora, word embeddings, lexica, or other annotations) can legitimately be used besides the resources distributed by the task organizers. The guiding principle is that all participants should be able to use the same range of data. However, the organizers expect to keep such constraints to the minimum required and invite participants to suggest relevant data or models. To make precise which resources can be used in system development in addition to the data provided by the task organizers, there is an official ‘white-list’ of legitimate resources. The organizers welcome suggestions for additions to the white-list; if you anticipate wanting to use resources that are not currently on the MRP white-list, please contact the organizers no later than June 15, 2020. The list will be closed and frozen after that date.
Evaluation Data
For all five frameworks, there will be held-out (‘unseen’) test sets, for which only the parser inputs are made available to participants at the start of the evaluation phase. For EDS, PTG, and UCCA (where training data is relatively homogeneous), the test data will comprise both ‘in-domain’ and ‘out-of-domain’ text, i.e. sentences that are either abstractly similar or dissimilar to the text types represented in the training data. Furthermore, the task organizers will prepare a new (small) test set with gold-standard annotations in all frameworks. The instructions for prospective participants provide further information on the nature and scope of evaluation data for MRP 2020.
The evaluation data will be published in the same file format as the training and companion data, viz. the JSON-based uniform MRP interchange format. The target graphs (i.e. the `nodes`, `edges`, and `tops` fields) will of course not be available until completion of the evaluation period, but high-quality tokenization, PoS tags, lemmatization, and syntactic dependency trees will be provided for the evaluation data in the same manner as through the morpho-syntactic companion trees for the training data.
Uniform Graph Interchange Format
Besides differences in anchoring, the frameworks also vary in how they label nodes and edges, and in the degree to which they allow multiple edges between two nodes, multiple outgoing edges with the same label, or multiple instances of the same property on a node. Node labels in Flavor (0) graphs (present in the MRP 2019 task but not in 2020) typically are lemmas, optionally combined with a (morpho-syntactic) part of speech and a (syntactico-semantic) sense or frame identifier. Node labels in the other graph flavors tend to be more abstract, i.e. are interpreted as concept or relation identifiers (though for the vast majority there is, of course, a systematic relationship to lemmas, lexical categories, and (sub-)senses). Graph nodes in UCCA are formally unlabeled, and anchoring is used to relate the leaf nodes of these graphs to input sub-strings. Edge labels, in contrast, in all cases come from a fixed and relatively small inventory of (semantic) argument names, though there is stark variation in label granularity (ranging from about a dozen labels in UCCA to around 90 and 100 in PTG and AMR, respectively). For the shared task, we have for the first time repackaged the five graph banks into a uniform and normalized abstract representation with a common serialization format.
The common interchange format for semantic graphs implements the abstract model of Kuhlmann & Oepen (2016) as a JSON-based serialization for graphs across frameworks. The format describes general directed graphs, with structured node and edge labels and optional anchoring and ordering of nodes. JSON is easily manipulated in all programming languages and offers parser developers the option of ‘in situ’ augmentation of the graph representations from the task with system-specific additional information, e.g. by adding private properties to the JSON objects. The MRP serialization is based on the JSON Lines format, where a stream of objects is serialized with line breaks as the separator character.
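Because each graph occupies exactly one line, MRP files can be streamed one object at a time with nothing more than a standard JSON library; a minimal sketch (the file name here is hypothetical):

```python
import json

# Stream an MRP file (JSON Lines): one complete graph per line.
with open("training.mrp", encoding="utf-8") as stream:
    for line in stream:
        graph = json.loads(line)
        print(graph["id"], graph["framework"], len(graph["nodes"]))
```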
Each MRP graph is represented as a JSON object with top-level properties `tops`, `nodes`, and `edges`; these are discussed in more detail below. Additionally, the `input` property on all graphs presents the ‘raw’ surface string corresponding to the graph; thus, parser inputs for the task are effectively assumed to be sentence-segmented but not pre-tokenized. Further information about each graph is provided through the properties `id` (a string), `flavor` (an integer in the range `0`–`2`), `framework` (a string), `version` (a decimal number), and `time` (a string in YYYY-MM-DD form, encoding when the graph was serialized). Optionally, graphs can use string-valued `provenance` and `source` properties to record metadata about the underlying resource from which the MRP encoding has been derived. The `nodes` and `edges` values on graphs are each list-valued, but the order among list elements is only meaningful for the `nodes` of Flavor (0) graphs.
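Schematically, the top level of a graph object thus combines identifying metadata with the graph proper; in the hypothetical skeleton below (all values are illustrative), the `nodes` and `edges` lists are left empty and are fleshed out over the remainder of this section:

```json
{"id": "1000", "flavor": 1, "framework": "eds",
 "version": 1.1, "time": "2020-06-01",
 "input": "The dog barked.",
 "tops": [2], "nodes": [], "edges": []}
```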
Node objects have an obligatory `id` property (an integer) and optional properties called `label`, `properties` and `values`, as well as `anchors`. The `label` (a string) has a distinguished status in evaluation; the `properties` and `values` are both list-valued, such that elements of the two lists correspond by position. Together, the two lists present a framework-specific, non-recursive attribute–value matrix (where duplicate properties are in principle allowed). The `anchors` list, if present, contains pairs of `from`–`to` sub-string indices into the `input` string of the graph.
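For concreteness, a hypothetical node object might look as follows (the label, property, and index values are illustrative only): its parallel `properties` and `values` lists encode a single attribute–value pair, and its `anchors` tie the node to two discontinuous sub-strings of the `input`:

```json
{"id": 3,
 "label": "_look_v_up",
 "properties": ["pos"],
 "values": ["VBD"],
 "anchors": [{"from": 8, "to": 14}, {"from": 20, "to": 22}]}
```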
Finally, the edge objects in the top-level `edges` list all have two integer-valued properties, `source` and `target`, which encode the start and end nodes, respectively, to which the edge is incident. For all frameworks except DRG, all edges in the MRP collection further have a (string-valued) `label` property, although formally this is considered optional. Parallel to graph nodes, edges can carry framework-specific `attributes` and `values` lists; in MRP 2020, only the PTG and UCCA frameworks make use of edge attributes. Starting in June 2020, version 1.1 of the MRP serialization also (optionally) allows `id` and `anchors` fields on edges and introduces a third, order-coded `anchorings` array on nodes (to record anchors for individual node properties, separately from the node anchoring at large).
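Fleshing out the hypothetical running example from above, a complete graph object could then be serialized as the following single JSON line (pretty-printed here for readability; all labels and anchor indices are illustrative rather than drawn from the actual EDS graph bank):

```json
{"id": "1000", "flavor": 1, "framework": "eds",
 "version": 1.1, "time": "2020-06-01",
 "input": "The dog barked.",
 "tops": [2],
 "nodes": [{"id": 0, "label": "_the_q", "anchors": [{"from": 0, "to": 3}]},
           {"id": 1, "label": "_dog_n_1", "anchors": [{"from": 4, "to": 7}]},
           {"id": 2, "label": "_bark_v_1", "anchors": [{"from": 8, "to": 14}]}],
 "edges": [{"source": 0, "target": 1, "label": "BV"},
           {"source": 2, "target": 1, "label": "ARG1"}]}
```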
Graph Analysis Software
For format conversion, graph analysis, visualization, and evaluation tasks in the MRP 2020 context, we provide the mtool software (the Swiss Army Knife of Meaning Representation), which is hosted in a public GitHub repository to stimulate community engagement.
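By way of illustration (the exact flags and file names below are recalled from the mtool documentation and should be checked against the repository README), scoring a system output against gold-standard graphs with the cross-framework MCES metric might look like:

```
./main.py --read mrp --score mces --gold gold.mrp system.mrp
```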