Uniform Graph Interchange Format

Besides differences in anchoring, the frameworks also vary in how they label nodes and edges, and in the degree to which they allow multiple edges between two nodes, multiple outgoing edges with the same label, or multiple instances of the same property on a node.  Node labels for Flavor (0) graphs are typically lemmas, optionally combined with a (morpho-syntactic) part of speech and a (syntactico-semantic) sense or frame identifier.  Node labels for the other graph flavors tend to be more abstract, i.e. they are interpreted as concept or relation identifiers (where for the vast majority, of course, there too is a systematic relationship to lemmas, lexical categories, and (sub-)senses).  Graph nodes in UCCA are formally unlabeled, and anchoring is used to relate leaf nodes of these graphs to input sub-strings.  Conversely, edge labels in all cases come from a fixed and relatively small inventory of (semantic) argument names, though there is stark variation in label granularity (ranging from about a dozen in UCCA to around 90 and 100 in PSD and AMR, respectively).  For the shared task, we have for the first time repackaged the five graph banks into a uniform and normalized abstract representation with a common serialization format.

The common interchange format for semantic graphs implements the abstract model of Kuhlmann & Oepen (2016) as a JSON-based serialization for graphs across frameworks.  This format describes general directed graphs, with structured node and edge labels, and optional anchoring and ordering of nodes.  JSON is easily manipulated in all programming languages and offers parser developers the option of ‘in situ’ augmentation of the graph representations from the task with system-specific additional information, e.g. by adding private properties to the JSON objects.  The MRP serialization is based on the JSON Lines format, where a stream of objects is serialized with line breaks as the separator character.  Each MRP graph is represented as a JSON object with top-level properties nodes and edges; these are discussed in more detail below.  The input property on all graphs presents the ‘raw’ surface string corresponding to this graph; thus, parser inputs for the task are effectively assumed to be sentence-segmented but not pre-tokenized.  Additional information about each graph is provided as properties id (a string), flavor (an integer in the range 0–2), framework (a string), version (a decimal number), and time (a string, encoding when the graph was serialized).
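As a rough sketch of this layout, the following constructs and serializes one minimal MRP-style graph object as a single JSON line; the identifiers, labels, and anchor values here are invented for illustration and do not come from the task data.

```python
import json

# A minimal MRP-style graph object (illustrative values only; real
# graphs carry framework-specific labels and many more nodes/edges).
graph = {
    "id": "20001001",
    "flavor": 1,
    "framework": "eds",
    "version": 0.9,
    "time": "2019-04-10",
    "input": "Pierre Vinken will join the board.",
    "nodes": [
        {"id": 0, "label": "proper_q"},
        # anchors point at sub-string indices of the input property
        {"id": 1, "label": "_join_v_1", "anchors": [{"from": 19, "to": 23}]},
    ],
    "edges": [
        {"source": 1, "target": 0, "label": "ARG1"},
    ],
}

# JSON Lines: one graph per line, with line breaks as the separator.
line = json.dumps(graph)
print(line)
```

Because each graph is one self-contained JSON object per line, a parser can stream through a graph bank without loading the whole file, and can round-trip its own private properties alongside the task-defined ones.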

The nodes and edges values on graphs each are list-valued, but the order among list elements is only meaningful for the nodes of Flavor (0) graphs.  Node objects have an obligatory id property (an integer) and optional properties called label, properties and values, as well as anchors.  The label (a string) has a distinguished status in evaluation; the properties and values are both list-valued, such that elements of the two lists correspond by position.  Together, the two lists present a framework-specific, non-recursive attribute–value matrix (where duplicate properties are in principle allowed).  The anchors list, if present, contains pairs of from–to sub-string indices into the input string of the graph.  Finally, the edge objects in the top-level edges list all have two integer-valued properties: source and target, which identify the start and end nodes of the edge, respectively.  All edges in the MRP collection further have a (string-valued) label property, although formally this is considered optional.  Parallel to graph nodes, edges can carry framework-specific properties and values lists; in MRP 2019, only the UCCA framework makes use of edge properties.
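A reader for this format can be sketched in a few lines: the function below parses a JSON Lines stream, pairs up the parallel properties and values lists on each node, and checks that every edge's source and target reference a node id in the same graph.  The sample record and the helper name `load_mrp` are assumptions for illustration, not part of the official tooling.

```python
import json

def load_mrp(stream):
    """Parse a JSON Lines stream of MRP graphs (sketch; one graph per line)."""
    for line in stream:
        graph = json.loads(line)
        node_ids = set()
        for node in graph.get("nodes", []):
            node_ids.add(node["id"])
            # properties and values are parallel lists forming a flat
            # attribute-value matrix; note that collapsing them into a
            # dict would silently drop the duplicate properties that the
            # format in principle allows.
            node["avm"] = dict(zip(node.get("properties", []),
                                   node.get("values", [])))
        for edge in graph.get("edges", []):
            # source and target must reference node ids in this graph
            assert edge["source"] in node_ids
            assert edge["target"] in node_ids
        yield graph

# An invented Flavor (0) example: lemma labels, a pos property, anchors.
sample = ('{"id": "0", "flavor": 0, "framework": "dm", '
          '"input": "Sun rises.", '
          '"nodes": [{"id": 0, "label": "Sun", "properties": ["pos"], '
          '"values": ["NNP"], "anchors": [{"from": 0, "to": 3}]}, '
          '{"id": 1, "label": "rise", "anchors": [{"from": 4, "to": 9}]}], '
          '"edges": [{"source": 1, "target": 0, "label": "ARG1"}]}')
graphs = list(load_mrp([sample]))
```

Since the file object returned by `open()` iterates over lines, the same function works unchanged on an on-disk graph bank.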

Training Data

The table below summarizes the training data that will be provided for the task.  No additional semantic annotations can be used during system development.  In other words, the task formally operates as what is often called a closed track, i.e. there will be a fixed inventory of data and tools that are legitimate for participants to use (e.g. pre-trained word embeddings, syntactic analyzers, and such).  However, the organizers welcome suggestions for additional ‘companion’ resources to sanction, with a closing date of May 13, 2019.  While some of the semantic graph frameworks in the task continue to evolve and continuously make available revised and extended data, we anticipate that these selections will provide stable reference points for empirical comparison for at least a couple of years following the task.

            DM         PSD        EDS        UCCA       AMR
Text Type   newspaper  newspaper  newspaper  Wikipedia  mixed
Sentences   35,656     35,656     35,656     4,113      57,885
Tokens      802,717    802,717    802,717    124,935    1,054,772

The DM and PSD data sets are annotations over the exact same selection of texts, which for the previous SemEval tasks have been aligned at the sentence and token levels.  As DM was originally derived from EDS, the EDS graphs cover the same texts.  The training data for these frameworks draws from a homogeneous source, the venerable WSJ text first annotated in the Penn Treebank (PTB), notably Sections 00–20.  As a common point of reference, the task organizers have released a sample of 100 WSJ sentences annotated in all five frameworks in early April 2019.

UCCA training annotations are mostly over text from the English Web Treebank and from English Wikipedia articles on celebrities.  While in principle UCCA structures are not confined to a single sentence (about 0.18% of edges cross sentence boundaries), passages are split into individual sentences, discarding inter-relations between them, to create a standard setting across the frameworks.  AMR annotations are drawn from a wide variety of texts, with the majority of sentences coming from on-line discussion forums.  The training corpus also contains newswire, folktales, fiction, and Wikipedia articles.

Because some of the semantic graph banks involved in the shared task have originally been released by the Linguistic Data Consortium (LDC), we will rely on the LDC to distribute the training data to participants under no-cost evaluation licenses.  Registration for the task will be a prerequisite to data access.  Upon completion of the competition, we will package all task data (including system submissions and evaluation results) for general release by the LDC, as well as make available those subsets that are copyright-free for public, open-source download.

Companion Data

At a technical level, training (and evaluation) data will be distributed in two formats, (a) as sequences of ‘raw’ sentence strings and (b) in pre-tokenized, PoS-tagged, and lemmatized form.  For the latter, we will seek to provide premium-quality English morpho-syntactic analyses to participants, by training a state-of-the-art dependency parser on the union of available syntactic training data for English and using jack-knifing (where required) to avoid overlap of morpho-syntactic training data with the texts underlying the semantic graph banks of the task.  At least for approaches to meaning representation parsing that assume explicit syntactic structure as their point of departure, these morpho-syntactic analyses will offer community value in their own right.

Evaluation Data

For all five frameworks, there are established in-domain evaluation sets, which will also serve as test data in the shared task.  Additionally, there are common out-of-domain evaluation sets for DM, PSD, EDS, and UCCA (where training data is mostly homogeneous), and the task organizers will prepare an additional (smallish) test set with gold-standard annotations in all frameworks.

Graph Analysis Software

Last updated: 2019-04-11 (10:04)