Google Summer of Code

The following page is a blog about my Google Summer of Code Project 2019, under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstract".

Week - 3 :- Utilizing the parse tree structure and POS.

16 Jun 2019 » GSoC

In this week of the coding phase, I researched (mainly study) about how the parse tree can be used to identify triplets.

Parse trees

Parse trees are another feature other than POS tagging, by identifying dependency rules between different types of POS and the words themselves. There exist multiple parsers which do so, the most famous of them are Stanford’s coreNLP, and the most recent one being Google’s SyntaxNet.

CoNLL - U mentions 22-24 different types of dependencies which are possible between two words. The intuition is that using these types of dependencies between specific verbs and nouns we should be able to identify the subject, object and predicate quite accurately. But let’s move a step bacl first, how about only using only the structure itself without knowing the dependencies.

conll-u dependencies link

Limitations of hearst patterns

We already know that POS tagging by itself is an important process in being able to identify triplets and we have seen hearst patterns do actually use this quite evidently. However it is clear from hearst patterns that only specific patterns can be identified (the ones which we specify) and these are highly dependent on the fact that our sentence structure is the way we expect it to be, of course this is simply based on intuition.

Key Observations

Of course to use the parse tree, we should be clear about what we are to be looking for. Thus a starter points are to be made. Given am ample set of data for observations, the following were made,

  1. The triplets are formed with a verb as the common ancestor for the two nouns (subject-object)
  2. The verb comes with a prepositon as it child, which can be combined with the verb to make a sensible predicate
  3. The root of the parse tree, if a noun and the immediate child noun, will be the hypernym of the root.
  4. Nouns which are children of Nouns are most likely to be attributes of the parent noun.

Tools

  • Stanford’s coreNLP provides a tool called Treegex, which is combined with coreNLP’s annotating tools link. The tool helps us identify tree specific patterns which we would like to locate in the parse tree being generated. A few example of such patterns include the following
A ,, B   :	A follows B
A , B    : A immediately follows B
A <<, B	 : B is a leftmost descendant of A
  • For POS tagging we will be using Stanford’s coreNLP and SyntaxNet too.
  • We will also use SyntaxNet to help generate the conllu parse tree.

Methods

For identifying triplets we will implement

  • treegex finders which have patterns similar to the observations we have made.
  • an algorithm implemented in the following paper, which uses a treebank parsed output, which stanford coreNLP generates to discover triplets. link
  • finders on the SyntaxNet parse tree generated based on the observations made.

We will compare the following results with the hearst pattern results which we generated last week.

Code and Tests

In our code we implement the following algorithm mentioned in our methods. The link to the algorithm can be found here. The method, finds the subject, predicate and the object along with it’s attributes. The same can be achieved using tregex.

subject = self.find_subject(sentence)
predicate = self.find_predicate(sentence)
object_ = self.find_object(sentence)
# print("triplet - ", subject, predicate, object_)
return (subject, predicate, object_)

This is done for every sentence present in the abstracts, without coreferences, which means each sentence will give us a single triple, if possible.

for sentence in sent_tokenize(value):
        triple = textraction.treebank(sentence)
        triplets.append(triple)

The following is run a randoms subset of dbpedia abstracts, the results are in the form of

abstract-title
(list of tuples which present a triplet extracted from that abstract)

the triplet is of the form -
    ( (subject, [list of attrs]), (predicate, [list of attrs]), (object, [list of attrs]) )

The results for this can be found at this link