Google Summer of Code

The following page is a blog about my Google Summer of Code Project 2019, under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstract".

Week - 1 :- Getting Started.

02 Jun 2019 » GSoC

In the first week of the coding period, I implemented the method to find triplets using hearts patterns. The hearts patterns developed included ‘bornOn’ which has subcategories of month, day and year and similarly ‘diedOn’, which also comprises of the same sub categories.

The code for the hearst patterns can be found at - https://github.com/sahitpj/SyntaxNet-triplets/blob/master/src/hpatterns.py.

The first process in doing so would be to obtain a POS tagging of the sentence(s) for which you would to find hearts patterns. This can be done by passing them through SyntaxNet. (the steps to setting up syntaxnet and using it can be found here - https://github.com/sahitpj/GSoC-codebase/tree/master/syntaxnet-wrapper).

After the following is done. Then we use the conll parsing tool which can help us extract the parse tree in a usable format.

from conllu import parse_single #for parsing a conllu which has a single sentence

data_file = open("sample.conll", "r", encoding="utf-8")
tokenlist = parse_single(data_file)

print(tokenList)

Given that the hearst pattern recognition program has been integrated with the conll parse tool, the same type can then be used to feed into the hearst pattern finder.

The get_noun_chunks function helps identify noun chucks in the sentence which is important for hearst pattern identification. We do not consider parts of speech in this baaic hearst pattern and remember this is the surface POS tagged sentence.

l = tokenlist[0] #first sentence
print(l.get_noun_chunks())

The find_hearstpatterns function then goes through all the defined hearst patterns to determine if there is a match with any of the patterns available. The function then spits out the triplet, which is of the form [subject, object, predicate]. Here the predicate represents the type of relation between the subject and the object. The predicate however will not be directly be present in the sentence.


self.__hearst_patterns = [
            ('(NP_\\w+ (, )?such as (NP_\\w+ ? (, )?(and |or )?)+)', 'first', 'typeOf', 0),
            ('(such NP_\\w+ (, )?as (NP_\\w+ ?(, )?(and |or )?)+)', 'first', 'typeOf', 0),
            ('((NP_\\w+ ?(, )?)+(and |or )?other NP_\\w+)', 'last', 'typeOf', 0),
            ('(NP_\\w+ (, )?include (NP_\\w+ ?(, )?(and |or )?)+)', 'first', 'typeOf', 0),
            ('(NP_\\w+ (, )?especially (NP_\\w+ ?(, )?(and |or )?)+)', 'first', 'typeOf', 0),
            (r'NP_(\w+).*born.*on.* (\d+)? (\w+) (\d+)? ', 'last', 'bornOn', 4),
            (r'NP_(\w+).*(died|passed away).*on.* (\d+)? (\w+) (\d+)? ', 'last', 'diedOn', 4),
        ]


h  = HearstPatterns()
hearst_patterns = h.find_hearstpatterns(path_to_conll_file)

The demo file can be found in collu package [https://github.com/sahitpj/conllu/blob/0e661f02378ef0f757fb8d9a70bde6215770289c/demo.py]. In order to run the following demo.py with your sentence. Make sure to send the required .conll file to the functions.