Google Summer of Code

The following page is a blog about my Google Summer of Code 2019 project under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstract".

Week - 2 :- Understanding Hearst patterns (the naive way)

09 Jun 2019 » GSoC

In this week of GSoC (the second week), I focused on understanding Hearst patterns, the naive way.

The Process

As I previously talked about, Hearst patterns are all about finding patterns in text which can help us (in our case) find triplets. Last week we implemented the non-statistical way of identifying Hearst patterns, which was an overhead view just to get a feel for it; this week was all about diving deeper.

A brief recap of how I am using Hearst patterns: I use regex patterns to capture the specific relations we want to find. To identify these, a random sample of about 20-25 abstracts (these are the DBpedia abstracts taken from this link - http://downloads.dbpedia.org/2016-10/core-i18n/en/long_abstracts_en.ttl.bz2) was chosen to understand what types of patterns we could pull out of these abstracts. Examples of the patterns identified include

_ made|published|developed in _ 
_ made|published|developed by _
_ such as  _
_ was|is a(n) _
_ made|composed of _
_ used by _
_ member of _
_ related|comparable to _ ...

These patterns are then turned into their regex forms so that they can be extracted from the sentences. Since we would like to find such relations between nouns, and not other words, we preprocess the abstracts to identify noun chunks so that they can be matched by our regex. We use SyntaxNet to POS tag our abstracts.
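
As a rough illustration of this preprocessing step (a minimal sketch of my own, assuming the POS-tagged tokens are already available as (word, tag) pairs, however they are produced), noun tokens can be collapsed into NP_-prefixed chunks so that the regex patterns above can match them:

# minimal sketch: mark noun tokens with the NP_ prefix used by our regex patterns
def mark_noun_chunks(tagged_tokens):
    out = []
    for word, tag in tagged_tokens:
        if tag.startswith('NN'):  # NN, NNS, NNP, NNPS - treat the token as a noun chunk
            out.append('NP_' + word)
        else:
            out.append(word)
    return ' '.join(out)

tagged = [('Bronze', 'NNP'), ('is', 'VBZ'), ('composed', 'VBN'),
          ('of', 'IN'), ('copper', 'NN')]
print(mark_noun_chunks(tagged))  # NP_Bronze is composed of NP_copper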

Going in depth

The key to getting the required Hearst patterns is to understand regex itself. Immediately, there are some things which the regex should take care of in order to give meaningful results.

  1. Given the complexity of English sentences and the fact that there will be a lot of other words alongside the relationship words we are looking for, the regex should be able to ignore those other words.
  2. The regex should identify relations within the same sentence, which means that the noun chunks should be picked from a single sentence; intuitively, this should give us a meaningful triplet.
  3. Tackling pronouns.

Tackling the first problem is probably the simplest. Take an example of a regex we would like to match:

pattern = r'NP_(\w+) (composed) of NP_(\w+)'
# the NP_ prefix is so that we pull our patterns out of the noun chunks we talked about earlier.

We make it flexible enough to allow extra words in between by converting it to this form:

pattern = r'NP_(\w+).*?(composed).*?of.*?NP_(\w+)'
# we add .*? so that extra words are allowed in between; the ? makes the quantifier lazy, so as few words as possible are consumed.
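
To see what the ? buys us, here is a small sketch (the sentence below is made up for illustration) comparing the lazy .*? with a greedy .*:

import re

s = 'NP_Bronze is composed of NP_copper and of NP_tin'
lazy = r'NP_(\w+).*?(composed).*?of.*?NP_(\w+)'
greedy_tail = r'NP_(\w+).*?(composed).*of.*?NP_(\w+)'   # .* instead of .*? after 'composed'
print(re.search(lazy, s).groups())         # ('Bronze', 'composed', 'copper')
print(re.search(greedy_tail, s).groups())  # ('Bronze', 'composed', 'tin') - overshoots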

Now coming to the second problem: in order to make sure that matches come from a single sentence, we use a naive method of making sure that the .*? we introduced does not cross over a full stop (.). We thus convert the previous pattern into this form.

pattern = r'NP_(\w+)[^.]*?(composed)[^.]*?of[^.]*?NP_(\w+)'
# we change .*? to [^.]*?, which does not allow full stops, whereas the previous pattern did.
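
A quick sanity check of this behaviour (again, the example sentence is made up):

import re

two_sentences = 'NP_Bronze is an alloy. It is composed of NP_copper'
old = r'NP_(\w+).*?(composed).*?of.*?NP_(\w+)'
new = r'NP_(\w+)[^.]*?(composed)[^.]*?of[^.]*?NP_(\w+)'
print(re.search(old, two_sentences).groups())  # ('Bronze', 'composed', 'copper') - crosses the full stop
print(re.search(new, two_sentences))           # None - the relation is not within a single sentence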

The third problem is tackling pronouns. Given that this week we are exploring naive techniques without the help of the cross-references we could obtain from SyntaxNet, the basic solution is to assume that every pronoun in the abstract stands for our subject. We therefore replace all pronouns with the actual subject of the abstract, so we do not have to deal with them separately. (Pronouns are already POS tagged and can be identified easily.)

# while walking over the POS-tagged tokens: if the tag is PRP (personal pronoun),
# append the abstract's subject as the noun chunk instead of the pronoun itself
elif matches.group(1) == 'PRP':
    noun_chunks.append('NP_' + subject)
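
For completeness, here is a minimal standalone sketch of the same idea (assuming we already know the subject string and have (word, tag) pairs; this is not the project code itself):

def replace_pronouns(tagged_tokens, subject):
    # naive assumption: every pronoun (PRP) in the abstract refers to the subject
    noun_chunks = []
    for word, tag in tagged_tokens:
        if tag == 'PRP':
            noun_chunks.append('NP_' + subject)
        else:
            noun_chunks.append(word)
    return ' '.join(noun_chunks)

print(replace_pronouns([('It', 'PRP'), ('is', 'VBZ'), ('hard', 'JJ')], 'Bronze'))
# NP_Bronze is hard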

Greedy vs Non-Greedy :- Truly understanding Regex

Although a greedy and a non-greedy approach would not make a difference for small sentences, when it comes to long sentences, or sentences with a lot of nouns, it does make a difference. The regex we have been using behaves in what we call here the greedy way: it settles on the first (leftmost) match it can find, so as soon as it finds one part of the pattern, it moves on to finding the next matching part. Take the following for example

pattern = r'NP_(\w+).*?(composed).*?of.*?NP_(\w+)'

The regex first finds an NP_ chunk and, as soon as it finds one, moves on to the next set of words to find, which are composed and of, after which it proceeds to find the next NP_ chunk. This is the greedy approach; however, consider a sentence such as this

NP_xyz .. NP_abc, and also since NP_Bronze is composed of NP_copper ...

In such a sentence (which is still a single sentence), the identified triplet would be (xyz - composedOf - copper), which may occasionally happen to be a valid triplet, but is not the relation this sentence actually expresses; the triplet we want is (Bronze - composedOf - copper). To get that triplet, we modify our regex to make it non-greedy

pattern = r'.*NP_(\w+).*?(composed).*?of.*?NP_(\w+)'
# by adding a .* at the start, the engine consumes as much text as possible before the first capture group, so the NP_ chunk closest to the relation word is captured (the tightest match)

By adding the .* at the start, we make sure the tightest match is chosen; however, the drawback is the amount of computation this requires, which in our runs was orders of magnitude more than the greedy version (see the timing table below). This is a trade-off to consider, and it is not guaranteed to give better results than the greedy version.
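
Putting the two patterns side by side on a sentence like the example above (with made-up filler words so that it runs as-is) shows the difference:

import re

sentence = 'NP_xyz mentions NP_abc, and also since NP_Bronze is composed of NP_copper'

greedy_pattern = r'NP_(\w+).*?(composed).*?of.*?NP_(\w+)'       # matches from the leftmost NP_
non_greedy_pattern = r'.*NP_(\w+).*?(composed).*?of.*?NP_(\w+)' # leading .* pushes the subject forward

print(re.search(greedy_pattern, sentence).groups())      # ('xyz', 'composed', 'copper')
print(re.search(non_greedy_pattern, sentence).groups())  # ('Bronze', 'composed', 'copper')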

Using other parts of speech - Adjectives.

Although we have mostly been talking about nouns throughout this post, adjectives and adverbs can also help us identify triplets, or more specifically, attributes of the subject. However, since these adjectives are mostly attached to a noun, they will not affect the number of triplets being generated, only the quality of the noun chunks being generated. Thus the accuracy of the triplets being identified does not rely on the adjectives.

Testing our methods.

To test the methods described above, we run the following methods on 3 randomly chosen test suites. Each test suite has 1 person, 1 thing and 1 place, to cover the diversity of the triplets which can be found. Note - the abstracts are not all the same size, so the number of triplets we can identify from them will not be the same. We devise the following metric to give us a sense of comparison.

  • We compare each test suite across the different methods; however, we do not compare test suites with each other. At the end we compute a total score for each method.
  • Each method's output is compared against the corresponding DBpedia page, where we count
    • the number of triplets it was able to identify which are already on the DBpedia page (-1)
    • the number of new triplets it was able to identify (-1)
    • the number of triplets it was not able to identify from the abstract (ones which can be seen in the abstract) (+3)
    • the number of triplets it predicted incorrectly (+0)
    • Lastly, and independently, the time taken for each method to execute

The methods which we shall use are the following :-

  1. baseline : does not handle pronouns and does not restrict matches to the same sentence (the naive Hearst pattern)
  2. modified-1 : handles pronouns and also restricts triplets to the same sentence
  3. Non-greedy baseline : same as baseline, but uses the non-greedy approach.
  4. Non-greedy modified-1 : same as modified-1, but uses the non-greedy approach.
  5. Semi-greedy : a combination of the greedy and non-greedy approaches.

The results are represented as score(new_potential_t/identified_existent_ones/t_not_identified/incorrect_t_identified) [t = triplets]
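
As a rough illustration (this is my own reading of the metric above, not code from the project), the total score for one test suite could be computed from these four counts as follows:

# sketch of the total score, assuming the weights listed in the metric above
def score(new_t, identified_existing_t, not_identified_t, incorrect_t):
    return (-1 * identified_existing_t   # triplets also present on the DBpedia page
            - 1 * new_t                  # new triplets not on the DBpedia page
            + 3 * not_identified_t       # triplets visible in the abstract but missed
            + 0 * incorrect_t)           # incorrectly predicted triplets

print(score(0, 3, 3, 4))  # 6, which matches the baseline entry for test suite - 1 below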

Results

Methods/Test      baseline      modified-1    Non-Greedy baseline   Non-Greedy modified-1   Semi
test suite - 1    6(0/3/3/4)    6(0/3/3/6)    10(4/1/5/9)           10(4/1/5/9)             3(4/2/3/4)
test suite - 2    4(1/4/3/3)    6(0/3/3/5)    4(1/4/3/8)            4(1/4/3/8)              4(1/4/3/4)
test suite - 3    11(1/3/5/4)   11(1/3/5/3)   4(4/4/4/7)            4(4/4/4/7)              3(4/5/3/2)



Time is given in seconds.

Methods/Test      baseline   modified-1   Non-Greedy baseline   Non-Greedy modified-1   Semi
test suite - 1    0.718      0.059        385.606               399.363                 399.727
test suite - 2    0.447      0.051        336.161               374.418                 362.735
test suite - 3    0.529      0.057        262.255               279.295                 287.364



link to test run file - https://github.com/sahitpj/GSoC-codebase/blob/master/tests/week-2_test.py
link to test results data - https://github.com/sahitpj/GSoC-codebase/blob/master/tests/week-2_results.txt

Inference

Comparing greedy vs non-greedy, the greedy methods are almost equivalent to the non-greedy methods at identifying triplets that are already there; however, the non-greedy methods are better at identifying new triplets, and are also more susceptible to identifying wrong triplets.

Clearly our semi-greedy method, which is a combination of the two approaches discussed, brings the best results, along with a better chance of identifying new triplets and the least susceptibility to identifying incorrect triplets. However, the major trade-off is the time taken to find them. Thus we have identified a decent naive method for Hearst pattern recognition.