Google Summer of Code

The following page is a blog about my Google Summer of Code 2019 project under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstracts".

Week 4: Benchmarks and utilising dependencies

23 Jun 2019 » GSoC

In this week of the coding phase (the last week of Phase 1), I looked into benchmarking methods and understanding dependencies.

Benchmarking our methods

Given our goal of obtaining highly trustable triplets from a given abstract, it is important to be able to benchmark or compare our results against a certain standard, so that we know which methods work better. This can be done in multiple ways:

  • Creating our own test suites
  • Using online datasets

Test suites can be created, such as the following:

Asexual_reproduction,Simonides_of_Ceos,Hawaii
1
('Asexual_reproduction', 'hypernym', 'reproduction')
('agamogenesis', 'hypernym', 'reproduction')
('archaebacteria', 'hypernym', 'organisms')
('eubacteria', 'hypernym', 'organisms')
('protists', 'hypernym', 'organisms')
('conjugation', 'hypernym', 'lateral_gene_transfer')
('transformation', 'hypernym', 'lateral_gene_transfer')
('transduction', 'hypernym', 'lateral_gene_transfer')
2
('Simonides', 'hypernym', 'poet')
('Simonides', 'bornAt', 'Ioulis')
('Simonides_of_Ceos', 'associatedWith', 'epitaphs')
3
('Hawaii', 'hypernym', 'state')
('Hawaii', 'locatedOn', 'Oceania')
('Hawaii', 'comprisedOf', 'islands')
('Hawaii', 'encompasses', 'archipelago')
('Hawaii', 'called', 'Big_Island')
('Hawaii', 'capital', 'Honolulu')
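As an aside, a small reader for this format could map each numbered section back to its article. This is a hypothetical sketch (the function load_testsuite is illustrative, not part of the project):

import ast

def load_testsuite(path):
    '''
    Read a test suite file: the first line lists the article names,
    and each numbered section that follows lists the expected triplets
    for the corresponding article.
    '''
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    articles = lines[0].split(',')
    expected = {name: [] for name in articles}
    current = None
    for line in lines[1:]:
        if line.isdigit():
            current = articles[int(line) - 1]  # section number -> article
        else:
            expected[current].append(ast.literal_eval(line))
    return expected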

However, such a test suite is still susceptible to an error margin (the other test suites developed can be found at this link). A better method is to use external datasets which have already been tested and are well defined; in our case, however, most available datasets cover only hypernyms, such as the following Facebook hypernymy suite - link

This can be integrated into our models to help us ensure better testing:


from GSoC2019.hypernymysuite.hypernymysuite.base import HypernymySuiteModel

# Assumed example definition; the project's actual list of helping-word
# predicates that signal hypernymy is defined elsewhere.
HYPERNYMY_PREDICATES = {'is', 'are', 'was', 'were'}

class FBHypernymBench(HypernymySuiteModel):
    def get_hypernyms(self, triplets):
        '''
        Triplets have the form (subject [attrs], predicate [attrs], object [attrs]).
        In order to identify hypernyms, we find triplets whose predicate is a helping word.
        '''
        self.triplets_list = triplets
        hypernyms = list()
        clean_hypernyms = list()
        for triplet in triplets:
            predicate = triplet[1]
            if predicate[0] in HYPERNYMY_PREDICATES:
                # keep both the full (word, attrs) pair and the bare words
                hypernym = (triplet[0], triplet[2])
                clean_hypernym = (triplet[0][0], triplet[2][0])
                hypernyms.append(hypernym)
                clean_hypernyms.append(clean_hypernym)
        self.hypernyms = hypernyms
        self.clean_hypernyms = clean_hypernyms

    def predict(self, hypo, hyper):
        # The suite's base class expects predict(hypo, hyper) to return a
        # score; a minimal completion scores a pair 1.0 if it was extracted.
        return 1.0 if (hypo, hyper) in self.clean_hypernyms else 0.0
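For illustration, a hypothetical driver for this benchmark could look like the following (the variable extracted_triplets is made up for the example):

# Illustrative triplets in the (word, attrs) shape the class expects.
extracted_triplets = [
    (('agamogenesis', 'NN'), ('is', 'VBZ'), ('reproduction', 'NN')),
]

bench = FBHypernymBench()
bench.get_hypernyms(extracted_triplets)
print(bench.predict('agamogenesis', 'reproduction'))  # -> 1.0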

(the test results are not available for the time being, because one of the datasets needs to be obtained separately)

Another dataset, which again covers only hypernyms and not complete triplets, is the following - link

Looking at dependencies for triplet extraction

Till now we have looked at methods which use the parse tree structure and POS tags to determine triplets; we now look at dependencies as a way to extract triplets. (Although we talk about dependencies, we will still require POS tags and the parse tree structure to some extent to help us.)

Though the process is not direct, we will be using intuitive techniques to help us extract triplets. Given that we are to find both hypernyms and triplets which have a predicate, hypernyms are rather easier to discover, as they are simply noun-to-noun relations. Given a limited number of nouns, we have a limited number of noun pairs to take care of, as described in the following page - link
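For context, dependency tuples of the shape used below can be produced, for example, with NLTK's CoreNLP wrapper. This is a sketch of one possible setup, assuming a Stanford CoreNLP server running locally; the project's actual parser configuration may differ.

from nltk.parse.corenlp import CoreNLPDependencyParser

# Assumes a CoreNLP server is listening on localhost:9000.
parser = CoreNLPDependencyParser(url='http://localhost:9000')

parse, = parser.raw_parse('Astatine is a radioactive chemical element.')
dependencies = list(parse.triples())
# yields tuples like (('element', 'NN'), 'nsubj', ('Astatine', 'NNP'))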

Thus, identifying noun-to-noun dependencies as a BFS problem between tuples such as

dependencies = [(('element', 'NN'), 'nsubj', ('Astatine', 'NNP')), (('element', 'NN'), 'cop', ('is', 'VBZ')), (('element', 'NN'), 'det', ('a', 'DT')), ... ]

could give us hypernyms:

    def short_relations(self, dependencies, width):
        '''
        width is the maximum number of nodes allowed between the
        source and destination nouns
        '''
        direct_relations = list()
        short_relations = list()
        for connection in dependencies:
            node_1 = connection[0]
            node_2 = connection[2]
            # direct relation: both ends of the dependency edge are nouns
            if node_1[1] in self.Constants.NOUNS and node_2[1] in self.Constants.NOUNS:
                direct_relations.append(connection)
            # ... (BFS over longer paths, up to `width` intermediate
            # nodes, follows)
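To make the elided breadth-first search concrete, here is a minimal standalone sketch of the idea; the names noun_paths and NOUN_TAGS are illustrative, not the project's actual code. It builds an undirected graph from the dependency tuples and searches outward from each noun until other nouns are reached within the allowed width.

from collections import defaultdict, deque

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}  # illustrative noun POS tags

def noun_paths(dependencies, width):
    '''
    Find pairs of nouns connected through at most `width` intermediate
    nodes in the dependency graph.
    '''
    graph = defaultdict(set)
    nouns = set()
    for head, rel, dep in dependencies:
        graph[head].add(dep)
        graph[dep].add(head)  # treat the dependency graph as undirected
        for node in (head, dep):
            if node[1] in NOUN_TAGS:
                nouns.add(node)
    pairs = []
    for source in nouns:
        queue = deque([(source, 0)])  # (node, distance in edges)
        seen = {source}
        while queue:
            node, dist = queue.popleft()
            if node != source and node in nouns:
                # dist - 1 nodes lie between source and node;
                # note: each pair is found once from each direction
                pairs.append((source, node, dist - 1))
            if dist <= width:  # nouns at dist width + 1 are still in range
                for nxt in graph[node] - seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return pairs

On the dependencies shown above, noun_paths would pair ('Astatine', 'NNP') directly with ('element', 'NN'), which is exactly the hypernym relation we expect from the abstract.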

Other types of triplets are a more challenging task, given that we have to find the correct verb which identifies the relationship between two nouns. We could implement a similar method to help us find such verbs: again, between two nouns, we find the set of verbs which occur, and one of them can be selected as the predicate verb.
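A rough sketch of this verb-based idea, again with illustrative names (verb_predicates and the tag sets) rather than the project's code, could collect the nouns attached to each verb and emit the verb as the candidate predicate:

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}  # illustrative POS tag sets
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def verb_predicates(dependencies):
    '''
    For every verb in the dependency tuples, collect its attached
    nouns and emit (noun, verb, noun) candidate triplets.
    '''
    attached = {}
    for head, rel, dep in dependencies:
        if head[1] in VERB_TAGS and dep[1] in NOUN_TAGS:
            attached.setdefault(head, []).append(dep)
    triplets = []
    for verb, nouns in attached.items():
        # every pair of nouns sharing the verb forms a candidate triplet
        for i in range(len(nouns)):
            for j in range(i + 1, len(nouns)):
                triplets.append((nouns[i][0], verb[0], nouns[j][0]))
    return triplets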

Based on these two methods, we develop code and test it on a set of abstracts.

A sample test for hypernyms can be found at this link - link