In this week of the coding phase (the last week of Phase 1), I looked into benchmarking methods and understanding dependencies.
Benchmarking our methods
Given our goal of obtaining highly trustworthy triplets from a given abstract, it is important to be able to benchmark or compare our results against a standard so that we know which methods work better. This can be done in multiple ways:
- Creating our own test suites
- Using online datasets
Although test suites can be created, such as the following:
Asexual_reproduction,Simonides_of_Ceos,Hawaii
1
('Asexual_reproduction', 'hypernym', 'reproduction')
('agamogenesis', 'hypernym', 'reproduction')
('archaebacteria', 'hypernym', 'organisms')
('eubacteria', 'hypernym', 'organisms')
('protists', 'hypernym', 'organisms')
('conjugation', 'hypernym', 'lateral_gene_transfer')
('transformation', 'hypernym', 'lateral_gene_transfer')
('transduction', 'hypernym', 'lateral_gene_transfer')
2
('Simonides', 'hypernym', 'poet')
('Simonides', 'bornAt', 'Ioulis')
('Simonides_of_Ceos', 'associatedWith', 'epitaphs')
3
('Hawaii', 'hypernym', 'state')
('Hawaii', 'locatedOn', 'Oceania')
('Hawaii', 'comprisedOf', 'islands')
('Hawaii', 'encompasses', 'archipelago')
('Hawaii', 'called', 'Big_Island')
('Hawaii', 'capital', 'Honolulu')
it is still susceptible to an error margin (the other test suites developed can be found at this link). As a rough check, the extracted triplets for an abstract could be scored against such a hand-made test suite, as sketched below.
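The following is a minimal sketch of such a comparison, assuming a hypothetical gold set of expected triplets per abstract; the function name and the sample data are my own illustration, not part of the project code.

def score_against_testsuite(predicted, gold):
    '''
    predicted, gold: iterables of (subject, predicate, object) triplets for one abstract.
    Returns (precision, recall) of the extraction against the hand-made test suite.
    '''
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0
    correct = len(predicted & gold)
    return correct / len(predicted), correct / len(gold)


# Hypothetical usage with part of the Hawaii test suite shown above
gold = {('Hawaii', 'hypernym', 'state'), ('Hawaii', 'capital', 'Honolulu')}
predicted = {('Hawaii', 'hypernym', 'state'), ('Hawaii', 'locatedOn', 'Pacific')}
print(score_against_testsuite(predicted, gold))   # (0.5, 0.5)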
A better method would be to use external datasets which have already been tested and are well defined; however, in our case most available datasets cover only hypernyms, such as the following Facebook hypernymy suite - link. This can be integrated into our models to help ensure better testing:
from GSoC2019.hypernymysuite.hypernymysuite.base import HypernymySuiteModel

# Predicates treated as indicating hypernymy (helping words such as 'is'/'are');
# the exact contents of this set are project-specific.
HYPERNYMY_PREDICATES = {'is', 'are', 'was', 'were'}


class FBHypernymBench(HypernymySuiteModel):

    def get_hypernyms(self, triplets):
        '''
        Triplets have the form (subject [attrs], predicate [attrs], object [attrs]).
        To identify hypernyms, we keep triplets whose predicate is a helping word.
        '''
        self.triplets_list = triplets
        hypernyms = list()
        clean_hypernyms = list()
        for triplet in triplets:
            predicate = triplet[1]
            if predicate[0] in HYPERNYMY_PREDICATES:
                # keep both the attributed pair and the plain word-only pair
                hypernym = (triplet[0], triplet[2])
                clean_hypernym = (triplet[0][0], triplet[2][0])
                hypernyms.append(hypernym)
                clean_hypernyms.append(clean_hypernym)
        self.hypernyms = hypernyms
        self.clean_hypernyms = clean_hypernyms

    def predict(self, hypo, hyper):
        # The suite evaluates candidate (hyponym, hypernym) pairs through predict();
        # here we simply score a pair by whether it was extracted from the abstracts.
        return 1.0 if (hypo, hyper) in self.clean_hypernyms else 0.0
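For reference, a hypothetical invocation might look like the following; the triplet data and the assumption that the base class needs no constructor arguments are mine and should be checked against the suite's code.

# Hypothetical triplets in the (subject [attrs], predicate [attrs], object [attrs]) format
triplets = [
    (('Astatine', 'NNP'), ('is', 'VBZ'), ('element', 'NN')),
    (('Astatine', 'NNP'), ('has', 'VBZ'), ('isotopes', 'NNS')),
]

bench = FBHypernymBench()        # assumes HypernymySuiteModel takes no required arguments
bench.get_hypernyms(triplets)
print(bench.clean_hypernyms)     # [('Astatine', 'element')], since 'is' marks hypernymy
print(bench.predict('Astatine', 'element'))   # 1.0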
(The test results are not available for the time being because one of the datasets needs to be obtained separately.)
Another dataset which again covers only hypernyms rather than complete triplets is the following - link
Looking at Dependencies for triplet extraction
Till now we have looked at methods which use the parse tree structure and POS tags to determine triplets; we now look at dependencies to extract triplets. (Although we talk about dependencies, we will still require POS tags and the parse tree structure to some extent.)
Though the process is not direct, we will be using intuitive techniques to help us extract triplets. Since we are looking for hypernyms as well as triplets with a predicate, hypernyms are rather easier to discover, as they are simply noun-to-noun relations. Given a limited number of nouns, we have a limited number of noun pairs to take care of, as described in the following - link
Thus, identifying noun dependencies as a BFS problem between these tuples
dependencies = [(('element', 'NN'), 'nsubj', ('Astatine', 'NNP')), (('element', 'NN'), 'cop', ('is', 'VBZ')), (('element', 'NN'), 'det', ('a', 'DT')), ... ]
could give us hypernyms:
def short_relations(self, dependencies, width):
    '''
    width is the number of nodes allowed between the source and destination
    '''
    direct_relations = list()
    short_relations = list()
    for connection in dependencies:
        node_1 = connection[0]
        node_2 = connection[2]
        # a dependency edge directly linking two nouns is a candidate relation
        if node_1[1] in self.Constants.NOUNS and node_2[1] in self.Constants.NOUNS:
            direct_relations.append(connection)
    ...
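To make the BFS idea concrete, here is a standalone sketch under my own assumptions (it is not the project's actual implementation): the dependency format matches the example above, and NOUN_TAGS is an assumed set of noun POS tags.

from collections import defaultdict, deque

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}   # assumed POS tags treated as nouns


def noun_pairs_within_width(dependencies, width):
    '''
    Treat each dependency (head, relation, dependent) as an undirected edge and
    return noun pairs connected by a path with at most `width` intermediate nodes.
    '''
    graph = defaultdict(set)
    for head, _, dep in dependencies:
        graph[head].add(dep)
        graph[dep].add(head)

    nouns = [node for node in graph if node[1] in NOUN_TAGS]
    pairs = []
    for source in nouns:
        # BFS outward from each noun, tracking the distance in edges
        seen, queue = {source}, deque([(source, 0)])
        while queue:
            node, dist = queue.popleft()
            if node != source and node[1] in NOUN_TAGS:
                pairs.append((source, node))
            if dist < width + 1:           # allow at most `width` intermediate nodes
                for neighbour in graph[node] - seen:
                    seen.add(neighbour)
                    queue.append((neighbour, dist + 1))
    return pairs

With the example dependencies above and width = 0, this returns only the direct pair between 'Astatine' and 'element' (in both orders), which corresponds to the hypernym "Astatine is an element".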
Other types of triplets would be a more challenging task, given that we have to find the correct verb which identifies the relationship between two nouns. We could implement a similar method to find such verbs: again, between two nouns, we find the set of verbs which occur and select one of them as the predicate verb.
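A rough sketch of that idea could look like the following; it is only an illustration, with VERB_TAGS and the fallback behaviour being my own assumptions rather than the implemented method.

VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}   # assumed POS tags treated as verbs


def candidate_predicates(dependencies, noun_1, noun_2):
    '''
    Collect verbs that share a dependency edge with the given nouns; verbs connected
    to both nouns are preferred candidates for the predicate of the triplet.
    '''
    def adjacent_verbs(noun):
        verbs = set()
        for head, _, dep in dependencies:
            if noun in (head, dep):
                other = dep if head == noun else head
                if other[1] in VERB_TAGS:
                    verbs.add(other)
        return verbs

    shared = adjacent_verbs(noun_1) & adjacent_verbs(noun_2)
    return shared if shared else adjacent_verbs(noun_1) | adjacent_verbs(noun_2)

On the example dependencies shown earlier, this would propose ('is', 'VBZ') as the predicate connecting 'Astatine' and 'element'.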
Based on these two methods, we develop code and test it on a set of abstracts.
A sample test for hypernyms can be found at this link - link