The last week concluded the First term of the GSoC coding phase, which I have successfully passed :) ! This week I got to work on developing the final product - A web application and pipelining DBpedia’s spotlight.
DBpedia spotlight
Although extracting triplets is our goal, producing dbpedia annotated triplets, is what our final product must be able to do. This would mean annotating the subject, the object and converting our predicate into a certain property. For now we deal with the annotations of the subject and predicate. We thus create our pipeline class and add this feature into our triplet extraction process.
class Spotlight_Pipeline(object):
def __init__(self):
self.spotlight_config = spotlight.Config()
self.spotlight_address = self.spotlight_config.spotlight_address
def read_annotations(self, annotations):
return [ i['URI'] for i in annotations ]
def annotate_word(self, word):
try:
annotations = spotlight.annotate(self.spotlight_address,
word)
return self.read_annotations(annotations)
except spotlight.SpotlightException:
print("URI not found")
return word
spipe = Spotlight_Pipeline()
.
.
triple = textraction.treebank(sentence)
annotated_triple = ((spipe.annotate_word(triple[0][0]), triple[0][1]), triple[1], (spipe.annotate_word(triple[2][0]), triple[2][1]))
A sample of annotated texts from this pipeline can be found at this link
The WebApp
This is final product which we will be developing. The link to the web-app can be found here. The app can be very easily be deployed onto heroku.
The application is being run using python-flask on the backend and HTML, JS on the frontend. Key things the web-application should be able to do
- take in a descent size text-input
- Be able to extract triplets and annotate them as well (as DBpedia properties), this would require calls to the spotlight API as well.
- The triplet extraction method may also have to make calls to a stanford CORENLP server for annotation.
- finally display these triplets with a confidence value.