Google Summer of Code

The following page is a blog about my Google Summer of Code Project 2019, under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstract".

Week - 11 :- Improvement of the Dependencies method and RDF generation

11 Aug 2019 » GSoC

In this week the dependencies method was improved on and the RDF generation tool was added.

Dependencies Method

Given the results of the baseline test, the dependencies method was improved on in the following week, by better identifying the subjects and objects of the triples. Hypernyms too were checked for validity while processing it. Entites extracted from the text were used for the process

def get_entites(self, sentence):
        doc = nlp(sentence)
        return [ ent.text for ent in doc.ents ]

    def tripletsEntityCheck(self, triplets, sentence):
        """
        Replaces main words with the entire entites they reperesent
        """
        entity_triples = list()
        entities = self.get_entites(sentence)
        for triple in triplets:
            triple = list(triple)
            for entity in entities:
                if triple[0] in entity:
                    triple[0] = entity
                    break
            for entity in entities:
                if triple[2] in entity:
                    triple[2] = entity
                    break
            if triple[0] != triple[2]:
                entity_triples.append(triple)
        return entity_triples

Dependencies with coreferences too has been added as a method for the configuration page.

Configuration page edits

The configuration pages now has two input text boxes. One for adding hearst patterns and the second one for adding dbpedia properties and lexicalizations. These inputs can be used to override the already existent ones. The about page was also updated to add context about the hearst patterns, the pre-defined hearst patterns and the DBpedia properties/lexicalizations.

RDF text

The web-app now allows you to get the RDF text for the triples being generated. This option can be found at the bottom of the results page only if spotlight is enabled in the configuration. Python’s rdflib was used for the following. The library was helpful in serializing the text

def writeRDFtoFile(annotated_triples, triples, destination):
    """
    returns string with the triples in the rdf 'turtle' format. This can be written to file
    """
    rdfGraph = Graph()
    for i in range(len(triples)):
        subject = None
        predicate = None
        obj = None
        """
        subject checking 
        """
        if annotated_triples[i][0]:
            subject = URIRef(annotated_triples[i][0][0])
        else:
            subject = BNode()
            rdfGraph.add( (subject, FOAF.name, Literal(triples[i][0])) )
        .
        .
        .
    rdfGraph.serialize(format='turtle', destination=destination)