Google Summer of Code

The following page is a blog about my Google Summer of Code Project 2019, under DBpedia. My project is "Tool to generate RDF triples from DBpedia abstract".

Week - 9 :- Language supprt and extended configurations.

28 Jul 2019 » GSoC

The last week concluded the second evaluation of my GSoC project, which I successfully completed! This week was extending configuration and adding multi-lingual support.

Extending Configuration

The web-app configuration page now suppports the following -

  • Adding your own hearst patterns (this may be specific to you or your language chosen)
  • lexicalisation of dbpedia properties

Language Supprt

The Web-app now supports three more languages which are

  • German
  • French
  • Spanish

This includes all methods, from hearst patterns to parse-tree and dependencies. This coupled along with spotlight according the language which is to be used can be used to generate triples from the following 5 languages. However default hearst patterns, only are available for English

image

Language support for spotlight was added by deploying docker images for the respective spotlight languages.

English	docker run -i -p 2222:80 dbpedia/spotlight-english spotlight.sh
French	docker run -i -p 2225:80 dbpedia/spotlight-french spotlight.sh
German	docker run -i -p 2226:80 dbpedia/spotlight-german spotlight.sh
Spanish	docker run -i -p 2231:80 dbpedia/spotlight-spanish spotlight.sh

Adding language support to the application meant major code refactoring. The changes included adding configurations for the parsers and dependency parsers.

class GermanDependencyParse(SpacyDependencyParser):
    def __init__(self):
        super().__init__(nlp_german)

class FrenchDependencyParse(SpacyDependencyParser):
    def __init__(self):
        super().__init__(nlp_french)

class SpanishDependencyParse(SpacyDependencyParser):
    def __init__(self):
        super().__init__(nlp_spanish)

class TripleExtraction_Deps_Lang(TripleExtraction_Deps):
    def __init__(self, language):
        super().__init__()
        self.dep_parser = None
        if language.lower() == "german":
            self.dep_parser = GermanDependencyParse()
        elif language.lower() == 'french':
            self.dep_parser = FrenchDependencyParse()
        elif language.lower() == "spanish":
            self.dep_parser = SpanishDependencyParse()
        else:
            raise Exception("Given language is not supported or does not exist. Only english, french, german, spanish are supported")

        def dependency_triplets(self, sentence):
            word_tokenized_sent = word_tokenize(sentence)
            dependencies = self.dep_parser.get_dependencies(sentence)
            return dependencies 

Once additional languages classes were made, these could then be integrated into the existing code so that language models could now be made

    if q_language == 'English':
        text_extraction = TripleExtraction_Deps()
    else:
        text_extraction = TripleExtraction_Deps_Lang(q_language)