The last week concluded the second evaluation of my GSoC project, which I successfully completed! This week was extending configuration and adding multi-lingual support.
Extending Configuration
The web-app configuration page now suppports the following -
- Adding your own hearst patterns (this may be specific to you or your language chosen)
- lexicalisation of dbpedia properties
Language Supprt
The Web-app now supports three more languages which are
- German
- French
- Spanish
This includes all methods, from hearst patterns to parse-tree and dependencies. This coupled along with spotlight according the language which is to be used can be used to generate triples from the following 5 languages. However default hearst patterns, only are available for English
Language support for spotlight was added by deploying docker images for the respective spotlight languages.
English docker run -i -p 2222:80 dbpedia/spotlight-english spotlight.sh
French docker run -i -p 2225:80 dbpedia/spotlight-french spotlight.sh
German docker run -i -p 2226:80 dbpedia/spotlight-german spotlight.sh
Spanish docker run -i -p 2231:80 dbpedia/spotlight-spanish spotlight.sh
Adding language support to the application meant major code refactoring. The changes included adding configurations for the parsers and dependency parsers.
class GermanDependencyParse(SpacyDependencyParser):
def __init__(self):
super().__init__(nlp_german)
class FrenchDependencyParse(SpacyDependencyParser):
def __init__(self):
super().__init__(nlp_french)
class SpanishDependencyParse(SpacyDependencyParser):
def __init__(self):
super().__init__(nlp_spanish)
class TripleExtraction_Deps_Lang(TripleExtraction_Deps):
def __init__(self, language):
super().__init__()
self.dep_parser = None
if language.lower() == "german":
self.dep_parser = GermanDependencyParse()
elif language.lower() == 'french':
self.dep_parser = FrenchDependencyParse()
elif language.lower() == "spanish":
self.dep_parser = SpanishDependencyParse()
else:
raise Exception("Given language is not supported or does not exist. Only english, french, german, spanish are supported")
def dependency_triplets(self, sentence):
word_tokenized_sent = word_tokenize(sentence)
dependencies = self.dep_parser.get_dependencies(sentence)
return dependencies
Once additional languages classes were made, these could then be integrated into the existing code so that language models could now be made
if q_language == 'English':
text_extraction = TripleExtraction_Deps()
else:
text_extraction = TripleExtraction_Deps_Lang(q_language)