How we built Elasticsearch plugins for indexing ontologies

Hello! My colleague Matt has just written about a recent project where he's been building ES plugins for ontology indexing. Hope you find it interesting and the plugins useful - and we welcome feedback, especially hints as to where we've gone wrong and/or better ways to do it.

This is part of a publicly funded project here in the UK to develop better open source search software for bioinformaticians. It's called BioSolr for historical reasons, but we're also using ES. We'll be talking about this at a workshop event near Cambridge next week.

Nice job!

Implementing custom field mappers is definitely one of the most advanced things a plugin can do (aside from custom aggregation methods).

Why not just set up a Lucene analyzer or a token filter for ontology matching? I did something similar when porting the lucene-skos project to ES.

I think the comparison to Solr's UpdateRequestProcessor is quite unfair. It's a different approach.

If it's all about a lack of documentation, why not describe the implementation process in a dedicated blog post? Or open pull requests to add developer notes to the ES doc site? Or submit a new user story? I bet the ES team is always happy about such contributions. I wish I could document more, too, but I'm very lousy at this job, and can't find enough time.

Personally, I am convinced that the (well thought-out and elaborate) open source code of Elasticsearch is the best documentation possible :slight_smile: but that is not much comfort for most beginners.

++ You won your bet Jörg! We are more than happy when users contribute code, doc, tests, stories, whatever... :slight_smile:

One of the reasons why I have never documented the Elasticsearch internals (for example the TransportAction families) is that it is a moving target, and you have no insight into the direction it is traveling in. Every time I dig deep into the code and want to change something via a plugin, I learn that such code might go away.



Thanks for the suggestions.

I'm not sure that writing an analyzer or token filter would help for our purposes. The object of the plugin is to retrieve ontology data and add it to the document, so it's not really a question of tokenising, unless I'm missing your point (entirely possible!).
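To make the distinction concrete, here is a rough sketch in Python of what "retrieve ontology data and add it to the document" means. This is not the plugin's actual code: the in-memory `ONTOLOGY` dict, the example IRIs, the `annotation_uri` field name and the `enrich` function are all made up for illustration. The point is only that new fields are added to the document, rather than the tokens of an existing field being changed.

```python
# Hypothetical in-memory stand-in for an ontology. A real plugin would
# load terms from an OWL file or an ontology service instead.
ONTOLOGY = {
    "http://example.org/onto/0001": {
        "label": "heart disease",
        "synonyms": ["cardiac disease"],
        "parents": ["http://example.org/onto/0000"],
    },
    "http://example.org/onto/0000": {
        "label": "disease",
        "synonyms": [],
        "parents": [],
    },
}


def enrich(document, iri_field="annotation_uri"):
    """Return a copy of the document with ontology data added for its IRI."""
    enriched = dict(document)
    term = ONTOLOGY.get(document.get(iri_field))
    if term is None:
        return enriched  # unknown or missing IRI: leave the document unchanged
    # Add new fields alongside the original one (field names are assumptions).
    enriched[iri_field + "_label"] = term["label"]
    enriched[iri_field + "_synonyms"] = list(term["synonyms"])
    enriched[iri_field + "_parent_labels"] = [
        ONTOLOGY[p]["label"] for p in term["parents"] if p in ONTOLOGY
    ]
    return enriched
```

A token filter would only see the text of a single field, whereas here a document containing just the IRI comes out with separate label, synonym and parent-label fields that can each be searched.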

It is a completely different approach to the UpdateRequestProcessor, which I did make clear in the article. There's no real equivalent in Elasticsearch though, so creating a new field mapping seemed like a good approach.

Regarding documentation, as mentioned elsewhere in this thread, it seems like I'd be documenting a moving target, and the result could end up being more misleading than digging through other people's plugins to see how they approached the problem. I might look at a more detailed blog post in the future, though.