Advance warning: this post is quite long and contains a lot of
questions. Apologies in advance.
I'm currently working with a university helping them to implement a test
suite to further refine some research they have been conducting. Their
research is based around dynamic schema searching. After spending some time
evaluating the various open source search solutions I settled on
elasticsearch as the base platform and I am wondering what the best way to
proceed would be. I have spent about a week looking into the elasticsearch
documentation and the code itself and also reading the documentation of
Lucene but I am struggling to see a clear way forward. (On a side note, I
was getting frustrated by the lack of documentation in the elasticsearch
code. I did a quick grep to find how many classes in the codebase have an
empty class level documentation placeholder. The result was 1378 classes.
Is there any work going on to rectify this?)
The goal of the project is to provide the researchers with a piece of
software they can use to plug in revisions of the searching algorithm to
test and refine. They would like to be able to write the pluggable
algorithm in languages other than Java that are supported by the JVM, like
Groovy, Python or Clojure, but that isn't a hard requirement. Part of that
will be to provide them with a front end to run queries and see output and
an admin interface to add documents to an index. I am comfortable with all
of that thanks to the very powerful and complete REST API. What I am not so
sure about is how to proceed with implementing the pluggable search
The researchers' algorithm requires 4 inputs to function:
- The query term(s).
- A Word (term) x Document matrix across an index.
- A Document x Word (term) matrix across an index.
- A Word (term) frequency list across an index, that is, how many times
each word appears across the entire index.
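To make the three index-derived inputs concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the function name `build_inputs`, the whitespace tokenizer and the sparse dictionary layout are all hypothetical, and nothing here uses elasticsearch or Lucene.

```python
from collections import Counter, defaultdict


def build_inputs(text_events):
    """Build the three index-level inputs from a list of text events (strings)."""
    doc_term = []                    # Document x Word: one Counter per text event
    term_doc = defaultdict(Counter)  # Word x Document: term -> {doc_id: count}
    term_freq = Counter()            # term -> occurrences across the whole index

    for doc_id, text in enumerate(text_events):
        # Naive lowercase/whitespace tokenizer (assumption, not a real analyzer).
        counts = Counter(text.lower().split())
        doc_term.append(counts)
        term_freq.update(counts)
        for term, n in counts.items():
            term_doc[term][doc_id] = n

    return dict(term_doc), doc_term, term_freq


term_doc, doc_term, term_freq = build_inputs(
    ["the cat sat", "the dog sat down", "the cat"]
)
```

Both matrices are stored sparsely here, since most terms appear in only a few text events. Whether the researchers need dense matrices instead is something I would have to confirm with them.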
For their purposes, a document doesn't correspond to an actual real-world
document (they actually call them text events). Rather, for now, it
corresponds to one sentence (having that configurable might also be
useful). I figure the best way to handle this is to break down documents
into their sentences (using Apache Tika or something similar), putting each
sentence in as its own document in the index. I am confident I can do this
in the Admin UI I provide using the mapper-attachments plugin as a starting
point. The downside is that breaking up the document before giving it to
elasticsearch isn't a very configurable way of doing it. If they want to
change the resolution their algorithm works at, they would need to re-add all
documents to the index again. If the index stored the full documents as is
and the search algorithm could choose what resolution to work at per query,
then that would be perfect. I'm not sure whether that is possible, though.
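What I mean by a configurable resolution can be sketched as follows. This is a rough illustration only: `text_events` and `sentences_per_event` are names I made up, and the regex splitter is a naive stand-in for whatever sentence detection would really be used (it will mis-split abbreviations, for example).

```python
import re

# Split on whitespace that follows ".", "!" or "?" (a crude sentence boundary).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def text_events(document, sentences_per_event=1):
    """Break a full document into text events of N sentences each."""
    sentences = [s.strip() for s in _SENTENCE_END.split(document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_event])
        for i in range(0, len(sentences), sentences_per_event)
    ]


doc = "One fish. Two fish! Red fish? Blue fish."
per_sentence = text_events(doc)                       # 4 text events
per_pair = text_events(doc, sentences_per_event=2)    # 2 text events
```

If something like this ran at query time over the stored full documents, changing the resolution would not require re-indexing, at the cost of rebuilding the matrices for each query.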
The next problem is how to get the three inputs they require and pass them
into their pluggable search algorithm. I'm really struggling to see where to
start with this one. It seems from looking at Lucene that I need to provide my
own search/query implementation, but I'm not sure if this is right or not.
There also doesn't seem to be any search plugins listed on the
elasticsearch site, so I'm not even sure if it is possible. The important
things here are that the algorithm needs to operate at the index level with
the query terms available to generate its schema before using the schema to
score each document in the index. From what I can tell, this means that the
scripting interface provided by elasticsearch won't be of any use. The
description of the scripting interface in the elasticsearch guide makes it
sound like a script operates at the document level and not the index level.
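To illustrate the two phases I am describing, here is a small Python sketch. The weighting scheme is a placeholder I invented purely for illustration (it is not the researchers' algorithm), and the function names and data layout are hypothetical. The point is only the shape: one index-level pass that builds a schema from the query, then one pass that scores every document against it.

```python
from collections import Counter


def build_schema(query_terms, term_doc, doc_term, term_freq):
    """Index-level phase: runs once per query and sees the whole index."""
    schema = Counter()
    for q in query_terms:
        for doc_id in term_doc.get(q, {}):
            # Placeholder weighting: terms co-occurring with a query term,
            # normalised by their frequency across the index.
            for term, n in doc_term[doc_id].items():
                schema[term] += n / term_freq[term]
    return schema


def score_documents(schema, doc_term):
    """Document-level phase: score every text event against the schema."""
    scores = [
        (doc_id, sum(schema.get(t, 0.0) * n for t, n in counts.items()))
        for doc_id, counts in enumerate(doc_term)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Tiny hand-built index of three text events.
doc_term = [{"cat": 1, "sat": 1}, {"dog": 1, "sat": 1}, {"dog": 1, "ran": 1}]
term_doc = {"cat": {0: 1}, "sat": {0: 1, 1: 1}, "dog": {1: 1, 2: 1}, "ran": {2: 1}}
term_freq = {"cat": 1, "sat": 2, "dog": 2, "ran": 1}

schema = build_schema(["cat"], term_doc, doc_term, term_freq)
ranking = score_documents(schema, doc_term)
```

A per-document script would only ever see the second function's inner loop, which is why I doubt it can express the first phase.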
Other concerns/considerations are the ability to program this algorithm in
a range of languages (just like the scripting interface) and the ability to
augment what is returned by the REST API for a search to include the schema
the algorithm generated (which I assume means I will need to define my own
Can anybody give me some advice on where to get started here? It seems like
I am going to have to write my own search plugin that can accept scripts as
its core algorithm. The plugin will be responsible for organising the 4
inputs that I outlined earlier before passing control to the script. It
will also be responsible for getting the output from the script and
returning it via its own REST API. Does this seem logical? If so, how do I
get started with doing this? What parts of the code do I need to look at?
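In case it helps to see the shape of what I have in mind, here is a rough Python mock-up of the plugin's control flow. Everything in it is hypothetical: the registry, the `handle_search` function, the JSON request and response fields and the `term_overlap` example algorithm are my own inventions for illustration, not anything that exists in elasticsearch.

```python
import json
from collections import Counter, defaultdict

ALGORITHMS = {}  # hypothetical registry: algorithm name -> callable


def register(name):
    """Decorator that makes an algorithm selectable by name."""
    def decorator(fn):
        ALGORITHMS[name] = fn
        return fn
    return decorator


def gather_inputs(text_events):
    """Organise the index-level inputs (naive whitespace tokenizer)."""
    doc_term = [Counter(text.lower().split()) for text in text_events]
    term_doc, term_freq = defaultdict(dict), Counter()
    for doc_id, counts in enumerate(doc_term):
        term_freq.update(counts)
        for term, n in counts.items():
            term_doc[term][doc_id] = n
    return dict(term_doc), doc_term, term_freq


def handle_search(request_body, text_events):
    """Gather inputs, hand control to the algorithm, return hits plus schema."""
    request = json.loads(request_body)
    algorithm = ALGORITHMS[request["algorithm"]]
    schema, scores = algorithm(request["terms"], *gather_inputs(text_events))
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    hits = [
        {"id": doc_id, "score": score, "text": text_events[doc_id]}
        for doc_id, score in ranked if score > 0
    ]
    return json.dumps({"schema": schema, "hits": hits})


@register("term_overlap")
def term_overlap(terms, term_doc, doc_term, term_freq):
    """Trivial stand-in algorithm: the schema is just the query terms."""
    schema = {t: 1.0 for t in terms}
    scores = {
        doc_id: sum(counts.get(t, 0) for t in terms)
        for doc_id, counts in enumerate(doc_term)
    }
    return schema, scores


body = json.dumps({"algorithm": "term_overlap", "terms": ["cat", "sat"]})
response = json.loads(handle_search(body, ["the cat sat", "the dog sat", "a bird"]))
```

The algorithm only has to return a schema and a score per document; the surrounding code owns input gathering and the response format, which is the split I would like the real plugin to have.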
If you have managed to read down this far then much gratitude to you. If
you can help me at all I'd really appreciate it.