How to create a plugin to add a new text search capability?

My team is looking to piggy-back a custom full-text search capability onto
Elasticsearch 2.1, such that a user can run one query that uses the
boolean, keyword and metadata queries of Elasticsearch and also includes
this new custom search capability at the same time.

Each document to be searched in this fashion would have the typical text,
metadata and analyzed inverted index, and would also include a non-analyzed
blob of precomputed data about that text that this new search needs to do
its work.

Ideally, a query that included boolean, keyword, metadata and custom components
could work at the document level by delegating the Elasticsearch query
part to existing code; only if the document satisfies those criteria
would we use the custom data on the document to compute the custom
search score.

My question is, first, is this possible to do in a way that enables all
of the scaling capabilities of Elasticsearch?

And, second, what are the integration points that a plugin would need
to override to make this happen?

Obviously we would need a custom REST entry point to support an
expanded query format, for both query-then-fetch and scan/scroll, that
includes our new search.

But, while there are plenty of example plugins for analyzers, we haven't
found much at all to help us figure out how to create a plugin like this.

Any detailed guidance you can provide to help would be greatly appreciated!

Bob

Yes.

You'll want to write a plugin that adds a QueryParser. Have a look at the code in this directory, which is a test in Elasticsearch's core that does exactly that. To get plugins working and testable, have a look at the jvm-example plugin in the Elasticsearch source tree (2.x branch). If Maven is your build system then you can pretty much copy what it does. If Gradle is your build system then have a look at Elasticsearch's master branch. Otherwise you are on your own for getting the plugin built; you can download an existing plugin's zip file and just copy what it does.
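
To make that a bit more concrete, here is a rough sketch of what the plugin class could look like on 2.x, where a plugin extends Plugin and registers extra pieces through onModule hooks. The names (custom-search, CustomSearchQueryParser) are made up for illustration, and the exact registration hook has moved around between versions, so treat this as a starting point rather than the definitive API:

```java
import org.elasticsearch.indices.IndicesModule;
import org.elasticsearch.plugins.Plugin;

// Sketch of a 2.x-style plugin that registers a custom query parser.
public class CustomSearchPlugin extends Plugin {

    @Override
    public String name() {
        return "custom-search"; // hypothetical plugin name
    }

    @Override
    public String description() {
        return "Adds a custom_search query that scores hits from precomputed per-document data";
    }

    // Called while Elasticsearch wires up its modules; registering the parser
    // makes a {"custom_search": {...}} clause usable anywhere the query DSL is accepted.
    public void onModule(IndicesModule module) {
        module.registerQueryParser(CustomSearchQueryParser.class);
    }
}
```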

Probably not. If you implement a query then you can reuse all the existing REST machinery. It's pretty easy to add a new REST entry point if you turn out to need one; the delete-by-query plugin in Elasticsearch does it, so you can look there for an example.
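
In case it helps, here is a skeleton of the query parser that the plugin sketch above registers. It assumes the 2.x QueryParser interface (a names() method plus a parse(QueryParseContext) method that reads the clause's JSON and returns a Lucene Query); the clause name and class are hypothetical, and the placeholder body just skips its parameters and matches everything, where a real implementation would build the query that consults your precomputed data:

```java
import java.io.IOException;

import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.QueryParseContext;
import org.elasticsearch.index.query.QueryParser;
import org.elasticsearch.index.query.QueryParsingException;

// Skeleton parser for a hypothetical {"custom_search": {...}} query clause.
public class CustomSearchQueryParser implements QueryParser {

    @Override
    public String[] names() {
        return new String[] { "custom_search" };
    }

    @Override
    public Query parse(QueryParseContext parseContext) throws IOException, QueryParsingException {
        XContentParser parser = parseContext.parser();
        // A real implementation would walk the tokens here, pull out the
        // clause's parameters (field name, query data, weights, ...) and build
        // a Lucene Query whose scorer reads the precomputed per-document data.
        parser.skipChildren(); // placeholder: ignore the clause body
        return new MatchAllDocsQuery();
    }
}
```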

The biggest thing you have to make sure of is that your query is reasonably efficient. Loading the _source is going to work fine for small data sets but doesn't scale well to large ones. I think you have to try it to see what tradeoffs are OK for you. Hopefully this is enough to get you to the point where you can start to play with it and see those tradeoffs. Just make sure to test it on realistic index sizes.

Thanks for the detailed reply! We will dig into the code and references you gave.

I would like a little clarification on the last point about loading the _source for each document
since scaling big is a large part of our goal.

Do we have to read the entire _source to access this one "blob" field in order to produce a score
for the document? Or is there a way to load just the one field we need?

Or, is there some alternate way to store these "blobs" of data that would be more efficient at scale?

Thanks again!

You can store the "blob" in a field; there is no need for _source.

Custom data in documents for scoring is not a good idea. You cannot expect fast scoring if you have to iterate over "blobs" in all the documents of a result set; it would mean that search time depends on result set size. Instead, do some statistics over the indexed terms and run some math, or store numeric information in index payloads beside the indexed terms.

Now you get into the question of the difference between stored fields and doc values. Both are Lucene concepts that Elasticsearch builds on. Stored fields, like _source and "store": true on a field, are stored grouped by document. Doc values, like numbers, are stored column-oriented. Doc values are what powers aggregations and function score queries. They are much faster to fetch per hit because they can be compressed really well.
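
To make the per-hit difference concrete at the Lucene level (Elasticsearch 2.x sits on Lucene 5.x), here is a small sketch of both access paths; the field name is made up:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Two ways to get per-document bytes at scoring/fetch time in Lucene 5.x.
public final class BlobAccess {

    // Doc values: column-oriented, cheap to read for lots of hits.
    static BytesRef blobFromDocValues(LeafReader reader, int docId) throws IOException {
        BinaryDocValues values = reader.getBinaryDocValues("custom_blob");
        return values == null ? null : values.get(docId);
    }

    // Stored fields: loads the whole stored document for the hit and then
    // picks one field out of it, which is what hurts once many hits are touched.
    static BytesRef blobFromStoredField(LeafReader reader, int docId) throws IOException {
        Document doc = reader.document(docId);
        return doc.getBinaryValue("custom_blob");
    }
}
```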

But, like @jprante says, if you have a big blob of data it's generally not going to be fast to explode it to calculate the score, no matter what you do. You certainly can; it's exposed right now in Groovy scripting. But it is very slow and probably a poor choice. Generally you are better off writing a fancy analyzer and using built-in queries where possible. That is why there are lots of examples of writing analyzer plugins.

@jprante also mentioned payloads. They are an option too, and they are well supported by Lucene. They aren't used a ton by Elasticsearch because they require going down to the postings, and those are always slower than just reading the terms list. See the Lucene javadoc for more on that. I don't know a ton about them, though I think @jprante does. Even if it's slower than going to the terms list, it's orders of magnitude faster than blowing open a binary blob from the source.
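
If you do go down the payload route, the encoding side of it is at least simple: Lucene ships a PayloadHelper utility for packing floats into payload bytes, so each indexed term can carry a number that a custom query reads back at scoring time (this is the same float encoding the delimited_payload_filter token filter uses). A tiny round-trip sketch:

```java
import org.apache.lucene.analysis.payloads.PayloadHelper;

// Round trip: pack a float into payload bytes (what a token filter would
// attach to a term) and read it back (what a custom scorer would do).
public final class PayloadRoundTrip {
    public static void main(String[] args) {
        float weight = 0.42f;

        byte[] payload = PayloadHelper.encodeFloat(weight);   // 4 bytes per float
        float decoded = PayloadHelper.decodeFloat(payload, 0); // offset 0 into the payload

        System.out.println(decoded); // prints 0.42
    }
}
```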

Very good information. Thanks Nik and @jprante !

The blob data is all numeric (an array of floats), so I will first explore the "index payloads beside indexed terms" approach.