Can Elastic run complex search using annotated tokens?

Hmm, I have many thoughts, and they may be lower level than you're currently thinking. This may be a meaty Lucene problem. I'm not sure of your level of expertise, so I may cover things you already know.

Nevertheless, here's an outline of two possible solutions I can think of:

1. Finagling Combo Analyzers?

One solution is analysis based, the other is query based. If you want a background on analysis, I happened to just blog about it.

Anyway, long story short, analysis takes text and converts it to tokens. It effectively creates a token stream. And this isn't necessarily linear. Certain steps, like synonyms, inject tokens that overlap in position. So "bark" and "woof" might occupy position 2 in your example.
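To make the overlapping-position idea concrete, here's a toy sketch in plain Python. It is not Lucene or Elasticsearch code; the `analyze` function and the `SYNONYMS` table are made up for illustration, and real analyzers do far more (stemming, stop words, offsets).

```python
# Toy model of analysis: text in, (position, token) pairs out.
# SYNONYMS is a hypothetical lookup table, not a real Elasticsearch setting.
SYNONYMS = {"barks": ["woofs"]}


def analyze(text):
    """Lowercase, split on whitespace, and stack synonyms on the same position."""
    stream = []
    for position, word in enumerate(text.lower().split()):
        stream.append((position, word))
        # Injected synonyms share the position of the original token.
        for synonym in SYNONYMS.get(word, []):
            stream.append((position, synonym))
    return stream


tokens = analyze("The dog barks")
for position, token in tokens:
    print(position, token)
```

Both "barks" and "woofs" come out at position 2, which is what lets a phrase query match either one in that slot.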

Now to bring it to an even more advanced place, there's a set of analyzers known as combo analyzers that emit parallel token graphs for a given piece of text. You might have one token graph that's English text with stemming, synonyms, and other normalization steps turned on. Another parallel graph might have everything turned off and just represent the exact text.

So in this approach you would emit 5 parallel token streams. To get this to work, my first thought is that you may need to finagle your data a bit and write a custom analyzer (custom as in Java code) that emits the token stream to your liking. For example, you might even want to prefix each emitted token with its type, so you'd end up with tokens that look like:

posn 0         posn 1      posn 2
[word_the]     [word_dog]  [word_barks]
[pos_article]  [pos_noun]  [pos_verb]

A simple phrase query for "word_dog pos_verb" would then match "dog" followed by any verb.
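Here's a rough simulation of why that works, again in plain Python rather than real Lucene code. The `build_index` and `phrase_match` helpers are hypothetical; they just mimic a positional index where the prefixed tokens are stacked at the same position and a phrase query checks consecutive positions.

```python
from collections import defaultdict


def build_index(words, pos_tags):
    """Map each prefixed token to the set of positions where it occurs."""
    index = defaultdict(set)
    for position, (word, tag) in enumerate(zip(words, pos_tags)):
        # Both annotations of the same word share one position.
        index[f"word_{word}"].add(position)
        index[f"pos_{tag}"].add(position)
    return index


def phrase_match(index, terms):
    """True if the terms occur at consecutive positions, in order."""
    starts = index.get(terms[0], set())
    return any(
        all(start + offset in index.get(term, set())
            for offset, term in enumerate(terms))
        for start in starts
    )


index = build_index(["the", "dog", "barks"], ["article", "noun", "verb"])
found = phrase_match(index, ["word_dog", "pos_verb"])    # dog, then a verb
missing = phrase_match(index, ["pos_verb", "word_dog"])  # wrong order
print(found, missing)
```

The prefixes keep the annotation layers from colliding, so a word that happens to be spelled "verb" would index as word_verb and never be confused with pos_verb.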

2. Custom Lucene Query?

The other option is to place the sentence in five different fields in Elasticsearch. Say word, pos, etc. Then write a custom Lucene query that can perform position aware search over multiple fields. When a user asks for word:dog (followed by) pos:verb, you'd need to get your hands dirty in Lucene code to collect & score the results yourself. You would need to dig into how Lucene's phrase query works and write a custom plugin for Elasticsearch.
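The matching logic such a query would need looks roughly like the sketch below. This is plain Python standing in for what would really be Java inside a Lucene query; `cross_field_phrase` is a made-up name, and it assumes every field is tokenized so that position N in one field lines up with position N in the others. Scoring is left out entirely.

```python
def cross_field_phrase(doc, clauses):
    """Return start positions where each (field, term) clause matches in sequence.

    doc maps field name -> list of tokens, aligned position by position.
    clauses is a list of (field, term) pairs, one per consecutive position.
    """
    length = min(len(tokens) for tokens in doc.values())
    hits = []
    for start in range(length - len(clauses) + 1):
        if all(doc[field][start + i] == term
               for i, (field, term) in enumerate(clauses)):
            hits.append(start)
    return hits


doc = {
    "word": ["the", "dog", "barks"],
    "pos": ["article", "noun", "verb"],
}
# word:dog (followed by) pos:verb
hits = cross_field_phrase(doc, [("word", "dog"), ("pos", "verb")])
print(hits)
```

The hard part in practice is the position alignment: if any field's analyzer drops or splits tokens differently, the positions drift apart and the match breaks.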

3. Type as Payloads?

I lied, maybe there's a third option. You could encode the type as a payload with the token perhaps? Payloads are a bit of metadata attached to each token that gets indexed. Instead of a multi-field aware phrase query, you'd need a payload aware phrase query.
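As a sketch of what "payload aware" might mean, here's a toy version in plain Python. The `Token` and `Clause` types and the `payload_phrase` function are invented for illustration and don't correspond to any Lucene class; I'm assuming one payload (the part of speech) per token.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    term: str
    payload: str  # e.g. the part-of-speech tag stored alongside the term


class Clause(NamedTuple):
    term: Optional[str] = None     # None means "any term"
    payload: Optional[str] = None  # None means "any payload"


def clause_matches(clause, token):
    """A clause matches if every constraint it sets agrees with the token."""
    return ((clause.term is None or clause.term == token.term)
            and (clause.payload is None or clause.payload == token.payload))


def payload_phrase(tokens, clauses):
    """Return start positions where the clauses match consecutive tokens."""
    return [
        start
        for start in range(len(tokens) - len(clauses) + 1)
        if all(clause_matches(clause, tokens[start + i])
               for i, clause in enumerate(clauses))
    ]


tokens = [Token("the", "article"), Token("dog", "noun"), Token("barks", "verb")]
# "dog" followed by anything whose payload says verb
payload_hits = payload_phrase(tokens, [Clause(term="dog"), Clause(payload="verb")])
print(payload_hits)
```

Compared with the multi-field idea, everything lives in one field, so there's no alignment problem, but a clause like "any verb" can't be looked up by term and has to be checked against payloads at candidate positions.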

This is an interesting problem, and the sort of thing I love to chew on. Don't be shy about emailing me, maybe we can talk through it on a hangout.

Hope that helps, and maybe others can tell you I'm crazy, or my answer might help them think of something even better.

Cheers
-Doug
