Can Elastic run complex search using annotated tokens?


(jeko) #1

Hello,

Is it possible, using Elasticsearch, to run a query that returns all documents whose tokens match:

word:dog (followed-by) type:verb

With documents looking like this:

    {
      "sentence": "the dog barks",
      "tokens": [{
        "word": "the",
        "type": "article"
      }, {
        "word": "dog",
        "type": "noun"
      }, {
        "word": "barks",
        "type": "verb"
      }]
    }

We have a database of about 200k sentences and 2 million tokens.

Thanks!


(eliasah) #2

Apache Lucene does not support part-of-speech tagging, so I don't think this can be done within the scope of an Elasticsearch query.

You'll need to extract the data using Hadoop or Spark, then apply a POS-tagging algorithm with an NLP framework such as StanfordNLP.


(jeko) #3

Thanks Elias.

In my case, tagging is already done manually. I'm looking for a way to run advanced queries.

One option I have is to generate strings with all possible combinations and feed those into Elasticsearch.

Example:

  • word:the word:dog word:barks
  • type:article word:dog word:barks
  • word:the type:noun word:barks
  • type:article type:noun word:barks
  • ...

But I have 5 different types of annotations per token. A quick estimate is that I'll have about 22k possible combinations per sentence. Multiplying the size of my database by 22,000 isn't an acceptable solution.

Interesting link to StanfordNLP.


(eliasah) #4

Can you please share the mapping of your index, so we can see how it's structured? It's very hard to answer query questions without it.

Thanks!


(jeko) #5

Actually the mapping can be whatever works.

I showed the kind of data present in my DB (CouchDB) in the initial question.

    {
      "sentence": "the dog barks",
      "tokens": [{
        "word": "the",
        "type": "article"
      }, {
        "word": "dog",
        "type": "noun"
      }, {
        "word": "barks",
        "type": "verb"
      }]
    }

I can create a new index to solve the problem and reorganize the data in any convenient way.

In the existing system, I fed Elasticsearch full sentences only, which allowed searching the texts without using the tags.


(Doug Turnbull) #6

Hmm, I have many thoughts, and they're maybe lower level than what you're currently thinking. This may be a meaty Lucene problem. I'm not sure of your level of expertise, so I may cover things you already know.

Nevertheless here's an outline of two possible solutions I can think of

1. Finagling Combo Analyzers?

One solution is analysis based, the other is query based. If you want a background on analysis, I happened to just blog about it.

Anyway, long story short, analysis takes text and converts it to tokens. It effectively creates a token stream. And this isn't necessarily linear. Certain steps, like synonyms, inject tokens that overlap in position. So "bark" and "woof" might occupy position 2 in your example.
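For instance, a synonym token filter produces exactly this kind of position overlap. A minimal settings sketch (analyzer and filter names here are made up for illustration):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "bark_syns": {
          "type": "synonym",
          "synonyms": ["bark, woof"]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "bark_syns"]
        }
      }
    }
  }
}
```

With this analyzer, "woof" is injected at the same position as "bark", so a phrase query matches through either token.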

Now to bring it to an even more advanced place, there's a set of analyzers known as combo analyzers that emit parallel token graphs for a given piece of text. You might have one token graph that's English text with stemming, synonyms, and other normalization steps turned on. Another parallel graph might have everything turned off and just represent the exact text.

So in this approach you would emit 5 parallel token streams. To get this to work, at first blush it seems you may need to finagle your data a bit and get a custom analyzer in the mix (custom as in Java code) to emit the token stream to your liking. For example, you might even want to prefix each emitted token with its type, so you'd end up with tokens that look like

posn 0        posn 1     posn 2
[word_the]    [word_dog] [word_barks]
[pos_article] [pos_noun] [pos_verb]

A simple phrase query for word_dog pos_verb would then match "dog" followed by a verb.
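Under that prefixed-token scheme, the query might look like this (the field name `annotated_text` is just a placeholder for wherever you index the combined token stream):

```json
{
  "query": {
    "match_phrase": {
      "annotated_text": "word_dog pos_verb"
    }
  }
}
```

Because both token graphs occupy the same positions, the phrase matcher sees pos_verb one position after word_dog, which is exactly "dog followed by a verb".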

2. Custom Lucene Query?

The other option is to place the sentence in five different fields in Elasticsearch, say word, pos, etc. Then write a custom Lucene query that performs position-aware search over multiple fields. When a user asks for word:dog (followed by) pos:verb, you'd need to get your hands dirty in Lucene code to collect and score the results yourself. You would need to dig into how Lucene's phrase query works and write a custom plugin for Elasticsearch.
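As a sketch, the multi-field layout could start from a mapping like this (modern mapping syntax; field names are illustrative, and the custom query itself still has to be written in Java):

```json
{
  "mappings": {
    "properties": {
      "word": { "type": "text", "analyzer": "whitespace" },
      "pos":  { "type": "text", "analyzer": "whitespace" }
    }
  }
}
```

Indexing "the dog barks" into word and "article noun verb" into pos keeps positions aligned across the fields, so the custom query can walk both postings lists in lockstep.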

3. Type as Payloads?

I lied; maybe there's a third option. You could encode the type as a payload attached to each token. Payloads are a bit of metadata attached to each token that gets indexed. Instead of a multi-field-aware phrase query, you'd need a payload-aware phrase query.
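For the indexing half of this, Elasticsearch ships a delimited-payload token filter (called `delimited_payload` in recent versions, `delimited_payload_filter` in older ones) that splits the annotation off each token at index time. A minimal sketch (filter and analyzer names are made up):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "pos_payload": {
          "type": "delimited_payload",
          "delimiter": "|",
          "encoding": "identity"
        }
      },
      "analyzer": {
        "annotated": {
          "tokenizer": "whitespace",
          "filter": ["pos_payload"]
        }
      }
    }
  }
}
```

Feeding it "the|article dog|noun barks|verb" indexes the terms the/dog/barks with their types stored as payloads. Querying those payloads positionally is the part that still needs the custom payload-aware phrase query.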

This is an interesting problem, and the sort of thing I love to chew on. Don't be shy about emailing me, maybe we can talk through it on a hangout.

Hope that helps. Maybe others can tell you I'm crazy, or my answer might help them think of something even better.

Cheers
-Doug


(jeko) #7

Hello Doug,

Thank you so much for the detailed answer.

My level of expertise is pretty much basic user level. From my understanding, a custom analyzer seems like the most extensible option? It doesn't push too far into the internals of Elasticsearch, and it lets us feed the search engine exactly what we need, am I right? I'll check how that needs to be implemented; it seems like your book covers the subject!

I like the idea of using as much of what Elasticsearch can do as possible, without implementing too much of a custom solution.

The token payload option seems attractive as well. Would that require both a custom analyzer and a custom Lucene query, or is this built in?

We're still only in the "feasibility study" phase. I'll gladly follow up with the solutions I implement, and share some thoughts by email once I see more clearly, if the subject interests you.

Best regards,
JC


(Doug Turnbull) #8

Glad to be of help

Thinking through it again... one problem with the combo analyzer will be specifying the input text, since it needs to work on a single stream of text, while you need to specify 5 parallel streams of tokens. You could go through and annotate each token with something like this:

the|article|...  dog|noun|...

Then you'd tokenize, producing

[the|article|...] [dog|noun|...]

followed by a custom analysis step that breaks the annotations out into position-overlapping tokens:

[the]     [dog]
[article] [noun]

There IS an annotation analyzer plugin here which you might use with a combo analyzer. Even if it doesn't fit your needs, you could fork it and implement something closer to what you want...

seems like your book is covering the subject!

Yeah, we cover custom analyzers for many use cases in some depth in Chapter 4, which is out. I actually didn't write this chapter, but it's my favorite one :). If you want a discount, there's a code you can use (turnbullmu for 38% off).

Cheers!
-Doug


(jeko) #9

This annotation analyzer plugin looks perfect! But I don't understand why I would need a combo analyzer on top of it. Would I?


(Doug Turnbull) #10

Oh, good point, maybe you don't. I didn't get too deep into the combo analyzer :)

