Invalid results with whitespace analyzer and multi_match search


(Tarun Kundhiya) #1

There are only 3 documents in my index:


"hits": [
{
"_index": "test_index",
"_type": "relationship",
"_id": "AWly5Q6mfOHUG_jfP5e8",
"_score": 1,
"_source": {
"fromEntityId": "a323dd-d43de-d43d3d4-d4334",
"toEntityId": "12345",
"score": 1
}
},
{
"_index": "test_index",
"_type": "relationship",
"_id": "AWly5Q6mfOHUG_jfP5e9",
"_score": 1,
"_source": {
"fromEntityId": "a323dd-48534-d43d3d4-hd4738f",
"toEntityId": "123455",
"score": 2
}
},
{
"_index": "test_index",
"_type": "relationship",
"_id": "AWly5Q6nfOHUG_jfP5e-",
"_score": 1,
"_source": {
"fromEntityId": "784hd4-48534-784d43-hd4738f",
"toEntityId": "1234556",
"score": 3
}
}
]

When I analyze this text (two whole keys, space-separated):

{
  "text" : "784hd4-48534-784d43-hd4738f a323dd-48534-d43d3d4-hd4738f",
  "analyzer" : "whitespace"
}

I get these tokens, which is expected:

{
  "token": "784hd4-48534-784d43-hd4738f",
  "start_offset": 0,
  "end_offset": 27,
  "type": "word",
  "position": 0
},
{
  "token": "a323dd-48534-d43d3d4-hd4738f",
  "start_offset": 28,
  "end_offset": 56,
  "type": "word",
  "position": 1
}

But the multi_match search returns nothing:


{
  "query": {
    "multi_match": {
      "query": "784hd4-48534-784d43-hd4738f a323dd-48534-d43d3d4-hd4738f",
      "analyzer": "whitespace",
      "fields": [ "fromEntityId", "toEntityId" ],
      "operator": "or"
    }
  }
}

"hits": [  ]

This should return both of those documents. Please help!


(Gordon Brown) #2

What mapping is in use on the index? You generally shouldn't specify the analyzer at query time; it should be set on the field definition in the mapping. If no analyzer is set in the mapping, text fields are indexed with the standard analyzer, which splits on the "-" characters as well. I suspect that's what's happening in this case. Can you try re-creating your index with the analyzer set to whitespace in the mapping, similar to the example shown here, and re-try the query?
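For reference, a minimal sketch of such a mapping (index name, type name, and fields taken from your documents above; adjust to your actual setup):

```json
PUT /test_index
{
  "mappings": {
    "relationship": {
      "properties": {
        "fromEntityId": { "type": "text", "analyzer": "whitespace" },
        "toEntityId":   { "type": "text", "analyzer": "whitespace" }
      }
    }
  }
}
```

With the analyzer set here, both indexing and searching use the whitespace analyzer by default, so you don't need to specify it in the query.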


(Tarun Kundhiya) #3

Hi Gordon, thanks.

We tried PUTting a custom analyzer and then reindexed all our indices. Now it works without specifying the analyzer while querying.

But is there any way to specify the analyzer at query time?

This doc suggests something like that:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

Thus we believe there should be a way to override the index analyzers and give us the flexibility to use a different analyzer chosen at runtime.

And it is very hard to re-create indices at large scale. Please let us know if there is any solution/workaround for this.


(Gordon Brown) #4

I'm glad that helped out! I think to help clarify why what you were doing before wasn't working, we're going to have to dive into how Elasticsearch indexes text.

Specifying an analyzer at query time is supported, but what's critical to note is that if you specify an analyzer at query time, that analyzer will only be applied to the query, not to the data already in the index.

Here's what I mean by that: When you index a document in Elasticsearch, the text fields in that document are analyzed using either the specified analyzer or the default standard analyzer if no analyzer is specified. The output of the analyzer is a list of tokens, and those tokens are what is used to look up the document.

To use your example: when you indexed the documents into Elasticsearch, the standard analyzer was used, and a value like 784hd4-48534-784d43-hd4738f was tokenized like this:

{
  "tokens" : [
    {
      "token" : "784hd4",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "48534",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "784d43",
      "start_offset" : 13,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hd4738f",
      "start_offset" : 20,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

And those tokens were used to build the inverted index, so you could search for 784d43 and that document would be returned.
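To illustrate, against the standard-analyzed index, a query for just one of those smaller tokens would match (a sketch using your index and field names):

```json
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "784d43",
      "fields": [ "fromEntityId", "toEntityId" ]
    }
  }
}
```

This matches because 784d43 is one of the tokens the standard analyzer stored in the inverted index for that document.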

But when you used the whitespace analyzer in the query, the search string was analyzed like this:

{
  "tokens" : [
    {
      "token" : "784hd4-48534-784d43-hd4738f",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "word",
      "position" : 0
    }
  ]
}

As one token. And that one token, 784hd4-48534-784d43-hd4738f, wasn't in the index (which instead contained the smaller tokens [784hd4, 48534, 784d43, hd4738f] the standard analyzer broke the string into), so no documents were returned. This is why the same analyzer is usually used at index time and at search time: so that the same string generates the same set of tokens.

The analyzer in the mapping controls how data is indexed, and changing it after the fact is very expensive: it is effectively a reindex, as Elasticsearch would have to re-analyze every document and rebuild the index every time the analyzer changes.

If you want to know more, I highly recommend the Inverted Index and Analysis and Analyzers sections of the Elasticsearch Definitive Guide - the guide itself is a bit out of date, but those sections are still highly relevant to how Elasticsearch stores and searches data.


(Gordon Brown) #5

As a follow-up to this, if you want to index the same text multiple ways (e.g. using different analyzers), one way to do this is using multi-fields. Depending on your use case, that may be helpful - you'll still need to set up the analyzers you need ahead of time, but you can switch between them by choosing which sub-field to query.
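A rough sketch of what that mapping could look like (the "ws" sub-field name is just an example; pick whatever analyzers and names fit your use case):

```json
PUT /test_index
{
  "mappings": {
    "relationship": {
      "properties": {
        "fromEntityId": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "ws": { "type": "text", "analyzer": "whitespace" }
          }
        }
      }
    }
  }
}
```

You would then query fromEntityId for standard-analyzed matching, or fromEntityId.ws when you want whole-key, whitespace-analyzed matching, without specifying an analyzer in the query itself.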


(Tarun Kundhiya) #6

Thank you Gordon for the detailed explanation and the multi-fields usage tip.