The same results for different search criteria

Hi guys,
First of all, I would like to say hi to all of you and ask if you are enjoying the rainy days right now too. :slight_smile:
The second thing is our Elasticsearch version:

curl -XGET 'localhost:9100'
{
  "ok" : true,
  "status" : 200,
  "name" : "Autolycus",
  "version" : {
    "number" : "0.90.13",
    "build_hash" : "249c9c5e06765c9e929e92b1d235e1ba4dc679fa",
    "build_timestamp" : "2014-03-25T15:27:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}

Overall, our system works pretty well, but there is a problem with a few specific documents and search queries.

The annotation mapping looks like this:

"1212" : {
    "annotation" : {
      "properties" : {
        "caseNumber" : {
          "type" : "integer"
        },
        "content" : {
          "type" : "string",
          "analyzer" : "ninstitution"
        },
        "institutionId" : {
          "type" : "string"
        },
        "departmentId" : {
          "type" : "string"
        },
        "person" : {
          "type" : "string",
          "analyzer" : "ninstitution"
        },
        "protocol" : {
          "type" : "boolean"
        },
        "protocolId" : {
          "type" : "string"
        },
        "timeOffset" : {
          "type" : "long"
        },
        "year" : {
          "type" : "integer"
        }
      }
    }
  }

The thing I cannot understand is why I get the same results for two completely different queries:

curl -XGET 'http://localhost:9100/1212/_search?pretty=true' -d '{"size" : 1000,
  "query" : {
    "bool" : {
      "must" : [ {
        "field" : {
          "protocolId" : "121210250002027_1043_12_Kor-716_20160516_080329" 
        }
      }, {
        "field" : {
          "protocol" : true
        }
        } ]
    }
  },
  "sort" : [ {
    "timeOffset" : {
      "order" : "asc" 
    }
  } ]
}'
curl -XGET 'http://localhost:9100/1212/_search?pretty=true' -d '{"size" : 1000,
  "query" : {
    "bool" : {
      "must" : [ {
        "field" : {
          "protocolId" : "121210250002027_1043_12_Kor-708_20160801_094045" 
        }
      }, {
        "field" : {
          "protocol" : true
        }
        } ]
    }
  },
  "sort" : [ {
    "timeOffset" : {
      "order" : "asc" 
    }
  } ]
}'

Do you have an idea how to debug this problem? I removed the returned documents by hand (via curl -XDELETE) and then added them again, but it looks like somehow they remain connected to both protocolIds:

  • 121210250002027_1043_12_Kor-708_20160801_094045
  • 121210250002027_1043_12_Kor-716_20160516_080329

Maybe some kind of hashing returns the same values for the aforementioned pair? Is that possible?
PS: I changed the ids a little bit, so these two are only examples.

It's because protocolId is an analyzed string field. Analyzed strings go through an analysis process that tokenizes the string into smaller tokens. Since you haven't specified an analyzer for that field, it is using the default (the Standard Analyzer).

This analyzer splits strings on whitespace, newlines, carriage returns, punctuation and special characters, so those underscores and hyphens will be split points.

That means both strings contain the tokens ["121210250002027", "1043", "12", "Kor"], which is why they match the same documents.

You'll need to map protocolId as a not_analyzed string if you want to compare the ids as single tokens.
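
For illustration, a minimal sketch of what that mapping could look like on 0.90. The index name 1212_v2 is hypothetical (a new index, since the mapping of an existing field can't be changed in place), and only protocolId is shown; the other fields would carry over from your current mapping:

curl -XPUT 'http://localhost:9100/1212_v2' -d '{
  "mappings" : {
    "annotation" : {
      "properties" : {
        "protocolId" : {
          "type" : "string",
          "index" : "not_analyzed"
        }
      }
    }
  }
}'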

You can read more about it here in the Guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_query_with_text
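
With protocolId stored as a single token, a term query then matches the exact id. A sketch against the hypothetical 1212_v2 index above:

curl -XGET 'http://localhost:9100/1212_v2/_search?pretty=true' -d '{
  "size" : 1000,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : { "protocolId" : "121210250002027_1043_12_Kor-716_20160516_080329" }
      }, {
        "term" : { "protocol" : true }
      } ]
    }
  },
  "sort" : [ { "timeOffset" : { "order" : "asc" } } ]
}'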

Also, obligatory: holy wow, that's a really old version of Elasticsearch! You should really upgrade to a newer version :wink:

Thanks for the link. I have read the article and checked the results for the analyze command.

My protocolId consists of two logical parts. For 121210250002027_1043_12_Kor-708_20160801_094045 they are as follows:

  • 121210250002027_1043_12 - which is the exact case to which the protocols are assigned
  • Kor-708_20160801_094045 - which is the specific protocol

Like you said, the default analyzer splits it at the dash, so the resulting parts are:

  • 121210250002027_1043_12_Kor
  • 708_20160801_094045

It looks like the dash is the problem: when it appears, the annotations fetched come from all the protocols whose protocolId shares the same beginning.

Just for the record, I'm posting the analyzer results.
The query

curl -XGET 'http://localhost:9100/1520/_analyze?pretty=true' -d '{ "field": "protocolId", "text": "121210250002027_1043_12_Kor-708_20160801_094045" }'

gives me

{
  "tokens" : [ {
    "token" : "field",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "protocolid",
    "start_offset" : 12,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "text",
    "start_offset" : 26,
    "end_offset" : 30,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "121210250002027_1043_12_kor",
    "start_offset" : 34,
    "end_offset" : 61,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "708_20160801_094045",
    "start_offset" : 62,
    "end_offset" : 81,
    "type" : "<NUM>",
    "position" : 5
  } ]
}

And the second one

curl -XGET 'http://localhost:9100/1520/_analyze?pretty=true' -d '{ "field": "protocolId", "text": "121210250002027_1043_12_Kor-716_20160516_080329" }'

returns:

{
  "tokens" : [ {
    "token" : "field",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "protocolid",
    "start_offset" : 12,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "text",
    "start_offset" : 26,
    "end_offset" : 30,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "121210250002027_1043_12_kor",
    "start_offset" : 34,
    "end_offset" : 61,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "716_20160516_080329",
    "start_offset" : 62,
    "end_offset" : 81,
    "type" : "<NUM>",
    "position" : 5
  } ]
}

I wonder how to check the results for the changed annotation mapping (with the corrected analyzer).
The article you posted states that recreating the index is required, which in my case would mean the 1212 one. But I have millions of annotations like these, so it would be useful to have some kind of emulator/debug view, to be 100% sure that it would help.

Is there a way to estimate how long it would take to recreate the index? I do not have to delete all the documents (annotations), do I?

Oops, you're right... underscores are not tokenized. :slight_smile:

I wonder how to check the results for the changed annotation mapping (with the corrected analyzer).

You can use the Analyze API to test out new analysis combinations and see how they affect test data. On 0.90 the tokenizer and filters are passed as query-string parameters and the body is the raw text to analyze (incidentally, that is also why your earlier outputs contain the extra field, protocolid, and text tokens: the whole JSON body was analyzed as text). For example, you could do:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&pretty=true' -d '121210250002027_1043_12_Kor-716_20160516_080329'

That produces a lowercased version of your id as a single token: 121210250002027_1043_12_kor-716_20160516_080329

And if you set the field to not_analyzed, no analysis takes place, so the token that is indexed is identical to the string that you provide.
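
If you want to preview that single-token behavior before reindexing, the built-in keyword analyzer emits its input unchanged as one token, which mirrors what a not_analyzed field stores:

curl -XGET 'http://localhost:9100/_analyze?analyzer=keyword&pretty=true' -d '121210250002027_1043_12_Kor-716_20160516_080329'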

The article you posted states that recreating the index is required,

Correct. It's not possible to change the analyzer of an existing field, since that would "break" all the existing indexed data. You need to either re-index into a new field with a new analyzer, or re-index into a new index (and if you want to keep the same index name, you need to delete the existing index first).
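
Since 0.90 predates the _reindex API, the usual approach is a scan/scroll search over the old index feeding the _bulk API on the new one. A rough sketch of the loop (the scroll timeout and page size here are arbitrary, and <scroll_id> is a placeholder for the value each response returns):

# 1. Open a scan over the old index
curl -XGET 'localhost:9100/1212/_search?search_type=scan&scroll=5m&size=500' -d '{
  "query" : { "match_all" : { } }
}'

# 2. Page through the hits, re-sending the _scroll_id from each response
curl -XGET 'localhost:9100/_search/scroll?scroll=5m' -d '<scroll_id>'

# 3. Send each page of hits to the new index via the _bulk API,
#    repeating step 2 until a page comes back empty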

This is an old article, but the tips should still apply: Changing Mapping with Zero Downtime (https://www.elastic.co/blog/changing-mapping-with-zero-downtime)

Is there a way to estimate how long it would take to recreate the index? I do not have to delete all the documents (annotations), do I?

Not really, no. Re-indexing is the same as indexing, so it depends on how fast your cluster normally indexes data: hardware, how many documents you have, and the complexity and size of your docs. And no, you don't have to delete the documents themselves; you re-index them into a new index and only drop the old index once the new one is ready.
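
If you want a rough sense of scale before committing, you can count the annotations and time a small test batch first. For example, assuming the index from your mapping above:

curl -XGET 'localhost:9100/1212/_count?pretty=true'

Indexing a few thousand of those documents into a scratch index and timing it gives a reasonable basis for extrapolating to the full reindex.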