Search doesn't find a document that it should when the index is big?

Hi, I have a problem with Elasticsearch 6.8 (I also tried 5.6 and 2.4, since we are upgrading): it can't find the document I want when the index has over 800K documents, but when I create another index with just that one document in it, the document is found. Does anyone know why? The query, settings, and mappings for both indexes are the same.

No reason.

Just a question: do you index the document and then search for it right after?
What exact query are you running?

Hi @dadoonet,

Yes, I indexed the document into my other index, searched for it, and it was found. It couldn't be found in the original index, though. I've also tried the deprecated indices query (but on ES 5) with the same result. This query is my translation of the indices query into a replacement query:

curl 'http://localhost:9200/index_a,index_b/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "match_phrase_prefix": {
                  "email_addresses": {
                    "query": "f-secure"
                  }
                }
              }
            ],
            "filter": [
              {
                "terms": {
                  "_index": [
                    "index_a"
                  ]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match_phrase_prefix": {
                  "email": {
                    "query": "f-secure"
                  }
                }
              }
            ],
            "must_not": [
              {
                "terms": {
                  "_index": [
                    "index_a"
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "size": 100
}
'

I replaced the index_b here with the new index I created with the single document and it was found.

I noticed there was "num_reduce_phases": 2 in the response on my Elasticsearch 5.6 instance, which I couldn't find much documentation about. I wonder if it has anything to do with the issue, since that field is left out in Elasticsearch 6.8?

Could you check if the mapping is strictly the same? Specifically for the field email.

Why are you searching in 2 indices and adding this terms filter on the _index field, with one clause excluding index_a and the other including it?

Hi @dadoonet,

Yeah, both the settings and the mapping are exactly the same, other than the index name. Even with the _analyze API, I found that both indexes' analyzers produce the same result for the email field.

I'm trying to re-create the indices query because I need to run the query against two indices with different settings and mappings. So I use only index_a for the first query clause and index_b for the second clause by excluding index_a, since I specified both indexes in the search URL.
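For reference, the deprecated indices query I'm replacing looked roughly like this on ES 2.x/5.x (a sketch from memory, using the same fields and values as above, not the exact query we ran back then):

```shell
# Deprecated indices query (removed after ES 5.x): "query" runs against the
# listed indices, "no_match_query" runs against every other searched index.
curl 'http://localhost:9200/index_a,index_b/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "indices": {
      "indices": ["index_a"],
      "query": { "match_phrase_prefix": { "email_addresses": { "query": "f-secure" } } },
      "no_match_query": { "match_phrase_prefix": { "email": { "query": "f-secure" } } }
    }
  },
  "size": 100
}
'
```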

I see.

Is there a way to reproduce the case? At least, could you send a recreation script that you are using for index_b? Not sure it will tell anything but may be...

Hmm, I used a custom TUI tool to create the test index. Could I message you the settings and mappings along with the one document?

I should mention that the search finds all the other documents in index_b. It just misses that one document, which I found weird because it can be found in the test index.

Oh, and there is one thing I noticed. The value of the email field in that particular document is the only one with a different format, abc@m4.f-secure.co.jp, while the other documents have values like abc@f-secure.com. However, since it can be found in the test index, I thought it was irrelevant, but it might be a hint to something.

I suspect an analysis problem/mapping problem.

What is the full output of the _analyze API call on both indices? And what is the API call itself?

For the original index, I used this:

curl http://localhost:9200/index_b/_analyze -H 'Content-Type: application/json' -d '
{
  "text": "nao-6@m4.f-secure.co.jp",
  "analyzer": "email"
}
'

which produced:

{
  "tokens": [
    {
      "token": "nao",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "6",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "m4",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "f",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "secure",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "co",
      "start_offset": 18,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "jp",
      "start_offset": 21,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

and for the test index, I used:

curl http://localhost:9200/test/_analyze -H 'Content-Type: application/json' -d '
{
  "text": "nao-6@m4.f-secure.co.jp",
  "analyzer": "email"
}
'

which produced:

{
  "tokens": [
    {
      "token": "nao",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "6",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "m4",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "f",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "secure",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "co",
      "start_offset": 18,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "jp",
      "start_offset": 21,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}

I wonder if the difference in document counts, plus the fact that the value is an outlier, made that document's score in the original index so low that it isn't included? I don't know if that's possible.

I'm also running the Elasticsearch instance with around 100 GB of data on a server with 8 GB of RAM (half of it for heap), if that's any help.

Also, the email field is an array of strings.

Could you try with:

GET /index_b/_analyze
{
  "field" : "email",
  "text" : " abc@m4.f-secure.co.jp"
}
GET /index_a/_analyze
{
  "field" : "email",
  "text" : " abc@m4.f-secure.co.jp"
}

Hi @dadoonet,

Here's the result for index_b:

{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "m4",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "secure.co.jp",
      "start_offset": 9,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

and this is the result for index_a:

{
  "tokens": [
    {
      "token": "abc@m4.f-secure.co.jp",
      "start_offset": 0,
      "end_offset": 21,
      "type": "word",
      "position": 0
    },
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "m",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "4",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 2
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 3
    },
    {
      "token": "secure",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 4
    },
    {
      "token": "co",
      "start_offset": 16,
      "end_offset": 18,
      "type": "word",
      "position": 5
    },
    {
      "token": "jp",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 6
    }
  ]
}

The email field in index_a is called email_addresses, though, so if you want the output for that as well, here it is:

{
  "tokens": [
    {
      "token": "abc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "m4",
      "start_offset": 4,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "secure.co.jp",
      "start_offset": 9,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A way to create the index with its mapping and analyzer for an email field, a document to be indexed and the search query that works on index_b but not on index_a.

Hmm, sorry if I was not clear, but the issue is that the document can be found via search in the test index but not in index_b, which has the exact same settings and mappings as the test index. There is no issue with index_a (no documents are missing from its search results).

The test index contains just the one document (the one that can't be found via search in index_b), while index_b contains over 800K documents and is over 70 GB in size.

So index_b is the problematic index where I cannot find the document via search. After I create a duplicate index based on index_b (called test), I can find the document in test, though it contains only one document as opposed to 800K.

I can't put the script here as the document is too big. I have uploaded it and shared it with you via private message.

Sorry, I made a mistake somewhere. I'll re-upload the script.

EDIT: Never mind, I think there was no mistake.

If you'd like, I could also record a video or perhaps even talk over video chat? It's much easier to demonstrate the issue that way.

EDIT: I forgot the query but it's basically the one above:

GET index_a,test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "match_phrase_prefix": {
                  "email_addresses": {
                    "query": "f-secure"
                  }
                }
              }
            ],
            "filter": [
              {
                "terms": {
                  "_index": [
                    "index_a"
                  ]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match_phrase_prefix": {
                  "email": {
                    "query": "f-secure"
                  }
                }
              }
            ],
            "must_not": [
              {
                "terms": {
                  "_index": [
                    "index_a"
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "size": 100
}

EDIT: Actually, I found that even a simplified query behaves the same:

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure"
      }
    }
  }
}

It can find the document in the test index but not in index_b with its 800K documents.

Hi @dadoonet, do you have any idea what causes this issue? Is there any cause other than the settings and mapping? Or should I file a bug?

I think we still haven't ruled out that there is some difference in the mappings between your test index and your index_b. You haven't yet shared any evidence that the mappings really are identical; mappings can be quite large documents and a very subtle difference could be enough to explain what you're seeing.

Another possible explanation is that the index in question doesn't actually contain the document you're trying to find, or maybe the document was only recently added and hasn't been exposed to searches with a refresh yet.
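Both of those are quick to check with something like this (THE_DOC_ID is a placeholder for the document's real id, and replace _doc with your mapping type if it differs):

```shell
# Make all indexed documents visible to search, then confirm the
# document actually exists in the index by fetching it by id.
curl -X POST 'http://localhost:9200/index_b/_refresh?pretty'
curl 'http://localhost:9200/index_b/_doc/THE_DOC_ID?pretty'   # look for "found": true
```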

There may be more explanations too. I do not think a bug report is appropriate until we have ruled out all the other possibilities. In any case a bug report only has value if it contains sufficient detail for us to reproduce the problem. Right now we haven't got enough detail to do that so we can't see what you're seeing.


Hi @DavidTurner, I would like to share the settings and mapping with you, but is there a way for me to do that via a secure channel? Is that the only evidence you need? The document definitely exists in both indexes.

Actually, here's the settings for both indexes:

The settings for the index that doesn't return the document:

{
  "lindex7v3-parsed": {
    "settings": {
      "index": {
        "search": {
          "slowlog": {
            "threshold": {
              "fetch": {
                "warn": "100ms",
                "debug": "100ms"
              },
              "query": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "refresh_interval": "1s",
        "indexing": {
          "slowlog": {
            "threshold": {
              "index": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "lindex7v3-parsed",
        "creation_date": "1578470371412",
        "analysis": {
          "analyzer": {
            "email": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "period_replace"
              ],
              "tokenizer": "standard"
            },
            "text": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "colon_replace"
              ],
              "tokenizer": "standard"
            }
          },
          "char_filter": {
            "period_replace": {
              "pattern": "\\.",
              "type": "pattern_replace",
              "replacement": " "
            },
            "colon_replace": {
              "pattern": "\\:",
              "type": "pattern_replace",
              "replacement": " "
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "OXY1ktvWQ6i7SJo5IdKBUA",
        "version": {
          "created": "6080699"
        }
      }
    }
  }
}

And these are the settings for the index that does return the document:

{
  "test": {
    "settings": {
      "index": {
        "search": {
          "slowlog": {
            "threshold": {
              "fetch": {
                "warn": "100ms",
                "debug": "100ms"
              },
              "query": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "refresh_interval": "1s",
        "indexing": {
          "slowlog": {
            "threshold": {
              "index": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "test",
        "creation_date": "1579427833166",
        "analysis": {
          "analyzer": {
            "email": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "period_replace"
              ],
              "tokenizer": "standard"
            },
            "text": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "colon_replace"
              ],
              "tokenizer": "standard"
            }
          },
          "char_filter": {
            "period_replace": {
              "pattern": "\\.",
              "type": "pattern_replace",
              "replacement": " "
            },
            "colon_replace": {
              "pattern": "\\:",
              "type": "pattern_replace",
              "replacement": " "
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "AkYUV0JFTQyEOTL0yQOfwg",
        "version": {
          "created": "6080699"
        }
      }
    }
  }
}

And this is just the "email" part of the mappings for both indexes:

This mapping is for the index that doesn't return the document:

"email": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "email"
              }
            }
          },

This mapping is for the index that does return the document:

"email": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "email"
              }
            }
          },

The queries I used are identical apart from the index name:

This doesn't find the document:

curl "http://localhost:9200/lindex7v3-parsed/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure"
      }
    }
  },
  "size": 40,
  "_source": "parsed_id"
}
'

This does:

curl "http://localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure"
      }
    }
  },
  "size": 40,
  "_source": "parsed_id"
}
'

The real index has been around for a few weeks now, and the search there can't find the document. The test index I created just now can find it. The document in both indexes has the email field containing the matching value.

Is there anything else that could have made the search work differently between the two indexes?
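If it helps, I could also run the explain API against the missing document to see why it doesn't match (THE_DOC_ID is a placeholder for its real id, and _doc for the mapping type):

```shell
# Ask Elasticsearch why this specific document does or doesn't match;
# "matched": false comes back with an explanation of the failing part.
curl "http://localhost:9200/lindex7v3-parsed/_doc/THE_DOC_ID/_explain?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": { "query": "f-secure" }
    }
  }
}
'
```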

Thanks for the detail, that's very helpful and seems to rule out quite a few things. If you make a copy of this index by reindexing it does your search still fail on the copy?

Hi @DavidTurner,

Do you mean reindexing the failing original index, or the successful test index? I created the test index by first creating the index settings and mapping, then fetching the document and indexing it into the test index. Does that work differently?

I will reindex the failing index into a copy and let you know how it goes. It will take a while because the index is relatively big.

I mean to reindex the large index on which the search is failing.
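Something along these lines should do it, assuming you first create the destination index with the same settings and mappings (the index names here just follow yours):

```shell
# Copy every document from the failing index into a fresh copy.
curl -X POST 'http://localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "lindex7v3-parsed" },
  "dest":   { "index": "lindex7v3-parsed-copy" }
}
'
```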

Hi @DavidTurner,

I have just finished the reindexing. The search result is the same for the copy: it didn't find the missing document but found the others, just like the original index.

The copy index settings:

{
  "lindex7v3-parsed-copy": {
    "settings": {
      "index": {
        "search": {
          "slowlog": {
            "threshold": {
              "fetch": {
                "warn": "100ms",
                "debug": "100ms"
              },
              "query": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "refresh_interval": "1s",
        "indexing": {
          "slowlog": {
            "threshold": {
              "index": {
                "warn": "100ms",
                "debug": "100ms"
              }
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "lindex7v3-parsed-copy",
        "creation_date": "1579431168726",
        "analysis": {
          "analyzer": {
            "email": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "period_replace"
              ],
              "tokenizer": "standard"
            },
            "text": {
              "filter": [
                "lowercase",
                "unique"
              ],
              "char_filter": [
                "colon_replace"
              ],
              "tokenizer": "standard"
            }
          },
          "char_filter": {
            "period_replace": {
              "pattern": "\\.",
              "type": "pattern_replace",
              "replacement": " "
            },
            "colon_replace": {
              "pattern": "\\:",
              "type": "pattern_replace",
              "replacement": " "
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "oReLNwrYQVSYy6qLa-wPkw",
        "version": {
          "created": "6080699"
        }
      }
    }
  }
}

The email field mapping:

"email": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "email"
              }
            }
          },

The search query:

curl "http://localhost:9200/lindex7v3-parsed-copy/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure"
      }
    }
  },
  "size": 40,
  "_source": "parsed_id"
}
'
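One more thing worth checking, though it is only a guess and not something confirmed above: match_phrase_prefix expands the prefix of the last analyzed term ("secure" here) to at most max_expansions matching terms per shard, 50 by default. On an 800K-document index, far more than 50 terms can start with "secure", so the exact term from the missing document may fall outside that cutoff, while a one-document index always fits within the limit. Raising it tests this directly:

```shell
# Same query as above, but let the final term expand to many more
# terms per shard than the default limit of 50.
curl "http://localhost:9200/lindex7v3-parsed-copy/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure",
        "max_expansions": 10000
      }
    }
  },
  "size": 40,
  "_source": "parsed_id"
}
'
```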