Following your hypothesis that it's to do with the index size, if you delete enough documents (from the copy, obviously!) does the problematic document eventually appear in the search results? If so, how small did the index get before it started working again?
After deleting documents for a while, the search query managed to return the "missing" document once the document count was down to 552204 with a total size of 57.2GB (from 810898 documents and 77.5GB). Note: I stopped the script at some point and re-ran it, which I suppose is why the count isn't 552898 or similar.
Here's the script I used to do the deletion, in case it's relevant somehow. I first collect the document IDs of the first 1000 documents for deletion later. Then I run the search query to see whether it finds the document. If it doesn't, I delete the documents whose IDs were collected.
#!/bin/sh
HOST=localhost
INDEX=lindex7v3-parsed-copy
# ID of the document that the search query fails to return.
MISSING_ID=fca3c9c3fa7c480d372ca6d479f3d42ee14774eb9dd0c86e233ffefbb9729448

while true; do
  # Collect the IDs of the first 1000 documents (in _doc order) for deletion later.
  ids=$(curl -s "http://$HOST:9200/$INDEX/_search?size=1000&_source=false&pretty&sort=_doc" \
    | rg '\s+"_id" : "(.+)".*' -r '$1')

  # Run the search query and check whether the "missing" document shows up in the hits.
  curl -s "http://$HOST:9200/$INDEX/_search" -H 'Content-Type: application/json' -d '
  {
    "query": {
      "match_phrase_prefix": {
        "email": {
          "query": "f-secure"
        }
      }
    },
    "size": 40,
    "_source": false
  }
  ' | rg "$MISSING_ID"
  if [ $? -eq 0 ]; then
    echo 'Found it!'
    break
  fi

  # Not found yet: delete the collected documents, but never the "missing" one itself.
  for id in $ids; do
    if [ "$id" = "$MISSING_ID" ]; then
      continue
    fi
    curl -s -XDELETE "http://$HOST:9200/$INDEX/_doc/$id"
  done
done
I installed the rg tool (ripgrep) to capture the document IDs (it could probably be done with grep as well). I set the size to 40 because there are only 38 matching documents in total (including the "missing" one).
Perhaps I should also mention that this index was reindexed from ES2 to ES5 and then to ES6 (the current version), in case that provides any insight.
The script does seem to have deleted 4 relevant documents in the process: the search previously returned 37 hits (excluding the "missing" one) and now returns 34 (including it).
Even more curious. I was hoping that the transition might have happened when the index was smaller; 57GB is still too large to be useful for a reproduction.
I have some ideas for next steps that I'm discussing with colleagues. In the meantime, could you make sure you have a snapshot of the original index so that you can get back into the broken state again? It would be disappointing if we lost the ability to reproduce this.
Ok, I'm glad I asked around. It turns out this is a documented quirk of the match_phrase_prefix query you are using. The default for max_expansions is 50. If I understand correctly, when the document you seek does not contain one of the first 50 terms with the given prefix (in its segment), it will not match. As your index gets larger, the chances of this happening increase. Apparently GET _validate/query?rewrite=true&explain=true will let you see the terms in question.
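For example, something along these lines should show the rewritten query, i.e. which terms (up to 50, by default) the prefix was expanded to. This is just a sketch against your copy index using the query from your script:

curl "http://localhost:9200/lindex7v3-parsed-copy/_validate/query?rewrite=true&explain=true&pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure"
      }
    }
  }
}
'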
Your question has triggered an internal discussion on ways to make this less trappy for future users.
I see, I understand now. Once I increased max_expansions to 10000, it did find the document.
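For reference, this is roughly the request that now finds it, i.e. the same query as in the script with max_expansions added:

curl "http://localhost:9200/lindex7v3-parsed-copy/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_phrase_prefix": {
      "email": {
        "query": "f-secure",
        "max_expansions": 10000
      }
    }
  },
  "size": 40,
  "_source": false
}
'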
Is it right to say that I can use the prefix query as a replacement for the match_phrase_prefix query when there is no need for term order? But it doesn't seem to match anything when I use the same keyword I used with match_phrase_prefix.
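Something along these lines (a sketch using the same field and keyword as before) returns no hits for me:

curl "http://localhost:9200/lindex7v3-parsed-copy/_search?pretty" -H 'Content-Type: application/json' -d '
{
  "query": {
    "prefix": {
      "email": {
        "value": "f-secure"
      }
    }
  },
  "size": 40,
  "_source": false
}
'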
Indeed, with your analyzer config the string f-secure yields the terms f and secure, neither of which has the prefix f-secure. Unfortunately I'm not the right person to talk to about analyzing email addresses for the kinds of search you want here. I would suggest starting a new thread on that subject since this one is rather long and unlikely to attract the attention of an analysis expert.
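For completeness, you can verify the tokenization yourself with the _analyze API, which shows the terms the email field's analyzer produces for a given string (a sketch against your copy index):

curl "http://localhost:9200/lindex7v3-parsed-copy/_analyze?pretty" -H 'Content-Type: application/json' -d '
{
  "field": "email",
  "text": "f-secure"
}
'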