First post here
I work on plain-text search (of metadata) at the Internet Archive. I am testing 2.x for a migration from our existing production 1.7.x cluster (the test cluster is now on 2.2.1, not yet 2.3).
I recently built a new index from scratch for our 2.x test cluster that should be more or less identical (in terms of documents, mapping, etc.) to our 1.7.x cluster which is in production.
I have been looking for and analyzing any differences in responses to our query stream.
Mostly things are very comparable (though we are being bitten by the 'no dots in field names' issue, which affects a small percentage of our documents: those with an open-ended schema for user-uploaded content).
But I'm posting as I've been trying to debug a set of queries in which result sets differ drastically between 1.x and 2.x.
To make a long story short: there seems to have been some change, somewhere, in how exact matches are determined, specific to the multi-word case in right-to-left languages.
In particular, as far as I can tell, query string queries that use the Lucene double-quote syntax to scope to an exact match, e.g.
subject:"ايقاع خالية", which worked in 1.x, no longer work in 2.x for right-to-left languages.
I believe the problem is unique to RTL languages but have not tested widely yet. Multi-word exact matches in English however do still work, e.g.
subject:"search debugging" matches as expected.
And, single-word (non-tokenized, that is) exact matches work in both English and Arabic.
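For anyone who wants to reproduce this, here is a minimal sketch of the request bodies involved, assuming the queries are sent as a standard query_string query. The helper name exact_phrase_query is made up for illustration; only the field name and the two phrases come from the examples above.

```python
import json


def exact_phrase_query(field, phrase, slop=None):
    """Build a query_string request body for a double-quoted phrase on one field.

    Hypothetical helper for illustration; not part of any client library.
    """
    # Escape backslashes and double quotes so the phrase stays inside the quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    query = '{}:"{}"'.format(field, escaped)
    if slop is not None:
        # Lucene proximity syntax: "foo bar"~N
        query += "~{}".format(slop)
    return {"query": {"query_string": {"query": query}}}


arabic = exact_phrase_query("subject", "ايقاع خالية")
english = exact_phrase_query("subject", "search debugging")

# json.dumps escapes non-ASCII by default, which shows the code points in
# logical (stored) order rather than in the order a terminal displays them.
print(json.dumps(arabic))
print(json.dumps(english))
```

Sending both bodies unchanged to the 1.x and 2.x clusters and comparing hit counts should isolate the difference to the phrase itself rather than to how the query is built.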
Worth noting that there is no difference in the analysis or tokenization being done: our metadata is highly variable, so we just use the default tokenizers on these fields, which have open-ended, user-defined values.
_analyze results for the Arabic field are hence identical in 1.x and 2.x... (determining this gave me a chance to learn the new syntax for the endpoint in 2.x... :))
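The comparison can be sketched roughly as below, assuming both clusters return the usual _analyze response shape (a "tokens" list whose entries carry "token", "position", "start_offset" and "end_offset"). The sample responses are hand-written stand-ins, not real output from either cluster, and the function names are made up.

```python
def token_signature(analyze_response):
    """Reduce an _analyze response to comparable (token, position, start, end) tuples."""
    return [
        (t["token"], t["position"], t["start_offset"], t["end_offset"])
        for t in analyze_response.get("tokens", [])
    ]


def diff_analysis(old_response, new_response):
    """Return (index, old_tuple, new_tuple) for every token that differs.

    A missing token on either side is reported as None.
    """
    old = token_signature(old_response)
    new = token_signature(new_response)
    diffs = []
    for i in range(max(len(old), len(new))):
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if a != b:
            diffs.append((i, a, b))
    return diffs


# Hand-written stand-ins for the two clusters' responses (illustrative only).
sample_1x = {"tokens": [
    {"token": "foo", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 1},
    {"token": "bar", "start_offset": 4, "end_offset": 7, "type": "<ALPHANUM>", "position": 2},
]}
sample_2x = {"tokens": [
    {"token": "foo", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 1},
    {"token": "bar", "start_offset": 4, "end_offset": 7, "type": "<ALPHANUM>", "position": 2},
]}

print(diff_analysis(sample_1x, sample_2x))  # empty list when the analyses agree
```

Including positions and offsets in the comparison matters here, since a phrase match depends on token positions and not only on the token text.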
Anyway... I am hoping someone can tell me if/when a change was made in Lucene or ES (or even in Java string processing of URL query parameters or the POST body?) that explains the difference, and perhaps whether this is a bug or whether there is now a requirement to use special tokenization (or something) for whitespace in RTL languages? (Which would be a problem for us, as the fields in question are open. :/)
I suspect that the issue has something to do with the handling of the whitespace word break interacting with 'left to right' Unicode special control characters, or, how tokens are enumerated for LTR vs RTL languages... but that's idle speculation...
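One cheap way to test that speculation is to check whether the query string or the stored field value contains invisible directional controls or a non-standard space between the words. A rough sketch follows; the character set is my own selection of the Unicode bidi marks, embeddings, overrides and isolates, and the function name is made up.

```python
import unicodedata

# Invisible directional controls (assumed list: marks, embeddings/overrides, isolates).
BIDI_CONTROLS = {
    "\u061c",                                          # ARABIC LETTER MARK
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_suspect_chars(text):
    """Return (index, code point, name) for bidi controls and non-ASCII spaces."""
    found = []
    for i, ch in enumerate(text):
        is_odd_space = unicodedata.category(ch) == "Zs" and ch != " "
        if ch in BIDI_CONTROLS or is_odd_space:
            found.append((i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return found


# A phrase with a right-to-left mark hidden after the space.
print(find_suspect_chars("foo \u200fbar"))
# A phrase separated by a no-break space instead of a plain space.
print(find_suspect_chars("foo\u00a0bar"))
```

If this comes back empty for both the query and the indexed value, the control-character theory is probably a dead end and the difference lies elsewhere.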
Fwiw, various attempts to hack around the problem at query construction time have not helped so far. I tried, for example, inverting the words in the search query, and using the Lucene 'proximity' control ("foo bar"~20000), with imperfect results.
My goal is to figure out how to work around this... (right now, I suspect I may have no choice but to reindex all multi-word fields in RTL languages, using an alternate word delimiter character, or something... )
Any tips or pointers infinitely appreciated! I've come up with nothing poking through Lucene and ES changelogs and the like.