Change in ES 2.x (vs 1.x) for multiword exact match, for right-to-left languages (Arabic)?

Hi,

First post here :slight_smile:

I work on plain-text search (of metadata) at the Internet Archive. I am in the process of testing 2.x (currently 2.2.1, not yet 2.3) for migration from our existing production 1.7.x cluster.

I recently built a new index from scratch for our 2.x test cluster that should be more or less identical (in terms of documents, mapping, etc.) to our 1.7.x cluster which is in production.

I have been looking for and analyzing any differences in responses to our query stream.

Mostly things are very comparable (though we are being bitten by the 'no dots in field names' issue, which affects a small percentage of our documents, since they have an open-ended schema for user-uploaded content).

But I'm posting as I've been trying to debug a set of queries in which result sets differ drastically between 1.x and 2.x.

To make a long story short, as far as I can tell there has been a change, somewhere, in the determination of exact matches, specific to the multi-word case in right-to-left languages.

In particular: query string queries which use the Lucene double-quote syntax to scope to an exact match, e.g. subject:"ايقاع خالية", which worked in 1.x, no longer work in 2.x for right-to-left languages.

I believe the problem is unique to RTL languages but have not tested widely yet. Multi-word exact matches in English, however, do still work; e.g. subject:"search debugging" matches as expected.

And single-word (that is, non-tokenized) exact matches work in both English and Arabic.
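For concreteness, the shape of the failing request (the subject field is real; the index name here is hypothetical):

POST /my-index/_search
{
    "query" : {
        "query_string" :
           { "query": "subject:\"ايقاع خالية\"" }
    }
}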

Worth noting that there is no difference in the analysis or tokenization being done: we have highly variable metadata, so we just use default tokenizers on these fields, which have open-ended, user-defined values.

The _analyze results for the Arabic field are hence identical in both 1.x and 2.x... (determining this gave me a chance to learn the new syntax for the endpoint in 2.x... :))

Anyway... I am hoping someone can tell me if/when a change was made in Lucene or ES (or even in Java string processing of URL query parameters or POST bodies?) which explains the difference. And maybe someone knows whether this is a bug, or whether there is a requirement to use special tokenization (or something) for whitespace in RTL languages? (Which would be a problem for us, as the fields in question are open-ended. :/)

I suspect the issue has something to do with the handling of the whitespace word break interacting with Unicode directionality control characters, or with how tokens are enumerated for LTR vs. RTL languages... but that's idle speculation...

Fwiw, various attempts to hack around the problem at query-construction time have not helped so far. For example, I tried inverting the word order in the search query, and using the Lucene proximity control ("foo bar"~20000), with imperfect results.
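For reference, the proximity variant I tried looks like this (same hypothetical index name as above):

POST /my-index/_search
{
    "query" : {
        "query_string" :
           { "query": "subject:\"ايقاع خالية\"~20000" }
    }
}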

My goal is to figure out how to work around this... (right now I suspect I may have no choice but to reindex all multi-word fields in RTL languages using an alternate word-delimiter character, or something... :frowning:)

Any tips or pointers infinitely appreciated! I've come up with nothing poking through Lucene and ES changelogs and the like.

best,
aaron

This is very interesting. I don't see which analyzers/tokenizers you use for the broken case with RTL (Arabic) languages; can you give some more information? Is it the standard Arabic analyzer and the standard tokenizer?

Hi Jörg,

Thanks for the interest!

The index I'm working with is essentially a plain-text card catalog of metadata. Documents hold metadata in many languages. A few fields are dates, numeric types, etc., but most are analyzed strings using very simple analyzers.

The field in question (subject), like all the text fields, is (and has long been) used for both English and non-English terms. So we just use language-agnostic ES defaults, with only minor modifications like an underscore character filter.

Our analyzer is named textBar; here is the definition, which is identical in the v1 and v2 indices:

   "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": [
            "_=>\\u0020"
          ]
        }
      },
      "analyzer": {
        "textBar": {
          "type": "custom",
          "char_filter": [
            "underscore_to_space"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ],
          "tokenizer": "standard"
        }
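
For completeness, the subject field mapping is along these lines (the actual mapping isn't shown above, so treat this as a sketch):

      "subject": {
        "type": "string",
        "analyzer": "textBar"
      }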

With the ES v2.2.1 index that is not matching,

GET my_v2_index/_analyze
{
    "field":"subject",
    "text":"ايقاع خالي"
 }

yields

{
   "tokens": [
   {
       "token": "ايقاع",
       "start_offset": 0,
       "end_offset": 5,
       "type": "<ALPHANUM>",
       "position": 0
   },
   {
      "token": "خالي",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

For comparison, with the ES v1.7.3 index, which is matching,

GET my-v1-index/_analyze?field=subject&text=ايقاع خالية

yields

{
   "tokens": [
      {
         "token": "ايقاع",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "خالية",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

The only difference in analysis is the position enumeration, which as you can see starts from 1 in 1.7.3 and from 0 in 2.2.1...

...but I verified this change also applies to English-language multiword phrases (e.g. search debugging), and those do work in v2. :confused:

I will see if I can distill this down to something reproducible... :slight_smile:

best regards,
aaron

Arabic is RTL not LTR.

Please see this page to add the Arabic analyzer instead of the default analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#arabic-analyzer

Sarwar

Hi Sarwar,

This field is used for values across many languages (CJK, many European languages, etc.); we cannot use custom analyzers or establish different fields (for many reasons).

Worth reiterating: the simple analyzer in use worked fine in ES v1.x.

I am merely trying to debug whether there is an undocumented change in behavior, using the same settings, in ES v2.2.1... etc. :confused:

And sorry, yes, you are totally right! I have been saying LTR when I mean RTL throughout, erp. :blush: [fixed above!]

best regards,
aaron

I cannot reproduce the issue.

In this gist I get a hit with both ES 1.5.2 and ES 2.2.1. Maybe I missed something?

Thanks for that!

The mystery deepens: following your gist, for your match query I get different results(!):

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

with GET /test/doc/1 showing

{
  "_index": "test",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "message": "ايقاع خالي"
  }
}

Yet:

POST /test/doc/_search
{
    "query" : {
        "match" : {
            "message" : "ايقاع"
        }
    }
}

gives

{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "message": "ايقاع خالي"
        }
      }
    ]
  }
}

Sigh, yet this does work:

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"ايقاع خالي\"" }
    }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.38356602,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 0.38356602,
        "_source": {
          "message": "ايقاع خالي"
        }
      }
    ]
  }
}

Hmmm, copying and pasting my original field value does reproduce my failure:

PUT /test/doc/2
{
    "message" : "ايقاع خاليي"
}

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"ايقاع خاليي\"" }
    }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

There is a one-character difference, but I have to believe the issue is actually in the space/direction characters...

Ugh, I am not familiar with how/when Unicode RTL and LTR control characters are maintained or displayed. I wonder if the issue is that my source metadata carries a different decoration with these characters? :confused:

Invisible magic...

Going to tinker a bit!

Thank you for your help Jörg!

best regards,
aaron

Just noticed something: those characters display differently in my Sense than as rendered here in the forum posts.

On the posts, "my" version and yours look identical.

In my Sense (and in the original document) there is a difference in the word خاليي.

In fact, just pasting it here inline looks different. I don't think it's a font difference (though I don't know Arabic!); I am suddenly wondering if this is actually a Unicode issue.

IIRC there are multiple ways of constructing characters in scripts which use diacritics(?) to show vowels as 'decorations' on consonants... I am more familiar with this in Indic scripts... but I think Hebrew and Arabic also do this...

...so maybe my problem is actually that there is a difference in the 'folding' (coercion?) of one representation into another during document ingestion...

...and the match fails because what is stored is a coerced, simplified form; or, vice versa, that coercion is occurring somewhere at query time...

Rambling... forgive me....

Hint: you can use Java Unicode notation when storing documents, or when querying, like this:

PUT /test/doc/2
{
    "message" : "\u0671"
}
POST /test/_search
{
    "query" : {
        "match" : {
            "message" : "\u0671"
        }
    }
}

This is more reproducible, and maybe it can shed some light.

I agree that copying/pasting Unicode characters between the desktop and the browser can be messy. On a Mac it is not that serious, but it may always be an issue.

For _analyze in ES 1.x, the only way to safely pass characters to the analyzer test is URI escaping, which is another story.
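For example (the word ايقاع percent-encoded as UTF-8; host and index are illustrative):

curl 'localhost:9200/test/_analyze?analyzer=standard&text=%D8%A7%D9%8A%D9%82%D8%A7%D8%B9'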

If I had to guess, it looks like my original field at issue contains a duplicated vowel annotation(?) which is perhaps being scrubbed by a parser somewhere in our update path. If I copy and paste the relevant (right-most) word of my test string within Sense, I can match...

...but if I copy and paste out of my own post, which was itself pasted in... it does not match.

Oi.

Looking at https://www.elastic.co/guide/en/elasticsearch/guide/current/character-folding.html, I will try making a test index, indexing the relevant documents using Unicode character folding instead of asciifolding, and report my results. :slight_smile:
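Roughly what I have in mind (a sketch, assuming the ICU analysis plugin is installed; otherwise unchanged from textBar):

   "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": [
            "_=>\\u0020"
          ]
        }
      },
      "analyzer": {
        "textBarFolded": {
          "type": "custom",
          "char_filter": [
            "underscore_to_space"
          ],
          "filter": [
            "icu_folding"
          ],
          "tokenizer": "standard"
        }
      }
    }

(icu_folding already includes lowercasing, so the separate lowercase filter would be redundant.)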

I also don't know Arabic, but there are combining diacritics, e.g. "wasla" \u0671: http://www.unicode.org/L2/L2003/03166-wasla.pdf

As Wikipedia states, the handling of the "wasla" character changed as of Unicode version 6. So if your OS and Elasticsearch disagree about the Unicode version, the result might get messy.

BTW, Java 7 uses Unicode 6.0 and Java 8 uses Unicode 6.2, so you can also get different results just by switching JREs.

Oh lordy.

Installing the v2 cluster, I switched to the ES-distributed Ansible deployment role from one I had inherited and heavily customized...

...and didn't think to check whether the Java being installed differed.

Cluster running 1.7:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

Cluster running 2.2:

java version "1.7.0_95"
OpenJDK Runtime Environment (IcedTea 2.6.4) (7u95-2.6.4-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)

The indexing code path itself is identical (and both clusters are on the same 'trusty' Ubuntu), but I wonder if there are differences in Java Unicode handling, say in how non-conforming uses of combining characters are handled.

puts head down on desk and dreams with an ache in his heart of the golden forgotten days of 7-bit ASCII

OK, I think I might have it.

The RTL / Unicode aspect is, I think, a red herring :pensive:

I get different results for v1 and v2 for exact matches against sequential tokenized terms in multivalued fields.

Staying in Jörg's test index from his gist,

PUT /test/doc/3
{
    "message" : [ "foo","bar","baz" ]
}

then

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"foo bar\"" }
    }
}

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"foo bar baz\"" }
    }
}

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"bar baz\"" }
    }
}

All of these match doc 3 in v1, but not in v2.

POST /test/doc/_search
{
    "query" : {
        "query_string" :
           { "query": "message:\"foo baz\"" }
    }
}

does not match in either (thankfully...).

Important footnote: the DSL match query provides equivalent results in both v1 and v2; terms spread across multiple values are matched in both. E.g., this matches in both:

POST /test/doc/_search
{
    "query" : {
        "match" : {
            "message" : "foo bar baz"
        }
    }
}

TL;DR: the difference in behavior I am seeing is not related to RTL languages.

Instead, it's how multi-valued fields interact with the Lucene 'exact match' syntax exposed through the Query String query, or, in my actual application, its equivalent in Lucene syntax via URI Search, e.g. curl -XGET 'my-es-host:9200/test/doc/_search?q=message:"foo+bar+baz"'

It seems my users have been getting matches when querying for exact phrases like "search debugging" whenever the terms search and debugging happened to be sequential in an array of values for the target field...

...and that 2.x (I am guessing by design) makes more nuanced use of proximity when indexing multi-valued fields. Indeed, I have the vaguest recollection of hearing this mentioned in passing at Elasticon? That could be confabulation...
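One way to inspect how positions are assigned across the array values is the term vectors API (a sketch against Jörg's test index; this assumes on-the-fly term vector generation, available in these versions):

GET /test/doc/3/_termvectors?fields=message

If 2.x is inserting a larger gap between consecutive values, it should show up in the positions reported for foo, bar, and baz.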

If this analysis is correct, the 2.x behavior seems more correct (specifically for the handling of Lucene exact match)... but my users might be surprised to find their results sparser :open_mouth:

Thank you, Jörg, for all of your helpful suggestions! I was walking through the actual stored documents, examining the UTF-8 byte encodings within the document field values and comparing them to those of the search terms, when I discovered (initially in Arabic) that I was matching phrases where only one word appeared in a particular document's field values...!

best regards,
aaron

EDIT: the behavioral difference is also apparent via:

POST /test/doc/_search
{
    "query" : {
        "match_phrase" : {
            "message" : "foo bar baz"
        }
    }
}

which, as you might expect from the results above, matches in v1 but not in v2. I haven't looked with the profiler yet, but I am assuming match_phrase and the Lucene double-quote exact match execute as the same operations...
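A quick way to check that assumption short of profiling is the validate API with explain, which echoes back the rewritten Lucene query (a sketch):

GET /test/_validate/query?explain
{
    "query" : {
        "match_phrase" : {
            "message" : "foo bar baz"
        }
    }
}

I would expect both forms to come back as the same Lucene phrase query.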

EDIT 2: In hindsight, I could have predicted this from https://www.elastic.co/guide/en/elasticsearch/guide/current/_multivalue_fields_2.html

I'm guessing the difference I see is simply that the position_offset_gap default changed between v1 and v2...

...yep: https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_mapping_changes.html expressly states:

The position_offset_gap option is renamed to position_increment_gap. This was done to clear away the confusion. Elasticsearch’s position_increment_gap now is mapped directly to Lucene’s position_increment_gap

The default position_increment_gap is now 100. Indexes created in Elasticsearch 2.0.0 will default to using 100 and indexes created before that will continue to use the old default of 0. This was done to prevent phrase queries from matching across different values of the same term unexpectedly. Specifically, 100 was chosen to cause phrase queries with slops up to 99 to match only within a single value of a field.
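So if we ever needed the old behavior on a newly created index, it looks like we could pin the gap back to 0 in the mapping (a sketch, untested; index and type names hypothetical):

PUT /test_v2
{
    "mappings": {
        "doc": {
            "properties": {
                "message": {
                    "type": "string",
                    "position_increment_gap": 0
                }
            }
        }
    }
}

Alternatively, per the quoted docs, phrase queries with a slop of 100 or more would again match across values.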

Today I Learned... the hard way.

We all do. I learned the annoyances of query_string many years ago, and since then I only use match or simple_query_string.
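For example, the phrase query above via simple_query_string (a sketch):

POST /test/doc/_search
{
    "query" : {
        "simple_query_string" : {
            "query": "\"foo bar baz\"",
            "fields": ["message"]
        }
    }
}

It supports quoted phrases, but degrades gracefully on bad syntax instead of throwing parse errors back at the user.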

Sigh, yes... at the moment we are bound by a publicly exposed API which accepts Lucene-syntax queries that were originally directed at Solr...

...and while we are evolving internal services toward the DSL, many still assume they can use that same endpoint with complex Lucene/Solr-syntax queries, which are in fact still routed through the URI Search API.

We'll get there... we are just in a turning-the-boat-while-underway sort of legacy situation. :slight_smile: