Operators not detected in analysis

ivanqwam · May 14, 2018, 12:49pm

The following query, run as a query_string:

voyagé AND en AND Italie

finds nothing, even though several documents contain "Nous avons voyagé en Italie". (Leave aside the discussion about how queries should be written; this is what our users enter.)

Analyzing the query with the _analyze API shows that the AND operators are not detected and are treated as regular words.
Even the standard analyzer behaves this way.

Here is my query:

curl  -H'Content-Type: application/json' "$URL/$INDEX/_search?size=10&pretty=true" -d '{ 
  "query": { 
 "query_string": { 
      "default_field": "tout", 
      "default_operator": "AND", 
      "query" : "tout:(voyagé AND en AND Italie)" 
 } 
  } 
}'

Here is our index creation request, with our custom french_analyzer:

curl -H'Content-Type: application/json' -XPUT "$URL/$INDEX" -d'{
  "settings": {
"index":    {

  "number_of_shards" : 10, 
  "number_of_replicas" : 1, 

  "analysis": {

    "filter": {

      "french_elision": {
        "type":         "elision",
        "articles_case": true,
        "articles": [
            "l", "m", "t", "qu", "n", "s",
            "j", "d", "c", "jusqu", "quoiqu",
            "lorsqu", "puisqu"
          ]
      },

      "french_stop": {
        "type":       "stop",
        "stopwords":  "_french_" 
      },


      "delimiter": {
        "type":       "word_delimiter"
      },

      "accent": {
        "type":       "asciifolding"
      },

      "french_keywords": {
        "type":       "keyword_marker",
        "keywords":   ["exemple"] 
      },

      "french_stemmer": {
        "type":       "stemmer",
        "language":   "light_french"
      }
    },

    "char_filter": {
      "myhtml_filter": {
        "type": "html_strip",
        "escaped_tags": ["strong"]
      }
    },

    "analyzer": {

      "french_analyzer": {
        "tokenizer":  "standard",
        "filter": [
          "delimiter",
          "french_elision",
          "accent",
          "lowercase",
          "french_stop",
          "french_keywords",
          "french_stemmer"
        ],
        "char_filter": ["myhtml_filter"]
      }

    }

   }
}
   }
}'

When performing analysis with any of the analyzers:

curl  -H'Content-Type: application/json' "$URL/$INDEX/_analyze?pretty=true" -d '{
  "explain": "true",
  "analyzer": "french_analyzer",
  "text": "voyagé AND en AND Italie"
}'

We get:

{
  "tokens" : [ ],
  "detail" : {
"custom_analyzer" : true,
"charfilters" : [
  {
    "name" : "myhtml_filter",
    "filtered_text" : [
      "voyagé AND en AND Italie"
    ]
  }
],
"tokenizer" : {
  "name" : "standard",
  "tokens" : [
    {
      "token" : "voyagé",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0,
      "positionLength" : 1,
      "bytes" : "[76 6f 79 61 67 c3 a9]",
      "termFrequency" : 1
    },
    {
      "token" : "AND",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1,
      "positionLength" : 1,
      "bytes" : "[41 4e 44]",
      "termFrequency" : 1
    },
    {
      "token" : "en",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2,
      "positionLength" : 1,
      "bytes" : "[65 6e]",
      "termFrequency" : 1
    },
    {
      "token" : "AND",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3,
      "positionLength" : 1,
      "bytes" : "[41 4e 44]",
      "termFrequency" : 1
    },
    {
      "token" : "Italie",
      "start_offset" : 18,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 4,
      "positionLength" : 1,
      "bytes" : "[49 74 61 6c 69 65]",
      "termFrequency" : 1
    }
  ]
},

......

  {
    "name" : "french_stemmer",
    "tokens" : [
      {
        "token" : "voyag",
        "start_offset" : 0,
        "end_offset" : 6,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[76 6f 79 61 67]",
        "termFrequency" : 1
      },
      {
        "token" : "and",
        "start_offset" : 7,
        "end_offset" : 10,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[61 6e 64]",
        "termFrequency" : 1
      },
      {
        "token" : "and",
        "start_offset" : 14,
        "end_offset" : 17,
        "type" : "<ALPHANUM>",
        "position" : 3,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[61 6e 64]",
        "termFrequency" : 1
      },
      {
        "token" : "ital",
        "start_offset" : 18,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 4,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[69 74 61 6c]",
        "termFrequency" : 1
      }
    ]
  }
]
  }
}

As you can see, everything works as expected (stemmer, accents, lowercase, filters), but the tokenizer still treats AND as a normal word.
This seems to be the reason the query returns nothing.

I must be missing something here.

abdon · May 14, 2018, 2:40pm

You're running into a bug that is scheduled to be fixed in 6.3.0. See this GitHub issue and the fix here: https://github.com/elastic/elasticsearch/pull/28871

6.3.0 will be released soon. In the meantime, what you could do is remove the ANDs from the query string itself. You already provide "default_operator": "AND", so the ANDs in the query string don't add anything.

The following query should work:

GET /json/_search
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé en Italie)"
    }
  }
}
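
If your users type the ANDs themselves, one stopgap until 6.3.0 could be to strip them on the client before sending the query. This is only a minimal sketch, assuming a POSIX shell with sed; strip_and_operators is a hypothetical helper name, not part of Elasticsearch. It is only reasonable while "default_operator" is "AND", and it is naive: it also strips AND inside quoted phrases and leaves OR and NOT untouched.

```shell
# Hypothetical helper (not from Elasticsearch): drop standalone upper-case
# AND tokens from a user query, leaving a single space in their place.
# Assumes "default_operator": "AND" is set on the query_string query,
# so the explicit operators are redundant.
strip_and_operators() {
  printf '%s\n' "$1" | sed -E 's/[[:space:]]+AND[[:space:]]+/ /g'
}

strip_and_operators 'voyagé AND en AND Italie'
# prints: voyagé en Italie
```

The pattern requires whitespace on both sides of AND, so words that merely start with those letters are left alone.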

ivanqwam · May 14, 2018, 3:18pm

Having read the bug report, I understand that this query is affected by the bug:

voyagé AND en AND Italie

That explains why the search returns nothing, so that part is clear.
I will wait for 6.3 and keep my fingers crossed.

But I do not understand why the analyzer tags AND as

"type" : "<ALPHANUM>",

Shouldn't it be tagged as something like OPERATOR?
Is there a way to tell the analyzer that we are inside a query_string and that operators should be identified accordingly?
Without that, I cannot see how to find out how a query_string will be interpreted.

Can you help me with that?

abdon · May 14, 2018, 3:27pm

Yeah, the _analyze endpoint does not understand that you want to do a query_string query that accepts operators like AND.

My favorite API to figure out what is happening is the _validate API in combination with ?rewrite=true. This will show you the Lucene query that your Elasticsearch query will be rewritten to.

GET json/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé AND en AND Italie)"
    }
  }
}

Shows you that the Lucene query is:

+tout:voyag +MatchNoDocsQuery("Matching no documents because no terms present.") +tout:ital

The +MatchNoDocsQuery(...) clause is what causes you to get no hits. On the other hand:

GET json/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé en Italie)"
    }
  }
}

shows you that the Lucene query is +tout:voyag +tout:ital, resulting in hits.
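
If you want to automate that check, a rough sketch could simply look for the MatchNoDocsQuery clause in the text returned by _validate. This assumes a POSIX shell with grep; has_match_no_docs is a hypothetical helper name, and it matches on the raw text, so no JSON parser is needed.

```shell
# Hypothetical helper: exit 0 when the given text (for example the raw
# response of _validate/query?rewrite=true) contains a MatchNoDocsQuery clause.
has_match_no_docs() {
  printf '%s\n' "$1" | grep -q 'MatchNoDocsQuery'
}

# Sample explanation copied from the rewritten query shown above.
explanation='+tout:voyag +MatchNoDocsQuery("Matching no documents because no terms present.") +tout:ital'

if has_match_no_docs "$explanation"; then
  echo "query contains a clause that matches nothing"
fi

# Against a live cluster, the text could come from something like
# (URL and INDEX as in the original post, query.json holding the request body):
# response=$(curl -s -H'Content-Type: application/json' \
#   "$URL/$INDEX/_validate/query?rewrite=true" -d @query.json)
```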

ivanqwam · May 14, 2018, 4:11pm

Wow, that's great! I tried it and it's a killer.

You answered everything.

Thank you Abdon, for your valuable help.

Consider this post closed.

system · June 11, 2018, 4:11pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
