Operators not detected in analysis


(Ivan Monnier) #1

Running this query as a query_string:

voyagé AND en AND Italie  

finds nothing, despite several documents containing "Nous avons voyagé en Italie". (Never mind the debate about how queries should be written; this is what our users type :wink:)

Analyzing the query with the analyzer debugger, it seems that the AND operators are not detected and are treated as regular words.
Even the standard analyzer behaves this way.

Here is my query:

curl  -H'Content-Type: application/json' "$URL/$INDEX/_search?size=10&pretty=true" -d '{ 
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé AND en AND Italie)"
    }
  }
}'

Here is our index creation request, with our french_analyzer:

curl -H'Content-Type: application/json' -XPUT "$URL/$INDEX" -d'{
  "settings": {
"index":    {

  "number_of_shards" : 10, 
  "number_of_replicas" : 1, 

  "analysis": {

    "filter": {

      "french_elision": {
        "type":         "elision",
        "articles_case": true,
        "articles": [
            "l", "m", "t", "qu", "n", "s",
            "j", "d", "c", "jusqu", "quoiqu",
            "lorsqu", "puisqu"
          ]
      },

      "french_stop": {
        "type":       "stop",
        "stopwords":  "_french_" 
      },


      "delimiter": {
        "type":       "word_delimiter"
      },

      "accent": {
        "type":       "asciifolding"
      },

      "french_keywords": {
        "type":       "keyword_marker",
        "keywords":   ["exemple"] 
      },

      "french_stemmer": {
        "type":       "stemmer",
        "language":   "light_french"
      }
    },

    "char_filter": {
      "myhtml_filter": {
        "type": "html_strip",
        "escaped_tags": ["strong"]
      }
    },

    "analyzer": {

      "french_analyzer": {
        "tokenizer":  "standard",
        "filter": [
          "delimiter",
          "french_elision",
          "accent",
          "lowercase",
          "french_stop",
          "french_keywords",
          "french_stemmer"
        ],
        "char_filter": ["myhtml_filter"]
      }

    }

  }
}
  }
}'

When running the analysis with any of the analyzers:

curl  -H'Content-Type: application/json' "$URL/$INDEX/_analyze?pretty=true" -d '{
  "explain": "true",
  "analyzer": "french_analyzer",
  "text": "voyagé AND en AND Italie"
}'

We get:

{
  "tokens" : [ ],
  "detail" : {
"custom_analyzer" : true,
"charfilters" : [
  {
    "name" : "myhtml_filter",
    "filtered_text" : [
      "voyagé AND en AND Italie"
    ]
  }
],
"tokenizer" : {
  "name" : "standard",
  "tokens" : [
    {
      "token" : "voyagé",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0,
      "positionLength" : 1,
      "bytes" : "[76 6f 79 61 67 c3 a9]",
      "termFrequency" : 1
    },
    {
      "token" : "AND",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1,
      "positionLength" : 1,
      "bytes" : "[41 4e 44]",
      "termFrequency" : 1
    },
    {
      "token" : "en",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2,
      "positionLength" : 1,
      "bytes" : "[65 6e]",
      "termFrequency" : 1
    },
    {
      "token" : "AND",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3,
      "positionLength" : 1,
      "bytes" : "[41 4e 44]",
      "termFrequency" : 1
    },
    {
      "token" : "Italie",
      "start_offset" : 18,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 4,
      "positionLength" : 1,
      "bytes" : "[49 74 61 6c 69 65]",
      "termFrequency" : 1
    }
  ]
},

......

  {
    "name" : "french_stemmer",
    "tokens" : [
      {
        "token" : "voyag",
        "start_offset" : 0,
        "end_offset" : 6,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[76 6f 79 61 67]",
        "termFrequency" : 1
      },
      {
        "token" : "and",
        "start_offset" : 7,
        "end_offset" : 10,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[61 6e 64]",
        "termFrequency" : 1
      },
      {
        "token" : "and",
        "start_offset" : 14,
        "end_offset" : 17,
        "type" : "<ALPHANUM>",
        "position" : 3,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[61 6e 64]",
        "termFrequency" : 1
      },
      {
        "token" : "ital",
        "start_offset" : 18,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 4,
        "positionLength" : 1,
        "keyword" : false,
        "bytes" : "[69 74 61 6c]",
        "termFrequency" : 1
      }
    ]
  }
]
  }
}

As you can see, everything works as expected (stemmer, accents, lowercase, filters), but the tokenizer still treats AND as a normal word.
This seems to be the reason this query returns nothing.
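To make an explain output like the one above easier to scan, here is a minimal Python sketch that lists the tokens surviving each stage. The function name `tokens_by_stage` is hypothetical, and the `tokenfilters` key is an assumption about the part of the response elided above; the `tokenizer` key and the per-token `token` field are taken from the output shown.

```python
def tokens_by_stage(analyze_response):
    """Return [(stage_name, [token, ...]), ...] from an _analyze explain response.

    Assumes the response has a "detail" object holding a "tokenizer" entry and
    a "tokenfilters" list (the latter is an assumption; it is elided above).
    """
    detail = analyze_response.get("detail", {})
    stages = []

    tokenizer = detail.get("tokenizer")
    if tokenizer:
        stages.append(
            (tokenizer["name"], [t["token"] for t in tokenizer.get("tokens", [])])
        )

    # Each token filter reports the token stream as it looks after that filter.
    for token_filter in detail.get("tokenfilters", []):
        stages.append(
            (token_filter["name"],
             [t["token"] for t in token_filter.get("tokens", [])])
        )
    return stages
```

Fed the response above, the tokenizer stage would list both AND tokens and the final stage would list both "and" tokens, which confirms that _analyze treats them as ordinary words at every step.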

I must be missing something here...


(Abdon Pijpelink) #2

You're running into a bug that is scheduled to be fixed in 6.3.0. See this GitHub issue and the fix here: https://github.com/elastic/elasticsearch/pull/28871

6.3.0 will be released soon. In the meantime, what you could do is remove the ANDs from the query string itself. You already provide "default_operator": "AND", so the ANDs in the query string don't add anything.

The following query should work:

GET /json/_search
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé en Italie)"
    }
  }
}
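If the query text comes straight from users, the workaround above could be applied on the client side before the request is sent. The following is a minimal Python sketch, not part of Elasticsearch: `strip_explicit_and` and `build_search_body` are hypothetical helpers, and the approach assumes "default_operator": "AND" is always set and that users do not mix AND with OR, where removing operators could change the meaning.

```python
import re

# A token is a run of non-space characters; a double-quoted phrase counts as
# part of the token, so an AND inside quotes is never seen on its own.
_TOKEN = re.compile(r'(?:"[^"]*"|\S)+')


def strip_explicit_and(query):
    """Drop standalone uppercase AND tokens outside quoted phrases."""
    kept = [tok for tok in _TOKEN.findall(query) if tok != "AND"]
    return " ".join(kept)


def build_search_body(user_query, field="tout"):
    """Build a query_string body that relies on default_operator instead."""
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "default_operator": "AND",
                "query": strip_explicit_and(user_query),
            }
        }
    }
```

For example, `strip_explicit_and("tout:(voyagé AND en AND Italie)")` returns `"tout:(voyagé en Italie)"`, which is the query string used above. Note that the helper also collapses runs of whitespace between tokens.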

(Ivan Monnier) #3

Reading the bug report, I understand that this query is affected by the bug:

voyagé AND en AND Italie

That explains why the search returns nothing.
OK for that part.
I will wait for 6.3 and keep my fingers crossed :wink:

But I do not understand why AND is tagged

"type" : "<ALPHANUM>",

by the analyzer?
Shouldn't it be tagged as something like OPERATOR?
Is there a way to tell the analyzer that we are inside a query_string and that operators should be identified accordingly?
Without that, I cannot see how to predict how a query_string will be interpreted.

Can you help me with that?


(Abdon Pijpelink) #4

Yeah, the _analyze endpoint does not understand that you want to do a query_string query that accepts operators like AND.

My favorite API to figure out what is happening is the _validate API in combination with ?rewrite=true. This will show you the Lucene query that your Elasticsearch query will be rewritten to.

GET json/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé AND en AND Italie)"
    }
  }
}

This shows you that the Lucene query is:

+tout:voyag +MatchNoDocsQuery("Matching no documents because no terms present.") +tout:ital

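The +MatchNoDocsQuery(...) clause is what causes you to get no hits: it is a required clause that can never match. It sits where "en" would be, which appears to have been removed as a stop word, leaving that clause with no terms. On the other hand: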

GET json/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé en Italie)"
    }
  }
}

shows you that the Lucene query is +tout:voyag +tout:ital, resulting in hits.
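When checking many user queries this way, the rewritten query can be scanned automatically for the clause that kills the results. Here is a minimal Python sketch under stated assumptions: `unmatched_clauses` is a hypothetical helper, and it assumes the _validate response carries an "explanations" list whose entries have "index" and "explanation" fields, with the rewritten Lucene query in "explanation".

```python
def unmatched_clauses(validate_response):
    """Return (index, explanation) pairs whose rewritten query contains a
    MatchNoDocsQuery clause, i.e. a clause that can never match.

    Assumes the response shape described in the lead-in; entries without an
    "explanation" field are skipped.
    """
    suspicious = []
    for entry in validate_response.get("explanations", []):
        explanation = entry.get("explanation", "")
        if "MatchNoDocsQuery" in explanation:
            suspicious.append((entry.get("index"), explanation))
    return suspicious
```

An empty result does not prove the query will return hits; it only says that no clause was rewritten to MatchNoDocsQuery.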


(Ivan Monnier) #5

Wow, that's great! I tried it and it's a killer.

You answered everything.

Thank you Abdon, for your valuable help.

Consider this post closed.

