When running this query as a query_string:
voyagé AND en AND Italie
it finds nothing, despite several documents containing "Nous avons voyagé en Italie". (Let's set aside the discussion about how queries should be written; this is what our users type.)
When analyzing the query with the analyzer debugger, it seems the AND operators are not detected and are treated as regular words.
Even the standard analyzer behaves this way.
Here is my query:
curl -H'Content-Type: application/json' "$URL/$INDEX/_search?size=10&pretty=true" -d '{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé AND en AND Italie)"
    }
  }
}'
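For what it's worth, one way to check how query_string actually parses the operators (independently of any analyzer) is the validate API with explain. This is only a sketch against the same $URL/$INDEX variables as above:

```shell
# Ask Elasticsearch to rewrite the query and print the resulting Lucene query;
# operators recognized by the query_string parser should not appear as terms.
curl -H'Content-Type: application/json' "$URL/$INDEX/_validate/query?explain=true&pretty=true" -d '{
  "query": {
    "query_string": {
      "default_field": "tout",
      "default_operator": "AND",
      "query": "tout:(voyagé AND en AND Italie)"
    }
  }
}'
```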
Here is our index creation with our french_analyzer:
curl -H'Content-Type: application/json' -XPUT "$URL/$INDEX" -d'{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 1,
      "analysis": {
        "filter": {
          "french_elision": {
            "type": "elision",
            "articles_case": true,
            "articles": [
              "l", "m", "t", "qu", "n", "s",
              "j", "d", "c", "jusqu", "quoiqu",
              "lorsqu", "puisqu"
            ]
          },
          "french_stop": {
            "type": "stop",
            "stopwords": "_french_"
          },
          "delimiter": {
            "type": "word_delimiter"
          },
          "accent": {
            "type": "asciifolding"
          },
          "french_keywords": {
            "type": "keyword_marker",
            "keywords": ["exemple"]
          },
          "french_stemmer": {
            "type": "stemmer",
            "language": "light_french"
          }
        },
        "char_filter": {
          "myhtml_filter": {
            "type": "html_strip",
            "escaped_tags": ["strong"]
          }
        },
        "analyzer": {
          "french_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "delimiter",
              "french_elision",
              "accent",
              "lowercase",
              "french_stop",
              "french_keywords",
              "french_stemmer"
            ],
            "char_filter": ["myhtml_filter"]
          }
        }
      }
    }
  }
}'
When performing analysis with any of the analyzers:
curl -H'Content-Type: application/json' "$URL/$INDEX/_analyze?pretty=true" -d '{
  "explain": "true",
  "analyzer": "french_analyzer",
  "text": "voyagé AND en AND Italie"
}'
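For comparison, the same endpoint can be pointed at the sentence that is actually in the documents, to see which terms end up in the index (a sketch; with french_stop active, the stopwords nous, avons, and en should likely be dropped, leaving only stemmed content words such as voyag and ital):

```shell
# Analyze the indexed sentence rather than the query string,
# to compare index-time terms against query-time terms.
curl -H'Content-Type: application/json' "$URL/$INDEX/_analyze?pretty=true" -d '{
  "analyzer": "french_analyzer",
  "text": "Nous avons voyagé en Italie"
}'
```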
We get:
{
"tokens" : [ ],
"detail" : {
"custom_analyzer" : true,
"charfilters" : [
{
"name" : "myhtml_filter",
"filtered_text" : [
"voyagé AND en AND Italie"
]
}
],
"tokenizer" : {
"name" : "standard",
"tokens" : [
{
"token" : "voyagé",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0,
"positionLength" : 1,
"bytes" : "[76 6f 79 61 67 c3 a9]",
"termFrequency" : 1
},
{
"token" : "AND",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1,
"positionLength" : 1,
"bytes" : "[41 4e 44]",
"termFrequency" : 1
},
{
"token" : "en",
"start_offset" : 11,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2,
"positionLength" : 1,
"bytes" : "[65 6e]",
"termFrequency" : 1
},
{
"token" : "AND",
"start_offset" : 14,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3,
"positionLength" : 1,
"bytes" : "[41 4e 44]",
"termFrequency" : 1
},
{
"token" : "Italie",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 4,
"positionLength" : 1,
"bytes" : "[49 74 61 6c 69 65]",
"termFrequency" : 1
}
]
},
......
{
"name" : "french_stemmer",
"tokens" : [
{
"token" : "voyag",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0,
"positionLength" : 1,
"keyword" : false,
"bytes" : "[76 6f 79 61 67]",
"termFrequency" : 1
},
{
"token" : "and",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1,
"positionLength" : 1,
"keyword" : false,
"bytes" : "[61 6e 64]",
"termFrequency" : 1
},
{
"token" : "and",
"start_offset" : 14,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3,
"positionLength" : 1,
"keyword" : false,
"bytes" : "[61 6e 64]",
"termFrequency" : 1
},
{
"token" : "ital",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 4,
"positionLength" : 1,
"keyword" : false,
"bytes" : "[69 74 61 6c]",
"termFrequency" : 1
}
]
}
]
}
}
As you can see, everything is working as expected (stemmer, accents, lowercase, stopwords: en is dropped at position 2), but the tokenizer still treats AND as a regular word, and it survives the whole chain as the term and.
This seems to be why the query returns nothing.
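As a point of comparison (just a sketch, not something we have deployed), a match query sends the whole text through the analyzer with no query syntax at all, and its operator parameter plays the role the AND keywords were meant to play:

```shell
# match analyzes "voyagé en Italie" as plain text; the AND semantics
# come from the "operator" parameter, not from keywords in the text.
curl -H'Content-Type: application/json' "$URL/$INDEX/_search?size=10&pretty=true" -d '{
  "query": {
    "match": {
      "tout": {
        "query": "voyagé en Italie",
        "operator": "and"
      }
    }
  }
}'
```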
Am I missing something here?