Operator AND for match queries doesn't work

cmenendez · October 15, 2018, 9:44am

I'm using Elasticsearch 6.2.3. My index has a field called text whose analyzer applies the filters lowercase, dutch_stopwords and synonyms, and it has a sub-field called stemmer:

      "text": {
        "type": "text",
        "fields": {
          "stemmer": {
            "type": "text",
            "analyzer": "stemmer_analyzer"
          }
        },
        "analyzer": "stopwords_synonyms_analyzer",
        "search_analyzer": "standard"
      }

I have one document whose text includes STAALWAGEN and AC5-3967. When I run:

GET /index/_search
{
  "query": {
    "match": {
      "text": "staalwagen ac5-3967"
    }
  }
}

I get the document, but if I use the operator AND, I don't get any documents:

GET /index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

And if I use text.stemmer, I get other documents that contain "staalwagen" and "ac5", but not the one that contains all of the terms:

GET /documentum_v1/_search
{
  "query": {
    "match": {
      "text.stemmer": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

Any ideas? Thanks!

abdon · October 15, 2018, 11:01am

You have defined a search_analyzer (standard) that is different from the analyzer that is applied at index time (stopwords_synonyms_analyzer). The tokens produced for "staalwagen ac5-3967" by the search analyzer are probably different from the tokens produced for "staalwagen ac5-3967" by the index-time analyzer. As a result, not all of the tokens produced by your query can be found in the inverted index. To validate that this is the case, you can use the _analyze API:

For the query time analyzer:

POST index/_analyze
{
  "analyzer": "standard", 
  "text": "staalwagen ac5-3967"
}

For the index time analyzer:

POST index/_analyze
{
  "analyzer": "stopwords_synonyms_analyzer", 
  "text": "staalwagen ac5-3967"
}

(Replace index in the requests above with the actual index name.)

What I expect you will see is that not all tokens produced by the former are present in the latter.
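A minimal sketch of that comparison in Python, assuming the two _analyze response bodies have already been loaded as dicts. The function names and the sample tokens are hypothetical and only illustrate the check; they are not output from this index.

```python
def token_set(analyze_response):
    """Collect the token strings from an _analyze API response body."""
    return {entry["token"] for entry in analyze_response["tokens"]}


def missing_at_index_time(query_response, index_response):
    """Return the query-time tokens that the index-time analyzer did not produce."""
    return sorted(token_set(query_response) - token_set(index_response))


# Hypothetical, trimmed responses for illustration only.
query_time = {"tokens": [{"token": "red"}, {"token": "car"}]}
index_time = {"tokens": [{"token": "red"}, {"token": "automobile"}]}

print(missing_at_index_time(query_time, index_time))  # ['car']
```

Any token in that list can never match with "operator": "and", because the inverted index does not contain it.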

cmenendez · October 15, 2018, 12:29pm

Thank you, but all the tokens are the same:

GET /index/_analyze
{
  "analyzer": "stopwords_synonyms_analyzer",
  "text":"staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwagen",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}

GET /index/_analyze
{
  "analyzer": "stemmer_analyzer",
  "text":"staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwag",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}

GET /index/_analyze
{
  "analyzer": "standard",
  "text":"staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwagen",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}

abdon · October 15, 2018, 2:16pm

I'm not sure what could be going on here.

Would you be able to share the index settings? The output of the following command:

GET index/_settings

(I'm especially interested in the analysis section).

Also, would you be able to share the document that you can find with the first query but not with the second?

By the way, please format the requests/responses that you post on this forum using the </> button, so no special characters get lost.

cmenendez · October 15, 2018, 2:29pm

I'm not able to share the document, but the "analysis" section is included below:

{
  "index_v1": {
    "aliases": {
      "index": {}
    },
    "mappings": {
      "documentum": {
        "properties": {
          "chronicleId": {
            "type": "keyword"
          },
          "clicks": {
            "type": "float"
          },
          "indexTimestamp": {
            "type": "date"
          },
          "lastModifiedDate": {
            "type": "date"
          },
          "link": {
            "type": "keyword"
          },
          "objectId": {
            "type": "keyword"
          },
          "text": {
            "type": "text",
            "fields": {
              "stemmer": {
                "type": "text",
                "analyzer": "stemmer_analyzer"
              }
            },
            "analyzer": "stopwords_synonyms_analyzer",
            "search_analyzer": "standard"
          },
          "title": {
            "type": "text",
            "fields": {
              "stemmer": {
                "type": "text",
                "analyzer": "stemmer_analyzer"
              }
            },
            "analyzer": "stopwords_synonyms_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "index_v1",
        "creation_date": "1537514956420",
        "analysis": {
          "filter": {
            "stemmer_filter": {
              "type": "stemmer",
              "language": "dutch"
            },
            "synonyms": {
              "type": "synonym",
              "synonyms_path": "analysis/synonym.txt"
            },
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": "3",
              "max_gram": "20"
            },
            "dutch_stopwords": {
              "type": "stop",
              "stopwords": [
                "_dutch_"
              ]
            }
          },
          "analyzer": {
            "stemmer_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms",
                "stemmer_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "standard_analyzer": {
              "type": "standard"
            },
            "stopwords_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "autocomplete_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms",
                "autocomplete_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "stopwords_synonyms_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "synonyms_analyzer": {
              "filter": [
                "lowercase",
                "synonyms"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "KKXxzV5hTTy6wqpRE2dhcQ",
        "version": {
          "created": "6020399"
        }
      }
    }
  }
}

And the query I'm running, taking the synonyms into account, is:

GET /index/_search
{
    "query": {
        "match" : {
            "text" : {
                "query" : "staalwagen ac5-3967",
                "analyzer": "stopwords_synonyms_analyzer", 
                "operator" : "and"
            }
        }
    }
}

The document looks something like this:

" xxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xx xxx xxxxx xxxxx xx xxxxxxx xx xxxxx xxxxx staalwagen xxxxxx

xxxxxxx/xxxxxxx

AC5-3967

..."

abdon · October 15, 2018, 2:47pm

Thank you for that. I'm thinking the problem could be caused by the synonyms. Would you mind sharing the contents of the synonym.txt file? If it is too big to post here, you can create a gist.

If you prefer not to share the whole file, please share any entries in that file for the terms staalwagen, ac5 and 3967.

cmenendez · October 15, 2018, 2:50pm

The entries for staalwagen, ac5 and 3967:

ac5 => eschema,elektrisch schema,eplan
eschema,elektrisch schema,eplan
sw => staalwagen

abdon · October 15, 2018, 3:01pm

I think this first line in your synonym list is the problem:

ac5 => eschema,elektrisch schema,eplan

By applying these synonyms at index time, you are removing the ac5 token. It is replaced by the synonyms eschema, elektrisch schema and eplan. When you execute this query:

GET /index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

you are not applying those synonyms, because the field's search_analyzer is standard. Elasticsearch therefore tries to find documents that contain ac5 and finds none, as the synonym definition has removed that token at index time.

To fix this, you could choose to not remove the original tokens in your synonym definition. For example, change this line:

ac5 => eschema,elektrisch schema,eplan

into:

ac5 => ac5,eschema,elektrisch schema,eplan
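The difference between the two rule variants can be modelled in a few lines of Python. This is a toy model under stated assumptions, not how Elasticsearch implements synonyms: it only handles explicit "a => b, c" mappings with single-word left-hand sides, it splits multi-word replacements into separate tokens, and all function names are hypothetical.

```python
def parse_explicit_rules(lines):
    """Parse explicit mappings ("a => b, c") into {source: [replacements]}."""
    rules = {}
    for line in lines:
        if "=>" not in line:
            continue  # equivalent-synonym lines are ignored in this toy model
        left, right = line.split("=>", 1)
        replacements = [term.strip() for term in right.split(",")]
        for source in left.split(","):
            rules[source.strip()] = replacements
    return rules


def index_time_tokens(tokens, rules):
    """Replace each token that has a rule; keep every other token unchanged."""
    indexed = set()
    for token in tokens:
        for term in rules.get(token, [token]):
            indexed.update(term.split())
    return indexed


def match(query_tokens, indexed, operator="or"):
    """OR needs at least one query token in the index, AND needs all of them."""
    hits = [token in indexed for token in query_tokens]
    return all(hits) if operator == "and" else any(hits)


# Tokens as the standard tokenizer plus lowercase filter would produce them.
document = ["staalwagen", "ac5", "3967"]
query = ["staalwagen", "ac5", "3967"]  # search side: no synonyms applied

original = parse_explicit_rules(["ac5 => eschema,elektrisch schema,eplan"])
fixed = parse_explicit_rules(["ac5 => ac5,eschema,elektrisch schema,eplan"])

print(match(query, index_time_tokens(document, original), "or"))   # True
print(match(query, index_time_tokens(document, original), "and"))  # False
print(match(query, index_time_tokens(document, fixed), "and"))     # True
```

With the original rule, ac5 never reaches the index, so the default OR query still matches on staalwagen and 3967 while the AND query cannot. Keeping ac5 on the right-hand side restores the match.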

After applying that change in the synonym file, you can restart Elasticsearch to pick up the changes in the file. After that, you will also have to reindex your data. A handy trick to do so is to execute the following:

POST index/_update_by_query
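A sketch of issuing the same request from Python with only the standard library. The base URL http://localhost:9200, the helper name and the conflicts=proceed parameter (which lets the run continue past version conflicts) are assumptions that do not come from this thread; adjust them for your cluster.

```python
import urllib.request


def build_update_by_query_request(base_url, index):
    """Build a body-less POST to <index>/_update_by_query (hypothetical helper)."""
    url = f"{base_url.rstrip('/')}/{index}/_update_by_query?conflicts=proceed"
    return urllib.request.Request(url, method="POST")


request = build_update_by_query_request("http://localhost:9200", "index")
print(request.get_method(), request.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Without a request body, every document in the index is re-indexed in place from its stored _source, so the updated synonym rules are applied to the existing data.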

cmenendez · October 16, 2018, 9:31am

Thank you very much!! It's fixed.

system · November 13, 2018, 9:31am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
