Operator AND for match queries doesn't work


(Candela) #1

I'm using Elasticsearch 6.2.3 with an index that has a field called text. Its analyzer uses the filters lowercase, dutch_stopwords and synonyms, and it has a sub-field called stemmer:

      "text": {
        "type": "text",
        "fields": {
          "stemmer": {
            "type": "text",
            "analyzer": "stemmer_analyzer"
          }
        },
        "analyzer": "stopwords_synonyms_analyzer",
        "search_analyzer": "standard"
      }

I have one document whose text includes STAALWAGEN and AC5-3967. When I run:

GET /index/_search
{
  "query": {
    "match": {
      "text": "staalwagen ac5-3967"
    }
  }
}

I get the document, but if I use the operator AND, I don't get any documents:

GET /index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

And if I use text.stemmer, I get other documents with "staalwagen" and "ac5", but not the one that contains all the terms:

GET /documentum_v1/_search
{
  "query": {
    "match": {
      "text.stemmer": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

Any ideas? Thanks!


(Abdon Pijpelink) #2

You have defined a search_analyzer (standard) that is different from the analyzer that is applied at index time (stopwords_synonyms_analyzer). The tokens produced for "staalwagen ac5-3967" by the search analyzer are probably different from the tokens produced for "staalwagen ac5-3967" by the index-time analyzer. As a result, not all of the tokens produced by your query can be found in the inverted index, and with the AND operator every query token has to be found. To validate that this is the case, you can use the _analyze API:

For the query time analyzer:

POST index/_analyze
{
  "analyzer": "standard", 
  "text": "staalwagen ac5-3967"
}

For the index time analyzer:

POST index/_analyze
{
  "analyzer": "stopwords_synonyms_analyzer", 
  "text": "staalwagen ac5-3967"
}

(Replace index in the requests above with the actual index name.)

I expect you will see that not all tokens produced by the former are present in the latter.
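
If the token lists get long, the comparison can be scripted. The following is only a minimal Python sketch: the function name `missing_query_tokens` is made up for this post, and the two responses are hypothetical, trimmed stand-ins for real _analyze output (only the "token" key is kept).

```python
def missing_query_tokens(index_response, query_response):
    """Return query-time tokens that the index-time analyzer did not produce."""
    index_terms = {t["token"] for t in index_response["tokens"]}
    return [t["token"] for t in query_response["tokens"]
            if t["token"] not in index_terms]


# Hypothetical responses, not taken from a real cluster.
index_time = {"tokens": [{"token": "staalwagen"}, {"token": "3967"}]}
query_time = {"tokens": [{"token": "staalwagen"}, {"token": "ac5"},
                         {"token": "3967"}]}

print(missing_query_tokens(index_time, query_time))  # ['ac5']
```

Any token this reports is one that an AND query would require but the index would not contain for that text.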


(Candela) #3

Thank you, but all the tokens are the same:

GET /index/_analyze
{
  "analyzer": "stopwords_synonyms_analyzer",
  "text":"staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwagen",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}

GET /index/_analyze
{
  "analyzer": "stemmer_analyzer",
  "text":"staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwag",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}

GET /index/_analyze
{
  "analyzer": "standard",
  "text": "staalwagen AC5-3967"
}

{
  "tokens": [
    {
      "token": "staalwagen",
      "start_offset": 0,
      "end_offset": 10,
      "type": "",
      "position": 0
    },
    {
      "token": "ac5",
      "start_offset": 11,
      "end_offset": 14,
      "type": "",
      "position": 1
    },
    {
      "token": "3967",
      "start_offset": 15,
      "end_offset": 19,
      "type": "",
      "position": 2
    }
  ]
}


(Abdon Pijpelink) #4

I'm not sure what could be going on here.

Would you be able to share the index settings? The output of the following command:

GET index/_settings

(I'm especially interested in the analysis section).

Also, would you be able to share the document that you can find with the first query but not with the second?

By the way, please format the requests/responses that you post on this forum using the </> button, so no special characters get lost.


(Candela) #5

I can't share the document, but here are the index mappings and settings, including the "analysis" section:

{
  "index_v1": {
    "aliases": {
      "index": {}
    },
    "mappings": {
      "documentum": {
        "properties": {
          "chronicleId": {
            "type": "keyword"
          },
          "clicks": {
            "type": "float"
          },
          "indexTimestamp": {
            "type": "date"
          },
          "lastModifiedDate": {
            "type": "date"
          },
          "link": {
            "type": "keyword"
          },
          "objectId": {
            "type": "keyword"
          },
          "text": {
            "type": "text",
            "fields": {
              "stemmer": {
                "type": "text",
                "analyzer": "stemmer_analyzer"
              }
            },
            "analyzer": "stopwords_synonyms_analyzer",
            "search_analyzer": "standard"
          },
          "title": {
            "type": "text",
            "fields": {
              "stemmer": {
                "type": "text",
                "analyzer": "stemmer_analyzer"
              }
            },
            "analyzer": "stopwords_synonyms_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "index_v1",
        "creation_date": "1537514956420",
        "analysis": {
          "filter": {
            "stemmer_filter": {
              "type": "stemmer",
              "language": "dutch"
            },
            "synonyms": {
              "type": "synonym",
              "synonyms_path": "analysis/synonym.txt"
            },
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": "3",
              "max_gram": "20"
            },
            "dutch_stopwords": {
              "type": "stop",
              "stopwords": [
                "_dutch_"
              ]
            }
          },
          "analyzer": {
            "stemmer_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms",
                "stemmer_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "standard_analyzer": {
              "type": "standard"
            },
            "stopwords_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "autocomplete_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms",
                "autocomplete_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "stopwords_synonyms_analyzer": {
              "filter": [
                "lowercase",
                "dutch_stopwords",
                "synonyms"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "synonyms_analyzer": {
              "filter": [
                "lowercase",
                "synonyms"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "KKXxzV5hTTy6wqpRE2dhcQ",
        "version": {
          "created": "6020399"
        }
      }
    }
  }
}

And the query I'm running that takes the synonyms into account is:

GET /index/_search
{
    "query": {
        "match" : {
            "text" : {
                "query" : "staalwagen ac5-3967",
                "analyzer": "stopwords_synonyms_analyzer", 
                "operator" : "and"
            }
        }
    }
}

The document looks something like this:

" xxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xx xxx xxxxx xxxxx xx xxxxxxx xx xxxxx xxxxx staalwagen xxxxxx

xxxxxxx/xxxxxxx

AC5-3967

..."


(Abdon Pijpelink) #6

Thank you for that. I think the problem could be caused by the synonyms. Would you mind sharing the contents of the synonym.txt file? If it is too big to post here, you can create a gist.

If you prefer not to share the whole file, please share any entries in that file for the terms staalwagen, ac5 and 3967


(Candela) #7

The entries for staalwagen, ac5 and 3967:

ac5 => eschema,elektrisch schema,eplan
eschema,elektrisch schema,eplan
sw => staalwagen


(Abdon Pijpelink) #8

I think this first line in your synonym list is the problem:

ac5 => eschema,elektrisch schema,eplan

By applying these synonyms at index time, you are removing the ac5 token. It is replaced by the synonyms eschema, elektrisch schema and eplan. When you execute this query:

GET /index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "staalwagen ac5-3967",
        "operator": "and"
      }
    }
  }
}

you are not applying those synonyms, because the search_analyzer for the text field is standard. So Elasticsearch will try to find documents that contain ac5 and it will find none, as the synonym definition has removed that token at index time.
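
This also explains why the query without the operator still finds the document. Here is a toy Python model of that behaviour. It is not how Elasticsearch is implemented; the names `REPLACE`, `index_terms` and `matches` are invented for the illustration, and the multi-word synonym is simply split on whitespace.

```python
# Toy model: the rule from the thread, applied at index time only.
REPLACE = {"ac5": ["eschema", "elektrisch", "schema", "eplan"]}


def index_terms(tokens, rules):
    """Replace each token that has a rule; keep all other tokens unchanged."""
    terms = set()
    for tok in tokens:
        terms.update(rules.get(tok, [tok]))
    return terms


def matches(query_tokens, terms, operator):
    """'or' needs at least one query token in the index, 'and' needs all."""
    hits = [tok in terms for tok in query_tokens]
    return all(hits) if operator == "and" else any(hits)


doc_terms = index_terms(["staalwagen", "ac5", "3967"], REPLACE)
query = ["staalwagen", "ac5", "3967"]  # standard analyzer: no synonyms applied

print(matches(query, doc_terms, "or"))   # True
print(matches(query, doc_terms, "and"))  # False
```

With the default OR operator, staalwagen and 3967 are enough for a hit; with AND, the missing ac5 term rules the document out.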

To fix this, you could choose to not remove the original tokens in your synonym definition. For example, change this line:

ac5 => eschema,elektrisch schema,eplan

into:

ac5 => ac5,eschema,elektrisch schema,eplan

After applying that change in the synonym file, you can restart Elasticsearch to pick up the changes in the file. After that, you will also have to reindex your data, because the documents that are already indexed were analyzed with the old synonyms. A handy trick to do so is to execute the following:

POST index/_update_by_query
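
To make the difference between the two synonym lines concrete, here is a rough Python sketch. It assumes a single token on the left of "=>" and splits multi-word synonyms on whitespace; `parse_rule`, `apply_rule` and `and_matches` are hypothetical helpers, not part of Elasticsearch.

```python
OLD_RULE = "ac5 => eschema,elektrisch schema,eplan"
NEW_RULE = "ac5 => ac5,eschema,elektrisch schema,eplan"


def parse_rule(line):
    """Parse 'a => b,c d' into {'a': ['b', 'c', 'd']} (explicit mappings only)."""
    left, right = line.split("=>", 1)
    outputs = [word for phrase in right.split(",") for word in phrase.split()]
    return {left.strip(): outputs}


def apply_rule(tokens, rule):
    """Replace tokens that appear on the left-hand side of the rule."""
    out = []
    for tok in tokens:
        out.extend(rule.get(tok, [tok]))
    return out


def and_matches(rule_line):
    """Index the document tokens with the rule, then require every query token."""
    doc = ["staalwagen", "ac5", "3967"]
    query = ["staalwagen", "ac5", "3967"]  # no synonyms at search time
    terms = set(apply_rule(doc, parse_rule(rule_line)))
    return all(tok in terms for tok in query)


print(and_matches(OLD_RULE))  # False
print(and_matches(NEW_RULE))  # True
```

Only the second line keeps ac5 among the indexed terms, which is what the AND query needs.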

(Candela) #9

Thank you very much!! It's fixed.

