How does synonym filter work in search phase


(Merlinv) #1

Hi there,
I created a sample synonym filter like the one below:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms" : ["girl, woman"]
                    }
                }
            }
        }
    },
    "mappings": {
        "items": {
            "properties": {
               "tag": {
                  "type":"text",
                  "analyzer": "synonym"
               }
            }
        }
    }
}

So when I search for girl, I get the items tagged with girl as well as the items tagged with woman.
Also, when I test it with the _analyze API:

GET /test_index/_analyze
{
  "analyzer": "synonym",
  "text": ["girl"]
}

I get 2 tokens: girl with type <ALPHANUM> and woman with type SYNONYM.

Hence I wonder: when I search for girl, is that equivalent to searching for (tag:girl AND tag:woman) or for (tag:girl OR tag:woman)?

And during the index phase, is woman added to the inverted index when a document with the girl tag is indexed?

The last question: when I update the synonym list by adding the synonyms man, boy without reindexing all of my documents, I can still get boy results when searching for man.
Is that applied only in the search phase, with a query like (tag:man OR tag:boy)?

Thanks,


(Peter Steenbergen) #2

Hi Merlin,

  1. For this part:

The last question is when I update the synonym list by adding man, boy synonyms and without reindex all of my documents, I still can get result of boy when searching for man.
Is that only applied for search phase with query like search (tag:man OR tag:boy) ?

Without reindexing, you can't search for that data, since it is not added to the index.

  2. Question:

Hence I wonder that when I search for girl, it will equal to search for (tag: girl AND tag:woman) or (tag:girl OR tag:woman)

It will be equivalent to an OR statement. If you need customization or more in-depth scoring, you can apply the synonyms at search time instead of while indexing a document.
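
A minimal sketch of that search-time variant, assuming the same 6.x mapping syntax as in the question: the field is indexed with the plain standard analyzer, and the synonym analyzer is only set as search_analyzer. The index name test_index_search is made up for illustration.

```console
# Hypothetical index: synonyms are applied to queries only, not to indexed documents
PUT /test_index_search
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms" : ["girl, woman"]
                    }
                }
            }
        }
    },
    "mappings": {
        "items": {
            "properties": {
               "tag": {
                  "type": "text",
                  "analyzer": "standard",
                  "search_analyzer": "synonym"
               }
            }
        }
    }
}
```

With a setup like this, only the original token is stored for each document, and a change to the synonym list only affects how queries are analyzed.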

Greetings,
Peter


(Merlinv) #3

Hi Peter,

Without reindexing, you can't search for that data, since it is not added to the index.

I want to confirm:
Let's say I updated the synonym list by adding man, boy without reindexing.
Then when searching for man, the search phase treats it as (tag:man OR tag:boy) (1).
boy was not added to the inverted index as a synonym, since nothing was reindexed (2).

But because of the OR in (1), the search results for man still contain the boy documents (3).
Is that right? I tested the above flow on Elastic Cloud and really did get result (3).

Thanks,


(Christoph) #4

Hi @merlinv,

I think your understanding is right, let me just try to clarify with a short example. Let's assume we set up the test index like this:

DELETE /test_index

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms" : ["girl, woman"]
                    }
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
               "tag": {
                  "type":"text",
                  "analyzer": "synonym"
               }
            }
        }
    }
}

PUT /test_index/_doc/1
{
  "tag" : "a girl"
}

PUT /test_index/_doc/2
{
  "tag" : "a woman"
}

PUT /test_index/_doc/3
{
  "tag" : "a man"
}

PUT /test_index/_doc/4
{
  "tag" : "a boy"
}

At this point, documents 1 and 2 should have been indexed with synonyms, so a match query on either "woman" or "girl" should return both documents. If you try the same with "man" or "boy", you should only get one of the documents 3 and 4 back:

POST /test_index/_doc/_search
{
  "query" : {
    "match" : {
        "tag" : "girl"
    }
  }
}

In addition to that, if we use a term query, which isn't analysed (so the synonyms are not applied to the query at search time), we also get both documents 1 and 2, because both tokens were in fact indexed with each document:

GET /test_index/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "girl"
      }
    }
  }
}

If we now change the filter and add the second synonym rule:

POST /test_index/_close

PUT /test_index/_settings
{
    "index" : {
        "analysis" : {
            "filter" : {
                "synonym" : {
                    "type" : "synonym",
                    "synonyms" : ["girl, woman", "boy, man"]
                }
            }
        }
    }
}

POST /test_index/_open

Now a match query will return both documents 3 and 4, even without reindexing. This is because the synonym filter gets applied to the query at search time, expands it to both "boy" and "man", and searches for both:

POST /test_index/_doc/_search
{
  "query" : {
    "match" : {
        "tag" : "boy"
    }
  }
}
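
One way to look at this expansion directly, sketched here under the assumption that the index and the updated filter from above are in place: the _analyze API shows the tokens the analyzer produces for the query text, and the validate API with explain=true shows how the match query is rewritten.

```console
# Show the tokens the updated analyzer produces for the query text
GET /test_index/_analyze
{
  "analyzer": "synonym",
  "text": "boy"
}

# Ask Elasticsearch to explain how the match query is rewritten
GET /test_index/_validate/query?explain=true
{
  "query" : {
    "match" : {
        "tag" : "boy"
    }
  }
}
```

I would expect the first request to list both "boy" and "man" at the same position, and the second to show both terms in the rewritten query, but the exact output depends on the version, so it is worth checking on your own cluster.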

In contrast, the term query (again, no synonym expansion here) will still only return one of the two documents, because only one term was actually indexed for each of them (the additional rule wasn't in effect yet):

GET /test_index/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "man"
      }
    }
  }
}
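
For comparison, a sketch of spelling out the OR by hand with a terms query, which is not analysed either. Assuming documents 3 and 4 from above, this should match both, because each of them contains one of the two listed terms. It is only a rough stand-in for what the match query does at search time, and scoring will differ.

```console
# Un-analysed query that matches documents containing either term
GET /test_index/_search
{
  "query": {
    "terms": {
      "tag": ["man", "boy"]
    }
  }
}
```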

If we now index a fifth document:

PUT /test_index/_doc/5
{
  "tag" : "another boy"
}

this one gets indexed using the newly updated synonym rule, so the term query will now also return it, regardless of whether you search for the term "man" or "boy":

GET /test_index/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "man"
      }
    }
  }
}

Internally, documents 3 and 4 will only have either "man" or "boy" indexed, whereas the newly indexed document contains both "boy" and "man" in the inverted index structure.
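
If the older documents should pick up the new rule as well, one option is to re-index them in place. A minimal sketch, assuming _source is enabled (the default) so that each document can be re-analysed from its stored source:

```console
# Re-index every document in place from its stored _source
POST /test_index/_update_by_query?conflicts=proceed

# Afterwards the un-analysed term query should also find documents 3 and 4
GET /test_index/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "man"
      }
    }
  }
}
```

On a large index this rewrites every document, so it may be worth restricting the update with a query in the request body.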

I hope this confirms and clarifies what was explained above.


(Merlinv) #5

Hi @cbuescher,

I understood.
Thank you for the detailed explanation.

