Indexed field to only contain synonyms

hatoms · May 21, 2020, 7:10pm

My thinking may be flawed but here is the problem I am attempting to solve.

I have a synonym filter/analyzer configured like this:

        analyzer: {
          synonym_analyzer: {
            tokenizer: 'standard',
            filter: ['lowercase', 'synonym_filter'],
          },
        },
        filter: {
          synonym_filter: {
            type: 'synonym',
            synonyms: [
              'mambo 5, mambo number 5 => mambo5',
            ],
          },
        },

///
      search_headline: {
        type: 'text',
        analyzer: "synonym_analyzer"
      },

In this example, when I index a document like this:

    { search_headline: "Mambo number 5" }

The result of the analysis will convert this field to:

    mambo5

So a search for "number 5" on this field misses: the indexed text was rewritten to mambo5, and the search term is left unchanged because it does not match any synonym rule.
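To make the miss concrete, here is a minimal Python sketch of the contraction rule. It is a toy model, not Elasticsearch's implementation: the `RULES` table and the `analyze` helper are hypothetical names, and tokenization is simplified to lowercasing plus a whitespace split.

```python
# Toy model of index-time synonym contraction; NOT Elasticsearch code.
# RULES mirrors 'mambo 5, mambo number 5 => mambo5' from the config above.
RULES = {
    ("mambo", "5"): "mambo5",
    ("mambo", "number", "5"): "mambo5",
}

def analyze(text):
    """Lowercase, split on whitespace, then replace the longest matching rule."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Try longer rules first so 'mambo number 5' is matched as a whole.
        for length in sorted({len(k) for k in RULES}, reverse=True):
            key = tuple(tokens[i:i + length])
            if key in RULES:
                out.append(RULES[key])
                i += length
                break
        else:
            # No rule starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

indexed = analyze("Mambo number 5")   # ['mambo5']
query = analyze("number 5")           # ['number', '5']
print(set(indexed) & set(query))      # set(): no shared term, so no hit
```

The indexed side holds only `mambo5`, while the query side still holds `number` and `5`, so the two token sets never overlap.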

My thought is to add an additional field:

      search_synonyms: {
        type: 'text',
        analyzer: 'synonym_analyzer'
      },
      search_headline: {
        type: 'text',
        copy_to: "search_synonyms"
      },

This way, the original field is not modified and we will apply synonym analysis on the new search_synonyms field.

My question is, is it possible to "remove" terms that don't match a synonym when indexing into the search_synonyms field?

For example, if I index this document:

    { search_headline: "My favorite song is Mambo number 5" }

I would like the indexed document to look like this:

    {
        search_synonyms: "mambo5", // could be an array if multiple synonyms are found
        search_headline: "My favorite song is Mambo number 5"
    }

That way I could search across both fields -and- the search_synonyms field won't cause so much duplication in the indexed content.

Like I said, my way of thinking may be flawed so hopefully this makes sense.

gabriel_tessier · May 24, 2020, 2:30am

Hi @hatoms

Maybe this blog post will help clear up how synonyms work.

cbuescher · May 25, 2020, 10:59am

Hi,

Is it possible to skip synonym expansion at index time and expand only at search time? For instance, with a synonym rule such as "mambo 5, mambo number 5", as in this example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "synonym_filter"]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "mambo 5, mambo number 5"
            ]
          }
        }
    }
  },
  "mappings": {
    "properties": {
      "title" : {
        "type": "text",
        "analyzer": "standard", 
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}

PUT /my_index/_bulk
{"index" : {}}
{"title" : "mambo 5"}
{"index" : {}}
{"title" : "mambo number 5"}
{"index" : {}}
{"title" : "number 5 lives"}

POST /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "mambo number 5"
    }
  }
}

Searching for "mambo 5" and "mambo number 5" should expand to the other case respectively, so even though you indexed them in different ways (using only "standard" analyzer at index time), both docs should be matched. Or isn't this feasible for some other reason?

Cheers,

hatoms · May 27, 2020, 1:16am

This is definitely the approach I'm leaning toward (expansion at search time). I will need to do more testing to validate this, but I don't think match_phrase works for me.

The biggest issue I see (which your example sidesteps by using match_phrase) shows up with this adjusted synonym list:

            "synonyms": [
              "mambo, mambo 5, mambo number 5"
            ]

At search time, a query for mambo expands to the following terms:

    ["mambo", "number", "5"]

Thus, it would match a document like this:

    { "headline": "5 is the best number"}

The user never typed 5, but multi-word synonym expansion generates the token '5', which causes the query to match unrelated documents.
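The effect can be reproduced with a small Python sketch. This is a toy model under stated assumptions, not Elasticsearch's implementation: `SYNONYMS`, `flat_expand`, and `matches_any_term` are hypothetical names, a synonym group is only applied when the whole query equals one of its entries, and matching is a plain OR over terms.

```python
# Toy model of flat (position-unaware) search-time expansion; NOT Elasticsearch code.
SYNONYMS = [["mambo", "mambo 5", "mambo number 5"]]

def tokenize(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()

def flat_expand(query):
    """Return the bag of single terms left after every synonym phrase is flattened."""
    terms = set(tokenize(query))
    for group in SYNONYMS:
        if query.lower() in group:
            for phrase in group:
                terms.update(tokenize(phrase))
    return terms

def matches_any_term(query, doc):
    """OR semantics: a single shared term is enough for a hit."""
    return bool(flat_expand(query) & set(tokenize(doc)))

print(sorted(flat_expand("mambo")))                        # ['5', 'mambo', 'number']
print(matches_any_term("mambo", "5 is the best number"))   # True
```

Once the phrases are flattened into independent terms, nothing records that `5` was only meant to match directly after `mambo`, so the unrelated headline is recalled.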

Maybe this is a better example:

            "synonyms": [
              "blockbuster, movie rental"
            ]
            //..
            headline: {
              type: 'text',
              analyzer: 'standard',
              search_analyzer: 'synonym_analyzer',
            },

Where these documents are indexed:

    { headline: "Blockbuster" }
    { headline: "Spaceballs: The Movie" }

If you search for:

    multi_match: {
        query: "Blockbuster",
        fields: [ "headline" ]
    }

Then you would get Spaceballs: The Movie as a result, even though it is not relevant; it is included only because the multi-word synonym expansion produces the token "movie".

Edit: I also wanted to say thanks; I really appreciate the input.

hatoms · May 27, 2020, 1:21am

Thank you for the link. I based my original proof of concept on this article and others, but multi-word synonym expansion is the core issue I have trouble understanding.

I'm not sure if I'm overthinking the effect this will have on scoring documents, but I can see a case where the search term is expanded into multiple tokens that recall documents completely unrelated to the initial search.

I've tried to detail my research in the hope that someone has thought through the same scenario.

hatoms · May 27, 2020, 1:39am

I should have added that I was using the synonym token filter.

Changing this to synonym_graph excludes those unwanted matches in my test bench. From the brief, unfinished reading I've done, that is because the synonym_graph filter is position-aware.
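A rough way to picture the difference is the Python sketch below. It is a toy model and not how Elasticsearch is implemented: the names `SYNONYMS`, `phrase_alternatives`, and `matches_any_phrase` are hypothetical, and it assumes that each multi-word synonym is matched as an ordered, contiguous phrase, which is my understanding of what a match query does with synonym_graph when `auto_generate_synonyms_phrase_query` is left at its default.

```python
# Toy model of position-aware synonym matching; NOT Elasticsearch code.
SYNONYMS = [["mambo", "mambo 5", "mambo number 5"]]

def tokenize(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()

def phrase_alternatives(query):
    """Keep each synonym as an ordered token sequence instead of a bag of terms."""
    q = query.lower()
    for group in SYNONYMS:
        if q in group:
            return [tokenize(phrase) for phrase in group]
    # No synonym group applies: fall back to an OR over the single query terms.
    return [[term] for term in tokenize(q)]

def contains_phrase(doc_tokens, phrase):
    """True if the phrase occurs as a contiguous run in the document tokens."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))

def matches_any_phrase(query, doc):
    doc_tokens = tokenize(doc)
    return any(contains_phrase(doc_tokens, phrase)
               for phrase in phrase_alternatives(query))

print(matches_any_phrase("mambo", "5 is the best number"))                # False
print(matches_any_phrase("mambo", "My favorite song is Mambo number 5"))  # True
```

Because `5` only counts when it follows `mambo` (or `mambo number`), the unrelated headline drops out while the intended document still matches.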

I'll finish reading up on token streams and the synonym_graph filter and get back to this thread.

hatoms · May 29, 2020, 2:32pm

Using the synonym_graph token filter for search-time expansion works for my use case! Thanks for the input.

system · June 26, 2020, 2:32pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
