Indexed field to only contain synonyms

My thinking may be flawed but here is the problem I am attempting to solve.

I have a synonym filter/analyzer configured like this:

        analyzer: {
          synonym_analyzer: {
            tokenizer: 'standard',
            filter: ['lowercase', 'synonym_filter'],
          },
        },
        filter: {
          synonym_filter: {
            type: 'synonym',
            synonyms: [
              'mambo 5, mambo number 5 => mambo5',
            ],
          },
        },

///
      search_headline: {
        type: 'text',
        analyzer: "synonym_analyzer"
      },

In this example, when I index a document like this:

    { search_headline: "Mambo number 5" }

The result of the analysis will convert this field to:

    mambo5

So a search for "number 5" on this field misses: the indexed tokens were replaced with mambo5, and the search term is left unchanged because it does not match any synonym rule.
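To make the miss concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not how Elasticsearch implements the synonym filter: `tokenize`, `analyze` and `RULES` are hypothetical names, and the tokenizer is only a rough stand-in for `standard` plus `lowercase`.

```python
import re

# Toy model of the rule 'mambo 5, mambo number 5 => mambo5'.
# Assumption: this only approximates the analyzer, it is not Elasticsearch code.
RULES = {("mambo", "5"): ["mambo5"], ("mambo", "number", "5"): ["mambo5"]}


def tokenize(text):
    """Rough stand-in for the standard tokenizer plus the lowercase filter."""
    return re.findall(r"\w+", text.lower())


def analyze(text):
    """Replace the longest matching rule at each position, else keep the token."""
    tokens, out, i = tokenize(text), [], 0
    while i < len(tokens):
        for length in (3, 2):
            key = tuple(tokens[i:i + length])
            if key in RULES:
                out.extend(RULES[key])
                i += length
                break
        else:
            # No rule starts at this position, so the token passes through.
            out.append(tokens[i])
            i += 1
    return out


indexed = analyze("Mambo number 5")  # what ends up in the index
query = analyze("number 5")          # what the search term becomes
hit = any(term in indexed for term in query)
print(indexed, query, hit)
```

The indexed side collapses to the single token `mambo5`, while the query keeps `number` and `5`, so the two token lists share nothing.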

My thought is to add an additional field:

      search_synonyms: {
        type: 'text',
        analyzer: 'synonym_analyzer'
      },
      search_headline: {
        type: 'text',
        copy_to: "search_synonyms"
      },

This way, the original field is not modified and we will apply synonym analysis on the new search_synonyms field.

My question is, is it possible to "remove" terms that don't match a synonym when indexing into the search_synonyms field?

For example, if I index this document:

    { search_headline: "My favorite song is Mambo number 5" }

I would like the indexed document to look like this:

    {
        search_synonyms: "mambo5", // could be an array if multiple synonyms are found
        search_headline: "My favorite song is Mambo number 5"
    }

That way I could search across both fields -and- the search_synonyms field won't cause so much duplication in the indexed content.
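One way to get that document shape, sketched here as an assumption rather than an Elasticsearch feature, is to compute the synonym-only field in application code before indexing. `SYNONYM_RULES`, `extract_synonyms` and `build_document` are hypothetical names for illustration.

```python
import re

# Hypothetical rule table: each phrase maps to the canonical token to keep.
SYNONYM_RULES = {"mambo number 5": "mambo5", "mambo 5": "mambo5"}


def extract_synonyms(headline):
    """Return the canonical tokens whose phrases occur in the headline."""
    normalized = " ".join(re.findall(r"\w+", headline.lower()))
    found = []
    for phrase, canonical in SYNONYM_RULES.items():
        # Pad with spaces so a phrase only matches on whole-word boundaries.
        if f" {phrase} " in f" {normalized} " and canonical not in found:
            found.append(canonical)
    return found


def build_document(headline):
    """Keep the original headline and add only the matched synonyms."""
    return {
        "search_headline": headline,
        "search_synonyms": extract_synonyms(headline),
    }


doc = build_document("My favorite song is Mambo number 5")
print(doc)
```

The trade-off is that the rule list then lives in the application as well as in the index settings, so the two have to be kept in sync.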

Like I said, my way of thinking may be flawed so hopefully this makes sense.

Hi @hatoms

Maybe you can read this blog post to clear up any confusion about how synonyms work.

Hi,

Is it possible to avoid the synonym expansion at index time and only use synonym expansion at search time? Something like "mambo 5, mambo number 5", as in this example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "synonym_filter"]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "mambo 5, mambo number 5"
            ]
          }
        }
    }
  },
  "mappings": {
    "properties": {
      "title" : {
        "type": "text",
        "analyzer": "standard", 
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}

PUT /my_index/_bulk
{"index" : {}}
{"title" : "mambo 5"}
{"index" : {}}
{"title" : "mambo number 5"}
{"index" : {}}
{"title" : "number 5 lives"}

POST /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "mambo number 5"
    }
  }
}

Searching for "mambo 5" and "mambo number 5" should expand to the other case respectively, so even though you indexed them in different ways (using only "standard" analyzer at index time), both docs should be matched. Or isn't this feasible for some other reason?

Cheers,

This is definitely the approach I'm leaning toward (expansion at search time). I will need to do more testing to validate this, but I don't think match_phrase works for me.

The biggest issue I see here (which you've solved by using match_phrase) shows up with this adjusted synonym rule:

            "synonyms": [
              "mambo, mambo 5, mambo number 5"
            ]

At search time, the query mambo expands to the following terms:

    ["mambo", "number", "5"]

Thus, it would match a document like this:

    { "headline": "5 is the best number"}

The user never typed 5, but multi-word synonym expansion generates the token '5', which causes the query to match unrelated documents.
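The effect can be reproduced with a small Python sketch. It assumes, as a simplification of what is described above, that the expansion flattens every synonym into independent terms and that a document matches if it contains any one of them. `SYNONYM_GROUP`, `expand_flat` and `matches_any_term` are hypothetical names, not Elasticsearch APIs.

```python
import re

# Toy model of the rule 'mambo, mambo 5, mambo number 5'.
SYNONYM_GROUP = ["mambo", "mambo 5", "mambo number 5"]


def tokenize(text):
    """Rough stand-in for the standard tokenizer plus the lowercase filter."""
    return re.findall(r"\w+", text.lower())


def expand_flat(query):
    """Expand a query into a flat set of terms, discarding token positions."""
    terms = set(tokenize(query))
    if " ".join(tokenize(query)) in SYNONYM_GROUP:
        for synonym in SYNONYM_GROUP:
            terms.update(tokenize(synonym))
    return terms


def matches_any_term(query, document):
    """A document matches if it shares at least one term with the expansion."""
    return bool(expand_flat(query) & set(tokenize(document)))


terms = expand_flat("mambo")
unrelated_hit = matches_any_term("mambo", "5 is the best number")
print(sorted(terms), unrelated_hit)
```

Because positions are thrown away, `5` and `number` become standalone terms, and the unrelated headline matches on them alone.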

Maybe this is a better example:

            "synonyms": [
              "blockbuster, movie rental"
            ]
            //..
            headline: {
              type: 'text',
              analyzer: 'standard',
              search_analyzer: 'synonym_analyzer',
            },

Where these documents are indexed:

    { headline: "Blockbuster" }
    { headline: "Spaceballs: The Movie" }  

If you search for:

    multi_match: {
        query: "Blockbuster",
        fields: [ "headline" ]
    }

Then you would get Spaceballs: The Movie as a result, even though it is not relevant and is only included because of the multi-word synonym expansion.

Edit: Also wanted to say thanks, I really appreciate the input :)

Thank you for the link. I based my original proof-of-concept on this article and others, but multi-word synonym expansion is the core issue I'm having trouble understanding.

I'm not sure if I'm overthinking the effect this will have on scoring documents, but I can see a case where the search term is expanded into multiple tokens that recall documents completely unrelated to the initial search.

I've tried to detail my research in the hope that someone has thought through the same scenario.

I should have added that I was using the synonym token filter.

Changing this to synonym_graph excludes those unwanted matches from my search in my test bench. From the brief, unfinished reading I've done, this is because the synonym_graph filter is position-aware.
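A rough Python sketch of what position awareness buys, under the assumption that each multi-word synonym is matched as a whole phrase instead of as loose terms. As far as I understand, this is roughly what the match query does with a graph token stream, but the code below is a toy model with hypothetical names (`SYNONYM_GROUP`, `contains_phrase`, `matches`), not Elasticsearch's implementation.

```python
import re

# Toy model of the rule 'mambo, mambo 5, mambo number 5'.
SYNONYM_GROUP = ["mambo", "mambo 5", "mambo number 5"]


def tokenize(text):
    """Rough stand-in for the standard tokenizer plus the lowercase filter."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively and in order in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def matches(query, document):
    """Match each synonym alternative as a phrase; other queries match any term."""
    doc_tokens = tokenize(document)
    normalized = " ".join(tokenize(query))
    if normalized in SYNONYM_GROUP:
        return any(
            contains_phrase(doc_tokens, tokenize(alternative))
            for alternative in SYNONYM_GROUP
        )
    return any(term in doc_tokens for term in tokenize(query))


unrelated_hit = matches("mambo", "5 is the best number")
related_hit = matches("mambo 5", "Mambo number 5")
print(unrelated_hit, related_hit)
```

Here `5` and `number` only count when they appear next to `mambo` in the right order, so the unrelated headline no longer matches while the real synonyms still do.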

I'll finish reading up on token streams and the synonym_graph filter and get back to this thread.

Using the synonym_graph token filter for search-time expansion works for my use case! Thanks for the input.
