Unable to bypass restriction with synonym token filter

@radoslav.sholev,

Thanks for explaining the use case. I think I understand the issue better now.

I've been looking through some of the discussions from when this change happened (before my time as an employee), and it seems we intended the synonym_graph search-time filter (docs here) to cover some of the cases that the old synonym filter handled before 6.0. See, for example, the discussion here:

I tried it out in Kibana. The setup below doesn't store the tag on the document, but it does let you use the tag in search:

PUT index_4
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "custom_index_analyzer",
        "norms": false,
        "type": "text"
      }
    }
  },
  "settings": {
    "index.number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_index_analyzer": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_lemma"
          ],
          "tokenizer": "standard",
          "type": "custom"
        },
        "custom_search_analyzer": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_wordpack"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {},
      "filter": {
        "custom_wordpack": {
          "type": "synonym_graph",
          "synonyms": [
            "gym, _amenities"
          ]
        },
        "custom_lemma": {
          "type": "lemmagen",
          "lexicon": "en"
        }
      }
    },
    "index.number_of_shards": 1
  }
}

At index time, we use the LemmaGen-based custom_index_analyzer on the text field, so the _amenities tag isn't added to the document.
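
You can double-check this with the _analyze API (this assumes the lemmagen plugin is installed, since the custom_lemma filter depends on it); the output should contain only the lowercased, lemmatized word tokens, with no _amenities token:

POST index_4/_analyze
{
  "analyzer": "custom_index_analyzer",
  "text": "nice chairs at the gym"
}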

However, a search for _amenities using the search-time analyzer will also search for the gym token:

POST index_4/_analyze
{
  "analyzer": "custom_search_analyzer",
  "text": "_amenities"
}

Result:

{
  "tokens" : [
    {
      "token" : "gym",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "_amenities",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
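
To see how that expansion flows into the actual query, you can also run the same match query through the validate API with explain; the explanation it returns should show the gym term alongside _amenities:

GET index_4/_validate/query?explain=true
{
  "query": {
    "match": {
      "text": {
        "query": "_amenities",
        "analyzer": "custom_search_analyzer"
      }
    }
  }
}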

Now if we index one document with "gym" and another with "gyms":

PUT index_4/_doc/1
{
  "text": "nice chairs at the gym"
}

PUT index_4/_doc/2
{
  "text": "there are plenty of gyms in the city"
}

Our query finds and highlights both documents (the LemmaGen filter indexes "gyms" as "gym", so both documents contain the gym token):

GET index_4/_search
{
  "query": {
    "match": {
      "text": {
        "query": "_amenities",
        "analyzer": "custom_search_analyzer"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}

Result:

{
  [...],
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.27884474,
    "hits" : [
      {
        "_index" : "index_4",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.27884474,
        "_source" : {
          "text" : "nice chairs at the gym"
        },
        "highlight" : {
          "text" : [
            "nice chairs at the <em>gym</em>"
          ]
        }
      },
      {
        "_index" : "index_4",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.27884474,
        "_source" : {
          "text" : "there are plenty of gyms in the city"
        },
        "highlight" : {
          "text" : [
            "there are plenty of <em>gyms</em> in the city"
          ]
        }
      }
    ]
  }
}
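
Conversely, if you leave off the analyzer override, the match query is analyzed with the index-time analyzer, the tag never gets expanded, and the same search should come back with no hits; that confirms the tag exists only at query time:

GET index_4/_search
{
  "query": {
    "match": {
      "text": "_amenities"
    }
  }
}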

And here's a stab at a filters aggregation that counts documents with and without the _amenities tag:

GET index_4/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "tag_agg": {
      "filters": {
        "filters": {
          "amenities": {
            "match": {
              "text": {
                "query": "_amenities",
                "analyzer": "custom_search_analyzer"
              }
            }
          },
          "no_amenities": {
            "bool": {
              "must_not": {
                "match": {
                  "text": {
                    "query": "_amenities",
                    "analyzer": "custom_search_analyzer"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Result:

{
  [...],
  "aggregations" : {
    "tag_agg" : {
      "buckets" : {
        "amenities" : {
          "doc_count" : 2
        },
        "no_amenities" : {
          "doc_count" : 0
        }
      }
    }
  }
}

I think this approach gets closer to your use case than the multi-field suggestion does, though I'm not sure it fully solves the problem. Please let me know what you think.

-William