Elasticsearch 5.6.x - Using _version in a function score query


(Nicolas Giraud) #1

I have a use case where I want to find the most popular shingles from a text corpus.

I made a first implementation some time ago in Elasticsearch 2.3.1. I would extract shingles from the text and index them in Elasticsearch, using an MD5 hash of the shingle as the document ID.

Here's the mapping:

{
  "dynamic": "strict",
  "_all": {
    "enabled": false
  },
  "properties": {
    "assoOrigin": {
      "type": "keyword"
    },
    "assoShingle": {
      "type": "keyword",
      "fields": {
        "assoFirst": {
          "type": "text",
          "analyzer": "assoFirst"
        },
        "french": {
          "type": "text",
          "analyzer": "french"
        },
        "simple": {
          "type": "text",
          "analyzer": "simple"
        }
      }
    }
  }
}

Then I would query like this:

POST asso_index/asso/_search
{
  "version": true,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "dis_max": {
                "tie_breaker": 0,
                "queries": [
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "assoShingle.simple": "rock"
                          }
                        }
                      ],
                      "should": [
                        {
                          "match": {
                            "assoShingle.assoFirst": "rock"
                          }
                        }
                      ]
                    }
                  },
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "assoShingle.french": "rock"
                          }
                        }
                      ],
                      "should": [
                        {
                          "match": {
                            "assoShingle.assoFirst": "rock"
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "_version"
          }
        }
      ]
    }
  }
}

This worked pretty well: since no document is ever deleted from the index, _version gives the number of times the same shingle was indexed, which is conveniently the number of occurrences of the shingle in the text.

However, after migrating to Elasticsearch 5.6.6, this no longer works, and I get the following error:

Fielddata is not supported on field [_version] of type [_version]

I don't see any good way of keeping track of the number of occurrences of the shingles in ES, and since the text can be quite big, I'd rather avoid calculating it externally.

I'd appreciate advice on how to work around this issue.

Cheers,

Nicolas


(Mike Sukmanowsky) #2

We're running into the same issue. We were using _version as a cheap proxy for popularity in v1.7. On 6.1.3, we're no longer able to use _version in a function score query.


(Nicolas Giraud) #3

One workaround I can think of is to stop deduplicating documents and use a terms aggregation instead. It's a lot less elegant, though, and will consume more disk space.

Alternatively, post-processing the generated index by copying the version into a document field could do the trick.

I'm investigating both solutions.
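
For the terms aggregation option, a rough sketch of the query could look like the one below. It assumes each occurrence of a shingle is indexed as its own document (auto-generated IDs rather than the MD5 hash), and the aggregation name popular_shingles is just a placeholder. Buckets come back ordered by document count, so the most frequent shingles are listed first, but the count is no longer part of the relevance score.

# Assumes one document per shingle occurrence; "popular_shingles" is a made-up name
POST asso_index/asso/_search
{
  "size": 0,
  "query": {
    "match": {
      "assoShingle.simple": "rock"
    }
  },
  "aggs": {
    "popular_shingles": {
      "terms": {
        "field": "assoShingle",
        "size": 10
      }
    }
  }
}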


(Nicolas Giraud) #4

The simplest fix is to maintain an occurrence counter yourself with an update script.
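
Here is a minimal sketch of what that could look like, not a tested setup. The field name assoCount is hypothetical (it is not in the mapping above), and <md5-of-shingle> stands for the document ID. Since the mapping is "dynamic": "strict", the counter field has to be declared first:

# "assoCount" is a hypothetical field name
PUT asso_index/_mapping/asso
{
  "properties": {
    "assoCount": {
      "type": "integer"
    }
  }
}

Each shingle is then sent as an update with an upsert: the first time the ID is seen, the upsert document is indexed with a count of 1; on later occurrences the script increments the counter. On 5.x the script body key is "inline"; 6.x names it "source".

# Replace <md5-of-shingle> with the MD5 hash used as document ID
POST asso_index/asso/<md5-of-shingle>/_update
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.assoCount += params.inc",
    "params": {
      "inc": 1
    }
  },
  "upsert": {
    "assoShingle": "rock music",
    "assoCount": 1
  }
}

In the function score query, the functions section would then point at the counter instead of _version:

"functions": [
  {
    "field_value_factor": {
      "field": "assoCount",
      "missing": 1
    }
  }
]

The "missing" value is only there as a fallback for documents indexed before the counter existed.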

