Using a Groovy script within function_score to access field payloads


(Tomer Praizler) #1

I am trying to understand the best way to access a payload from a Groovy script.

I want to do something like this:

sum = 0;
categories  = doc['category'].values;
for (category in categories){
   sum += category.payload
}
return sum;

Based on the Elasticsearch documentation I can do this with _index, but that is not what I am looking for: _index gives access to statistics in the scope of the index, not of a specific document.

I want to go over every document, take its payload, and multiply it by some constant.

When doing this: _index['category'].get('term', _PAYLOADS) I will get a list of all payloads of "term", which is not what I am looking for.

Is there a way to access a field payload from the scope of a document?


(Britta Weber) #2

Payloads are per occurrence of a term in a document, so one term can have several payloads in one document. _index['category'].get('term', _PAYLOADS) will give you an iterator over the payloads, and it should have as many elements as there are occurrences of this term in the document. This will always be an iterator, even if the term occurs only once.
What do you mean by 'When doing this: _index['category'].get('term', _PAYLOADS) I will get a list of all payloads of "term"'? Do you get back more than expected?
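
To make the per-occurrence behaviour concrete, here is a minimal sketch of the loop from the question written against that iterator. It assumes the payloads were indexed as floats, and 'category' and 'term' are just the placeholder field and term names used above. It is meant to run as an Elasticsearch script for one document at a time, not as standalone Groovy.

```groovy
// Sketch only: runs inside Elasticsearch, where _index and _PAYLOADS are provided.
// 'category' and 'term' are placeholder names taken from this thread.
total = 0;
// Each element is one occurrence (position) of 'term' in the current document.
for (pos in _index['category'].get('term', _PAYLOADS)) {
    // payloadAsFloat(0) decodes the payload as a float;
    // the argument 0 is the value used when an occurrence has no payload.
    total += pos.payloadAsFloat(0);
}
return total;
```

If the term occurs once in the document the loop body runs once, and if it does not occur at all the script returns 0.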


(Tomer Praizler) #3

Oh! So maybe I got this wrong.
Let me try to put it into an example.

If I have the following 3 documents:

doc1: 
{
   "id": 1,
   "categories": ["1000|0.1","1001|0.2"]
}

doc2:
{
   "id": 2,
   "categories": ["1000|0.6","1001|0.7"]
}

doc3:
{
   "id": 3,
   "categories": ["1000|0.4","1001|0.5"]
}

If I do this:

_index['categories'].get('1000', _PAYLOADS)

What I understood is that I would get an iterator over:

[0.1,0.6,0.4] 

3 times, once for each doc.
And not an iterator over:

[0.1] (in case of doc1)
[0.6] (in case of doc2)
[0.4] (in case of doc3)

Is that correct, or did I get it wrong?

Thanks!


(Britta Weber) #4

You will get one iterator per doc, each containing only the payloads for that document. There is actually no way to get the payloads for all documents at the same time.

Here is an example:

DELETE testidx

PUT testidx
{
  "mappings": {
    "doc": {
      "properties": {
        "categories": {
          "type": "string",
          "analyzer": "payload"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "delimited_payload_filter"
          ]
        }
      }
    }
  }
}

POST testidx/doc/
{
   "id": 1,
   "categories": ["1000|0.1","1001|0.2"]
}

POST testidx/doc/
{
   "id": 2,
   "categories": ["1000|0.2","1001|0.2"]
}

POST testidx/doc/
{
   "id": 3,
   "categories": ["1000|0.3","1001|0.2"]
}

GET testidx/doc/_search
{
  "fields": [
    "_source"
  ],
  "script_fields": {
    "payloads": {
      "script": "payloads = []; positions = _index['categories'].get('1000', _PAYLOADS); for(pos in positions){payloads.add(pos.payloadAsFloat(0))}; payloads"
    }
  }
}

yields:

"hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "testidx",
            "_type": "doc",
            "_id": "AU-sK5gxjoOwWjOATroI",
            "_score": 1,
            "_source": {
               "id": 3,
               "categories": [
                  "1000|0.3",
                  "1001|0.2"
               ]
            },
            "fields": {
               "payloads": [
                  [
                     0.3
                  ]
               ]
            }
         },
         {
            "_index": "testidx",
            "_type": "doc",
            "_id": "AU-sJk0KjoOwWjOATrk2",
            "_score": 1,
            "_source": {
               "id": 1,
               "categories": [
                  "1000|0.1",
                  "1001|0.2"
               ]
            },
            "fields": {
               "payloads": [
                  [
                     0.1
                  ]
               ]
            }
         },
         {
            "_index": "testidx",
            "_type": "doc",
            "_id": "AU-sK6BajoOwWjOATroJ",
            "_score": 1,
            "_source": {
               "id": 2,
               "categories": [
                  "1000|0.2",
                  "1001|0.2"
               ]
            },
            "fields": {
               "payloads": [
                  [
                     0.2
                  ]
               ]
            }
         }
      ]
   }
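
The same per-document iterator can be used for scoring, which is what the thread title asks about. The request body below is a rough sketch rather than a tested recipe. It assumes the testidx index from the example above, that inline Groovy scripting is enabled on the cluster, and the older "script"/"params" request form that matches the Elasticsearch version used in this thread. The parameter name factor is hypothetical. It could be sent to GET testidx/doc/_search.

```json
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "lang": "groovy",
        "params": { "factor": 2 },
        "script": "/* factor is a hypothetical parameter; 0 is the fallback for occurrences without a payload */ total = 0; for (pos in _index['categories'].get('1000', _PAYLOADS)) { total += pos.payloadAsFloat(0) }; total * factor"
      },
      "boost_mode": "replace"
    }
  }
}
```

With "boost_mode": "replace" the script result becomes the document score, so documents with a larger payload for '1000' should rank higher, and documents that do not contain the term should score 0.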

(Tomer Praizler) #5

This is awesome!! thanks!!!

Do you think it makes sense to use _index to access payloads at scale?
That is, I want to query my index with a Groovy script that uses _index on every query.
From what I have read, _index is not very performant.

So is it safe to use _index, or is there a more performant option?

And thanks again!!

