Custom similarity is really slow, even when very simple (using doc.freq)

JF2018 · May 25, 2021, 1:09pm

Hi

I am trying to use custom similarities.
I want a very simple similarity where the score is just doc.freq. As I understand it, this value is stored in the index statistics, so it should be very fast.

With ~30 million docs, of which ~3 million match, the request takes 1-3 seconds.
I don't understand why this is so slow, since in my mind the similarity function is as simple as, if not simpler than, e.g. the default built-in BM25 similarity.

I hope someone can help me understand this problem better, as I feel I am missing something about how this works.

Thanks in advance
Jens

MAPPINGS

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "field_1": {
        "similarity": "custom_similarity",
        "type": "text"
      },
      "field_2": {
        "similarity": "custom_similarity",
        "type": "text"
      },
      "field_3": {
        "similarity": "custom_similarity",
        "type": "text"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "0",
      "number_of_shards": "12",
      "refresh_interval": "30s",
      "similarity": {
        "custom_similarity": {
          "script": {
            "source": "return doc.freq;"
          },
          "type": "scripted"
        }
      }
    }
  }
}

QUERY

{
    "from": 0,
    "size": 15,
    "query": {
        "bool": {
            "minimum_should_match": 2,
            "should": [
                {
                    "bool": {
                        "should": [
                            {
                                "term": {
                                    "field_1": {
                                        "value": "592705521550"
                                    }
                                }
                            },
                            {
                                "term": {
                                    "field_2": {
                                        "value": "592705521550"
                                    }
                                }
                            },
                            {
                                "term": {
                                    "field_3": {
                                        "value": "592705521550"
                                    }
                                }
                            }
                        ]
                    }
                },
                {
                    "bool": {
                        "should": [
                            {
                                "term": {
                                    "field_1": {
                                        "value": "618475336552"
                                    }
                                }
                            },
                            {
                                "term": {
                                    "field_2": {
                                        "value": "618475336552"
                                    }
                                }
                            },
                            {
                                "term": {
                                    "field_3": {
                                        "value": "618475336552"
                                    }
                                }
                            }
                        ]
                    }
                },
                ... in total 15 of these should clauses with 3 term clauses in them
                        ]
                    }
                }
            ]
        }
    },
    "_source": false,
    "track_total_hits": 2147483647
}
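
For readers who want to reproduce the query shape without writing 45 term clauses by hand, here is a minimal sketch in Python that builds the same body. The helper name `build_query` is hypothetical, and only the two values shown in the post are used; the remaining 13 values are elided in the original and are not filled in here.

```python
import json

def build_query(values, fields=("field_1", "field_2", "field_3"), size=15):
    """Build a bool query with one should-group per value.

    Each group holds one term clause per field, mirroring the query
    in the post. build_query is a hypothetical helper, not part of
    any Elasticsearch client.
    """
    groups = [
        {"bool": {"should": [{"term": {field: {"value": value}}} for field in fields]}}
        for value in values
    ]
    return {
        "from": 0,
        "size": size,
        "query": {"bool": {"minimum_should_match": 2, "should": groups}},
        "_source": False,
        "track_total_hits": 2147483647,
    }

# Only the two values visible in the post; the original query has 15.
values = ["592705521550", "618475336552"]
body = build_query(values)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the search request body with whatever HTTP or Elasticsearch client is in use.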

RESPONSE

{
    "took": 1665,
    "timed_out": false,
...

spinscale · May 31, 2021, 8:48am

What you are seeing here is the overhead of running a script for 3 million hits, which results in three million executions. The overhead is not in reading doc.freq from the Lucene index or passing it to the script, but in starting and executing the script, even though the compiled script is cached.
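
To put the numbers from this thread in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the `took` value from the response and the approximate hit count from the question. It assumes exactly one script execution per matching document, which is a simplification: the script may run more than once for a document that matches several term clauses, and the work is spread over 12 shards, so treat the result as an aggregate figure rather than the latency of a single script call.

```python
# Figures taken from the thread; the hit count is approximate ("~3mill").
took_ms = 1665             # "took" from the response
matching_docs = 3_000_000  # approximate number of matching documents

# Aggregate wall-clock time per matching document, in microseconds.
# Assumes one script execution per hit, which is a simplification.
per_hit_us = took_ms * 1000 / matching_docs
print(f"{per_hit_us:.3f} microseconds per matching document")
```

Even a per-call cost well under a microsecond adds up to seconds at this hit count, which is consistent with the explanation above.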

system · June 28, 2021, 8:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
