Combined index like in a database (RDBMS)?

ginger · July 15, 2017, 12:26pm

Hi.

I am testing ES for analyzing user data. I loaded one day's data for 1 million users into ES. The common query pattern uses a date histogram aggregation to analyze a single user's data. I ran into a performance problem with the following query:

GET /test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "user_id": [
              "1095620139"
            ]
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2017-07-13T03:00:00Z",
              "lt": "2017-07-13T04:00:00Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m",
        "format": "yyyy-MM-dd HH:mm"
      },
      "aggs": {
        "max_of_field": {
          "max": {
            "field": "counter"
          }
        }
      }
    }
  }
}

It costs 200ms!

But when I remove the time range filter, as follows,

GET /test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "user_id": [
              "1095620139"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m",
        "format": "yyyy-MM-dd HH:mm"
      },
      "aggs": {
        "max_of_field": {
          "max": {
            "field": "counter"
          }
        }
      }
    }
  }
}

It only costs 20ms while processing a whole day's data!

As I understand it, Lucene's AND processing fetches the documents that match the user_id and the documents that match the time range separately, then intersects the two sets. I think the performance problem is caused by the number of documents that match the time range.

How can I solve this performance problem? Any help would be appreciated!
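
The intersection step described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Lucene's actual code: the doc ID lists are hypothetical, and `advance` and `leapfrog_intersection` are illustrative names. It shows two sorted doc ID lists being intersected by skipping ahead in whichever list is behind, so the work is driven mostly by the shorter list.

```python
from bisect import bisect_left
from typing import List


def advance(postings: List[int], start: int, target: int) -> int:
    """Return the index of the first doc ID >= target, searching from start."""
    return bisect_left(postings, target, start)


def leapfrog_intersection(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted doc ID lists by leapfrogging between them."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # List a is behind: skip it forward to b's current doc ID.
            i = advance(a, i, b[j])
        else:
            # List b is behind: skip it forward to a's current doc ID.
            j = advance(b, j, a[i])
    return result


# Hypothetical doc IDs: few matches for one user, many for the time range.
user_docs = [3, 40, 41, 977]
range_docs = list(range(30, 1000))
print(leapfrog_intersection(user_docs, range_docs))  # [40, 41, 977]
```

Note that this sketch assumes both lists already exist. Materializing the long list for the time range is the part suspected of being expensive here.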

ginger · July 17, 2017, 2:47am

Any help would be appreciated!

ginger · July 18, 2017, 6:31am

I have tested ES 5.0 and ES 6.0 alpha2 with index sorting, but performance did not improve.

Mikhail_Khludnev · July 18, 2017, 11:06am

Can you check it with "profile"?

ginger · July 19, 2017, 3:27am

"profile": {
    "shards": [
      {
        "id": "[8TfRDHgvS5i1U_702Td5rg][test][5]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "#ConstantScore(user_id:user_490) #timestamp:[1500080755000 TO 1500084354999]",
                "time_in_nanos": 378586448,
                "breakdown": {
                  "score": 0,
                  "build_scorer_count": 28,
                  "match_count": 0,
                  "create_weight": 43880,
                  "next_doc": 63718,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 380,
                  "score_count": 0,
                  "build_scorer": 378478441,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "ConstantScoreQuery",
                    "description": "ConstantScore(user_id:user_490)",
                    "time_in_nanos": 406738,
                    "breakdown": {
                      "score": 0,
                      "build_scorer_count": 28,
                      "match_count": 0,
                      "create_weight": 12584,
                      "next_doc": 69098,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 362,
                      "score_count": 0,
                      "build_scorer": 234086,
                      "advance": 90558,
                      "advance_count": 21
                    },
                    "children": [
                      {
                        "type": "TermQuery",
                        "description": "user_id:user_490",
                        "time_in_nanos": 334375,
                        "breakdown": {
                          "score": 0,
                          "build_scorer_count": 28,
                          "match_count": 0,
                          "create_weight": 5813,
                          "next_doc": 35608,
                          "match": 0,
                          "create_weight_count": 1,
                          "next_doc_count": 362,
                          "score_count": 0,
                          "build_scorer": 204156,
                          "advance": 88386,
                          "advance_count": 21
                        }
                      }
                    ]
                  },
                  {
                    "type": "IndexOrDocValuesQuery",
                    "description": "timestamp:[1500080755000 TO 1500084354999]",
                    "time_in_nanos": 377703084,
                    "breakdown": {
                      "score": 0,
                      "build_scorer_count": 21,
                      "match_count": 0,
                      "create_weight": 3532,
                      "next_doc": 849,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 18,
                      "score_count": 0,
                      "build_scorer": 377437519,
                      "advance": 260781,
                      "advance_count": 363
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 57006,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time_in_nanos": 777890,
                "children": [
                  {
                    "name": "MultiCollector",
                    "reason": "search_multi",
                    "time_in_nanos": 745265,
                    "children": [
                      {
                        "name": "TotalHitCountCollector",
                        "reason": "search_count",
                        "time_in_nanos": 22781
                      },
                      {
                        "name": "ProfilingAggregator: [org.elasticsearch.search.profile.aggregation.ProfilingAggregator@68640975]",
                        "reason": "aggregation",
                        "time_in_nanos": 659802
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": [
          {
            "type": "org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregator",
            "description": "result2",
            "time_in_nanos": 491939,
            "breakdown": {
              "reduce": 0,
              "build_aggregation": 76644,
              "build_aggregation_count": 1,
              "initialize": 16863,
              "initialize_count": 1,
              "reduce_count": 0,
              "collect": 398070,
              "collect_count": 360
            },
            "children": [
              {
                "type": "org.elasticsearch.search.aggregations.metrics.max.MaxAggregator",
                "description": "max_of_field",
                "time_in_nanos": 122156,
                "breakdown": {
                  "reduce": 0,
                  "build_aggregation": 9376,
                  "build_aggregation_count": 60,
                  "initialize": 2935,
                  "initialize_count": 1,
                  "reduce_count": 0,
                  "collect": 109424,
                  "collect_count": 360
                }
              }
            ]
          }
        ]
      }
    ]
  }

It seems most of the time is spent in build_scorer! That's a little strange for a filter query.
The ES version is 6.0.0-alpha2.
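
One way to read a profile like this without scanning it by eye is to walk the query tree and report the largest timer in each node. The sketch below is only an illustration: `walk`, `dominant_timer`, and `summarize` are hypothetical helper names, and `profile_root` is a trimmed copy of the numbers posted above (the nested TermQuery node and most `*_count` fields are left out).

```python
from typing import Dict, Iterator, List, Tuple


def walk(node: Dict, depth: int = 0) -> Iterator[Tuple[int, Dict]]:
    """Yield (depth, node) for a profile node and all of its children."""
    yield depth, node
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


def dominant_timer(breakdown: Dict[str, int]) -> Tuple[str, int]:
    """Return the timer (ignoring *_count fields) with the most nanoseconds."""
    timers = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
    name = max(timers, key=timers.get)
    return name, timers[name]


def summarize(root: Dict) -> List[str]:
    """Build one indented summary line per node in the profile tree."""
    lines = []
    for depth, node in walk(root):
        name, nanos = dominant_timer(node["breakdown"])
        share = nanos / node["time_in_nanos"]
        lines.append(f"{'  ' * depth}{node['type']}: {name} = {nanos} ns "
                     f"({share:.1%} of node time)")
    return lines


# Trimmed copy of the profile output posted above.
profile_root = {
    "type": "BooleanQuery",
    "time_in_nanos": 378586448,
    "breakdown": {"create_weight": 43880, "build_scorer": 378478441,
                  "next_doc": 63718, "advance": 0, "build_scorer_count": 28},
    "children": [
        {"type": "ConstantScoreQuery", "time_in_nanos": 406738,
         "breakdown": {"create_weight": 12584, "build_scorer": 234086,
                       "next_doc": 69098, "advance": 90558}},
        {"type": "IndexOrDocValuesQuery", "time_in_nanos": 377703084,
         "breakdown": {"create_weight": 3532, "build_scorer": 377437519,
                       "next_doc": 849, "advance": 260781}},
    ],
}

print("\n".join(summarize(profile_root)))
```

For these numbers, the summary attributes nearly all of the BooleanQuery time to build_scorer in the IndexOrDocValuesQuery node.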

Mikhail_Khludnev · July 19, 2017, 9:36am

What's the mapping for timestamp? Is it possible to lose precision, say by rounding it to seconds or minutes?
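
As a minimal sketch of what rounding would mean in practice, assuming timestamps are handled as epoch milliseconds before indexing, the value can be truncated to a coarser unit. The helper name `round_down` and the sample value are hypothetical.

```python
SECOND_MS = 1_000
MINUTE_MS = 60_000


def round_down(epoch_millis: int, unit_millis: int) -> int:
    """Truncate an epoch-millisecond timestamp to a multiple of unit_millis."""
    return epoch_millis - (epoch_millis % unit_millis)


# Hypothetical sample value with millisecond precision.
ts = 1500080755123
print(round_down(ts, SECOND_MS))  # 1500080755000
print(round_down(ts, MINUTE_MS))  # 1500080700000
```

Rounding this way leaves fewer distinct timestamp values in the index. Whether that helps this particular query is something to measure rather than assume.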

ginger · July 19, 2017, 9:41am

The mapping for timestamp:

"timestamp": {
  "type": "date"
}

I will test it as described in the following blog post. Since I am considering ES 6.0, I think the time precision may affect the performance.

ginger · July 20, 2017, 10:45am

I have tested reduced time precision, and performance is still poor.

Any help would be appreciated!

jpountz · July 21, 2017, 9:45am

The build_scorer cost is a bug caused by elastic/elasticsearch pull request #25108 ("Make sure range queries are correctly profiled", by jpountz). The profiler disables an optimization, which forces it to find all matches for the range in build_scorer, even though in practice we would use doc-values to run the range, so building the scorer would be almost free.

It only costs 20ms while processing a whole day's data!

The profile suggests that the term query that you run has few matches, maybe around 400. So it is very fast anyway and the term query spends more time locating the term in the terms dictionary than actually iterating over matches.

It looks like the range query on the date field matches significantly more documents than the term query, so Elasticsearch's best guess is to iterate over the matches of the term query and check whether the date range matches, which incurs overhead compared to just iterating over the matches of the term query.
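
The strategy described above can be modeled roughly as follows. This is a simplified sketch, not Elasticsearch or Lucene code: `verify_with_doc_values` is a hypothetical name, the per-document `timestamps` dict stands in for doc-values, and the doc IDs are made up. The range bounds are the ones shown in the profile description.

```python
from typing import Dict, List


def verify_with_doc_values(lead_docs: List[int], timestamps: Dict[int, int],
                           gte: int, lte: int) -> List[int]:
    """Keep only the lead documents whose timestamp lies in [gte, lte]."""
    matches = []
    for doc in lead_docs:  # one lookup per term match, not per range match
        if gte <= timestamps[doc] <= lte:
            matches.append(doc)
    return matches


# Hypothetical doc IDs matching the term query, with per-document timestamps.
term_matches = [7, 19, 23, 42]
timestamps = {
    7: 1500080754000,   # just before the range
    19: 1500080755000,  # lower bound, inclusive
    23: 1500082000000,  # inside the range
    42: 1500084355000,  # just after the range
}

print(verify_with_doc_values(term_matches, timestamps,
                             1500080755000, 1500084354999))  # [19, 23]
```

The extra cost compared to the query without the range filter is the per-document timestamp lookup and comparison, which scales with the number of term matches rather than with the number of documents in the time range.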

A 10x factor is a bit more than I would have expected. Is it reproducible consistently or do you have a lot of variance in the response times?

Mikhail_Khludnev · July 22, 2017, 8:20pm

ginger, it's not clear whether you reindexed after changing the precision in the mapping.

ginger · July 26, 2017, 1:22am

Yes, I have reloaded the data.

ginger · July 26, 2017, 1:34am

Yes, I can reproduce it consistently. It's a simple and common scenario: the data has a time field and an id field with high cardinality.

BTW, before each query I execute /_cache/clear.

system · August 23, 2017, 1:34am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
