Combined index like in a database (RDBMS)?


(jiangguoqiang) #1

Hi.

I am testing ES for analyzing user data. I loaded one day's data for 1 million users into ES. The common query pattern is a date histogram aggregation over a single user's data. I ran into a performance problem with the following query:

GET /test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "user_id": [
              "1095620139"
            ]
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2017-07-13T03:00:00Z",
              "lt": "2017-07-13T04:00:00Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m",
        "format": "yyyy-MM-dd HH:mm"
      },
      "aggs": {
        "max_of_field": {
          "max": {
            "field": "counter"
          }
        }
      }
    }
  }
}

It costs 200ms!

But when I remove the time range filter, as follows,

GET /test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "user_id": [
              "1095620139"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "result": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m",
        "format": "yyyy-MM-dd HH:mm"
      },
      "aggs": {
        "max_of_field": {
          "max": {
            "field": "counter"
          }
        }
      }
    }
  }
}

It only costs 20ms while processing a whole day's data!

I know how Lucene implements AND: it fetches the documents that match the user_id and the documents that match the time range separately, then performs a set intersection. I think the performance problem is caused by the number of documents that match the time range.

How can I solve this performance problem? Any help would be appreciated!


(jiangguoqiang) #2

Any help would be appreciated!


(jiangguoqiang) #3

I have tested ES 5.0 and ES 6.0 alpha2 with index sorting, but it did not get any better.
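For readers who want to repeat this test: index sorting (available from ES 6.0) is configured in the index settings at creation time. The thread does not show the settings that were actually used, so the following is only a sketch. It assumes user_id is a keyword field with doc values and that the sort order is user_id first, then timestamp; the helper name is hypothetical.

```python
import json


def index_sort_settings(fields, orders):
    """Build an index-creation settings body that enables index sorting."""
    if len(fields) != len(orders):
        raise ValueError("each sort field needs exactly one sort order")
    return {
        "settings": {
            "index": {
                "sort.field": list(fields),
                "sort.order": list(orders),
            }
        }
    }


# Assumed layout: all documents of one user adjacent, ordered by time.
sort_body = index_sort_settings(["user_id", "timestamp"], ["asc", "asc"])
print(json.dumps(sort_body, indent=2))
```

The body would be sent when the index is created, together with mappings for the sort fields, since the sort cannot be changed on an existing index.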


(Mikhail Khludnev) #4

Can you check it with "profile"?
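Profiling is switched on by adding "profile": true at the top level of the search request body. A minimal sketch that does this for the slow query from the first post (the helper name with_profile is hypothetical, and the aggregation part is omitted for brevity):

```python
import copy


def with_profile(search_body):
    """Return a copy of a search request body with profiling enabled."""
    profiled = copy.deepcopy(search_body)
    profiled["profile"] = True
    return profiled


# The filter part of the slow query from post #1.
slow_query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"terms": {"user_id": ["1095620139"]}},
                {
                    "range": {
                        "timestamp": {
                            "gte": "2017-07-13T03:00:00Z",
                            "lt": "2017-07-13T04:00:00Z",
                        }
                    }
                },
            ]
        }
    },
}

profiled_query = with_profile(slow_query)
```

The response then contains a "profile" section with per-shard timings for each query and aggregation.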


(jiangguoqiang) #5

"profile": {
    "shards": [
      {
        "id": "[8TfRDHgvS5i1U_702Td5rg][test][5]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "#ConstantScore(user_id:user_490) #timestamp:[1500080755000 TO 1500084354999]",
                "time_in_nanos": 378586448,
                "breakdown": {
                  "score": 0,
                  "build_scorer_count": 28,
                  "match_count": 0,
                  "create_weight": 43880,
                  "next_doc": 63718,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 380,
                  "score_count": 0,
                  "build_scorer": 378478441,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "ConstantScoreQuery",
                    "description": "ConstantScore(user_id:user_490)",
                    "time_in_nanos": 406738,
                    "breakdown": {
                      "score": 0,
                      "build_scorer_count": 28,
                      "match_count": 0,
                      "create_weight": 12584,
                      "next_doc": 69098,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 362,
                      "score_count": 0,
                      "build_scorer": 234086,
                      "advance": 90558,
                      "advance_count": 21
                    },
                    "children": [
                      {
                        "type": "TermQuery",
                        "description": "user_id:user_490",
                        "time_in_nanos": 334375,
                        "breakdown": {
                          "score": 0,
                          "build_scorer_count": 28,
                          "match_count": 0,
                          "create_weight": 5813,
                          "next_doc": 35608,
                          "match": 0,
                          "create_weight_count": 1,
                          "next_doc_count": 362,
                          "score_count": 0,
                          "build_scorer": 204156,
                          "advance": 88386,
                          "advance_count": 21
                        }
                      }
                    ]
                  },
                  {
                    "type": "IndexOrDocValuesQuery",
                    "description": "timestamp:[1500080755000 TO 1500084354999]",
                    "time_in_nanos": 377703084,
                    "breakdown": {
                      "score": 0,
                      "build_scorer_count": 21,
                      "match_count": 0,
                      "create_weight": 3532,
                      "next_doc": 849,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 18,
                      "score_count": 0,
                      "build_scorer": 377437519,
                      "advance": 260781,
                      "advance_count": 363
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 57006,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time_in_nanos": 777890,
                "children": [
                  {
                    "name": "MultiCollector",
                    "reason": "search_multi",
                    "time_in_nanos": 745265,
                    "children": [
                      {
                        "name": "TotalHitCountCollector",
                        "reason": "search_count",
                        "time_in_nanos": 22781
                      },
                      {
                        "name": "ProfilingAggregator: [org.elasticsearch.search.profile.aggregation.ProfilingAggregator@68640975]",
                        "reason": "aggregation",
                        "time_in_nanos": 659802
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": [
          {
            "type": "org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregator",
            "description": "result2",
            "time_in_nanos": 491939,
            "breakdown": {
              "reduce": 0,
              "build_aggregation": 76644,
              "build_aggregation_count": 1,
              "initialize": 16863,
              "initialize_count": 1,
              "reduce_count": 0,
              "collect": 398070,
              "collect_count": 360
            },
            "children": [
              {
                "type": "org.elasticsearch.search.aggregations.metrics.max.MaxAggregator",
                "description": "max_of_field",
                "time_in_nanos": 122156,
                "breakdown": {
                  "reduce": 0,
                  "build_aggregation": 9376,
                  "build_aggregation_count": 60,
                  "initialize": 2935,
                  "initialize_count": 1,
                  "reduce_count": 0,
                  "collect": 109424,
                  "collect_count": 360
                }
              }
            ]
          }
        ]
      }
    ]
  }

It seems most of the time is spent in build_scorer! That's a little strange for a filter query.
The ES version is 6.0.0-alpha2.
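To read a profile like this without scanning it by eye, one can walk the query tree and pick the most expensive timing of each node. A small sketch using the timings copied from the output above (the *_count fields are left out; the function names are hypothetical):

```python
def flatten_profile(node, depth=0):
    """Yield (depth, type, breakdown) for a profiled query node and its children."""
    yield depth, node["type"], node["breakdown"]
    for child in node.get("children", []):
        yield from flatten_profile(child, depth + 1)


def dominant_phase(breakdown):
    """Return (phase, nanos) of the largest timing, ignoring *_count entries."""
    timings = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
    phase = max(timings, key=timings.get)
    return phase, timings[phase]


# Timings (in nanoseconds) copied from the profile above.
profile_query = {
    "type": "BooleanQuery",
    "breakdown": {"create_weight": 43880, "next_doc": 63718,
                  "build_scorer": 378478441, "advance": 0},
    "children": [
        {
            "type": "ConstantScoreQuery",
            "breakdown": {"create_weight": 12584, "next_doc": 69098,
                          "build_scorer": 234086, "advance": 90558},
            "children": [
                {
                    "type": "TermQuery",
                    "breakdown": {"create_weight": 5813, "next_doc": 35608,
                                  "build_scorer": 204156, "advance": 88386},
                }
            ],
        },
        {
            "type": "IndexOrDocValuesQuery",
            "breakdown": {"create_weight": 3532, "next_doc": 849,
                          "build_scorer": 377437519, "advance": 260781},
        },
    ],
}

summary = []
for depth, query_type, breakdown in flatten_profile(profile_query):
    phase, nanos = dominant_phase(breakdown)
    summary.append((query_type, phase, nanos))
    print("  " * depth + f"{query_type}: {phase} = {nanos / 1e6:.2f} ms")
```

For this profile, the range clause (IndexOrDocValuesQuery) accounts for nearly all of the build_scorer time of the enclosing BooleanQuery.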


(Mikhail Khludnev) #6

What's the mapping for timestamp? Is it possible to lose some precision, for example by rounding it to seconds or minutes?
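Rounding of this kind would have to be applied to the values before they are indexed. A minimal sketch, assuming the timestamps are epoch milliseconds (the form in which they appear in the profile output); the helper name and constants are hypothetical:

```python
SECOND_MS = 1_000
MINUTE_MS = 60 * SECOND_MS


def round_down_millis(epoch_millis, unit_millis):
    """Floor an epoch-millisecond timestamp to a multiple of unit_millis."""
    return epoch_millis - (epoch_millis % unit_millis)


# Lower bound of the range in the profiled query, floored to the minute.
rounded = round_down_millis(1500080755000, MINUTE_MS)
```

Coarser values mean fewer distinct timestamps in the index, at the cost of losing the sub-minute (or sub-second) ordering of events.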


(jiangguoqiang) #7

The mapping for timestamp:

"timestamp": {
  "type": "date"
}

I will test it as the following blog describes. Considering that I am using ES 6.0, I think the time precision affects the performance.


(jiangguoqiang) #8

I have tested with reduced time precision, and the performance is still poor.

Any help would be appreciated!


(Adrien Grand) #9

The build_scorer cost is a bug caused by https://github.com/elastic/elasticsearch/pull/25108. The profiler disables an optimization, which forces the range query to find all of its matches in build_scorer. In practice we would use doc values to run the range, so building the scorer would be almost free.

"It only costs 20ms while processing a whole day's data!"

The profile suggests that the term query that you run has few matches, maybe around 400. So it is very fast anyway and the term query spends more time locating the term in the terms dictionary than actually iterating over matches.

It looks like the range query on the date field matches significantly more documents than the term query, so Elasticsearch's best guess is to iterate over the matches of the term query and check whether the date range matches, which incurs overhead compared to just iterating over the matches of the term query.
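The two execution strategies can be illustrated on toy data. This is only a simplified sketch of the idea, not Lucene's actual implementation: doc ids are plain integers, a Python list stands in for the doc-values column, and all names are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two sorted doc-id lists; both sides must be fully materialized."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def verify_per_doc(lead_docs, doc_value, lo, hi):
    """Iterate the sparse clause and check the range per document via a column lookup."""
    return [doc for doc in lead_docs if lo <= doc_value[doc] < hi]


# Toy index: 1000 docs, timestamp column indexed by doc id.
timestamps = [doc * 10 for doc in range(1000)]
lo, hi = 2000, 8000

# Sparse clause: the few docs that belong to one user.
term_docs = [5, 250, 500, 799, 800, 950]

# Strategy 1: materialize every range match (600 docs here), then intersect.
range_docs = [doc for doc, ts in enumerate(timestamps) if lo <= ts < hi]
by_intersection = intersect_sorted(term_docs, range_docs)

# Strategy 2: lead with the term matches and do one lookup per candidate.
by_verification = verify_per_doc(term_docs, timestamps, lo, hi)
```

Both strategies return the same documents. The second one does work proportional to the number of term matches rather than the number of range matches, but each candidate still costs a lookup, which is the overhead relative to the query without the range filter.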

A 10x factor is a bit more than I would have expected. Is it reproducible consistently or do you have a lot of variance in the response times?
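One way to answer this is to repeat each query a number of times and compare the spread of latencies rather than single measurements. A rough sketch, where run_query and before_each are hypothetical callables that the caller supplies (for example, a function that sends the search request and one that clears caches):

```python
import statistics
import time


def measure_latencies(run_query, runs=20, before_each=None):
    """Call run_query `runs` times and return wall-clock latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        if before_each is not None:
            before_each()  # setup step, not included in the timing
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Return min, median, max and sample standard deviation of the latencies."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
        "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
    }
```

Wall-clock time measured on the client includes network and serialization overhead, so the "took" value in each search response is a useful cross-check.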


(Mikhail Khludnev) #10

ginger, it's not clear whether you reindexed after changing the precision in the mapping.


(jiangguoqiang) #11

Yes, I have reloaded the data.


(jiangguoqiang) #12

Yes, I can reproduce it consistently. It's just a simple and common scenario: the data has a time field and an id field, and the id field has high cardinality.

BTW, before each query, I execute /_cache/clear.