Help optimizing an aggregation request with fielddata set to true

Hello,

I have an index with a field mapped as:

"concatGuid": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            },
            "analyzer": "ConcatKpiAnalyzer",
            "fielddata": true
          },

The analyzer is a custom analyzer built on the built-in path_hierarchy tokenizer.
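
For context, an analyzer of that shape might be defined like this (a sketch — the actual ConcatKpiAnalyzer settings aren't shown in the thread, and the "/" delimiter is an assumption based on the concatGuid values shown later):

```
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "concat_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "ConcatKpiAnalyzer": {
          "type": "custom",
          "tokenizer": "concat_path_tokenizer"
        }
      }
    }
  }
}
```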

I have 3 clusters, all on Elasticsearch 6.1.1:

  • DEV: 2 nodes with 20 GB of RAM / 4 CPUs
  • Homol: 3 nodes with 20 GB of RAM / 8 CPUs
  • Prod: 7 nodes with 20 GB of RAM / 8 CPUs

On all clusters this index receives data from Logstash with a monthly pattern, IndexName-%{YYYYMM}; it has 5 shards and 1 replica.

On the dev cluster the index is ~500 MB in total, and all requests, search + aggregation, execute in approximately ~150 ms.

On the Homologation cluster I have:

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   profiler-2018.08                    WPbBOvaiRDG21VUDoZoIaA   5   1   44034960            0     43.3gb         21.2gb

The search alone takes between 6 and 45 ms, while with the aggregation it takes 24 seconds...

Here is the beginning of the result:

{
 "took": 24254,
 "timed_out": false,
 "_shards": {
   "total": 40,
   "successful": 40,
   "skipped": 0,
   "failed": 0,
   "failures": []
 },
 "hits": {
   "total": 1179116,
   "max_score": 0,
   "hits": []
 },
 "aggregations": {
   "guid": {
     "doc_count_error_upper_bound": 1105,
     "sum_other_doc_count": 236336,
     "buckets": [
       {
         "key": "d7e69409-5aec-4e8e-ab02-b311782320b7",
         "doc_count": 279720,
         "status": {

My request looks like this:

GET profiler-*/_search
{
  "query": {
    "term": {
      "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": ".*",
        "exclude": ".*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

Here is the cluster health:

{
  "cluster_name": "liqor-sfy",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 429,
  "active_shards": 859,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

How should I optimize my request?

  • Should I add more nodes to the cluster?
  • Merge the index so we have fewer than 5 shards?
  • Roll the index weekly instead of monthly for a smaller data volume?
  • Is there another way to write this request so we don't need fielddata: true?

I don't think you need the "include": ".*" clause. It matches everything.

Obviously, running regexes over a lot of unique strings will be expensive.
If you prepare your content at index time so that it doesn't need heavy filtering at query time, that will help.
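
For example, the first path segment could be indexed as its own keyword field at index time (say level1Guid, a hypothetical field populated by Logstash), so the aggregation runs on a plain keyword with no regex and no fielddata (a sketch, untested):

```
GET profiler-*/_search
{
  "query": {
    "term": { "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522" }
  },
  "size": 0,
  "aggs": {
    "guid": {
      "terms": { "field": "level1Guid" },
      "aggs": {
        "status": { "terms": { "field": "status" } }
      }
    }
  }
}
```

Since every document carries a first path segment, this would still bucket the statuses of all descendants under each top-level guid.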

Thanks for the quick reply. I use it mainly because, as people drill down the path, I can modify the request to look like:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": "d7e69409-5aec-4e8e-ab02-b311782320b7\\/.*",
        "exclude": "d7e69409-5aec-4e8e-ab02-b311782320b7/.*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

As I want to aggregate on the bucket A\B but not have A\B\C\D returned... so I have less data to serialize.

Do you think I should roll the index? Merge the shards? Rewrite all my regex queries?

If you know the exact bucket you are looking for, you should use an exact-match include clause, not a regex. Exact-match clauses are passed in an array, e.g.

"include": ["d7e69409-5aec-4e8e-ab02-b311782320b7" ]

When someone clicks on a job with the guid "d7e69409-5aec-4e8e-ab02-b311782320b7", I need to fetch the buckets on the first level below it.

I.e. when someone clicks on job A, I want to get the buckets A\B, A\E, A\F, but I don't want A\B\C or A\E\G.

So my query is:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": "4f77f29c-bacb-4134-83a8-edfebe15d694\\/.*",
        "exclude": "4f77f29c-bacb-4134-83a8-edfebe15d694\\/.*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

and it returns:

"aggregations": {
    "guid": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "doc_count": 324014,
          "status": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Start",
                "doc_count": 162007
              },
              {
                "key": "Done",
                "doc_count": 162005
              },
              {
                "key": "Warning",
                "doc_count": 2
              }
            ]
          }
        }
      ]
    }
  }
}

If i run the query with

"include": ["4f77f29c-bacb-4134-83a8-edfebe15d694"]

I get:

"buckets": [
        {
          "key": "4f77f29c-bacb-4134-83a8-edfebe15d694",
          "doc_count": 324016,
          "status": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Start",
                "doc_count": 162008
              },
              {
                "key": "Done",
                "doc_count": 162006
              },
              {
                "key": "Warning",
                "doc_count": 2
              }
            ]
          }
        }
      ]

i.e. the current level and not the one below it.

I may have 0–n levels below, which is why I'm using the regex, and also why I'm using the fielddata field, if I understood it correctly?

So maybe indexing a keyword field called parentID or something would be a way of avoiding the expensive regex to find the children?
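
A sketch of what that mapping addition could look like (the field name is illustrative; the doc type matches the 6.x single-type mapping used by this index):

```
PUT profiler-2018.08/_mapping/doc
{
  "properties": {
    "parentID": { "type": "keyword" }
  }
}
```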

I have a keyword field named parentId; if I do:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579"
    }
  },
  "size": 0, 
  "aggs": {
    "filtre": {
      "filter": { "term": { "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694" } },
      "aggs": {
        "guid": {
          "terms": { "field": "concatGuid" },
          "aggs": {
            "status": {
              "terms": { "field": "status" }
            }
          }
        }
      }
    }
  }
}

I got

"aggregations": {
    "filtre": {
      "doc_count": 2,
      "guid": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "4f77f29c-bacb-4134-83a8-edfebe15d694",
            "doc_count": 2,
            "status": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Done",
                  "doc_count": 1
                },
                {
                  "key": "Start",
                  "doc_count": 1
                }
              ]
            }
          },
          {
            "key": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
            "doc_count": 2,
            "status": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Done",
                  "doc_count": 1
                },
                {
                  "key": "Start",
                  "doc_count": 1
                }
              ]
            }
          }
        ]
      }
    }
  }

Which is not what I want: for the bucket
"4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b"
it counted only 2 docs instead of all the children (i.e. "doc_count": 324014, as returned by the previous query).

But maybe I'm missing something ?

Based on the docs you've shown, I can only guess that there are 324,012 docs with a concatGuid of 4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b that don't have a parentId of 4f77f29c-bacb-4134-83a8-edfebe15d694?
That should be a query you can run to test (all counts assuming the common filter of the given 8fa... runGuid).
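
That check could look something like this (a sketch using the _count API, with the values copied from the responses above; the term query on concatGuid matches the exact token, which the path_hierarchy tokenizer emits):

```
GET profiler-2018.08/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579" } },
        { "term": { "concatGuid": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b" } }
      ],
      "must_not": {
        "term": { "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694" }
      }
    }
  }
}
```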

The field concatGuid has its terms generated by the path_hierarchy tokenizer.
My doc looks like this:

 {
        "_index": "profiler-2018.08",
        "_type": "doc",
        "_id": "2C7JSGUBGbf2qEcFCd-P",
        "_score": 15.599785,
        "_source": {
          "totalProcessedElementCount": 1,
          "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694",
          "phaseId": "7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "@version": "1",
          "phaseType": "Integration",
          "log": """d:\Homeware\Applications\Bin\Logs\debug-Watcher.html""",
          "criteria": "",
          "phaseName": "Integration",
          "region": null,
          "concatGuid": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "process": 35324,
          "type": "profiler",
          "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579",
          "database": "SRVPARLIPD08\\MSPARLIPD11.Liqor111",
          "frequency": "Monthly",
          "environment": "TESTS-WW-M-Full",
          "date": "2018-08-17T18:47:32.3776634+02:00",
          "thread": 290,
          "@timestamp": "2018-08-17T16:47:39.343Z",
          "status": "Done",
          "component": "WatcherService",
          "name": "Integration of file Vanille.zip",
          "totalElapsedTimeInSeconds": 13350.2792692
        }
      },

It's a tree of process data; I may have a document telling me that there is an error at the 7th level deep, so I need the aggregation to retrieve this information.

So yes, the document at the 7th level deep doesn't have the same parentId as the one at the 2nd level.

i.e.:
I want to display the status of A\B; B holds the status "Done", but A\B\C\D (two levels down from B) is in Error, so I need to display Error on B.
To achieve this I did the aggregation with the regex to match the correct terms, and I got a bucket telling me (using my previous example) that there are 2 processes with status "Warning", so I need to set my status to Warning.

If I do it using only the parentId, yes, it's faster, but I don't get the "correct" status.

That's why I'm using the path_hierarchy tokenizer, so the terms will also match documents much deeper in the tree.
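
To illustrate: assuming the analyzer wraps path_hierarchy with a "/" delimiter, a value like A/B/C is indexed as the terms A, A/B and A/B/C, which is what lets a deep document fall into its ancestors' buckets. That can be verified with the _analyze API (a sketch):

```
GET profiler-2018.08/_analyze
{
  "analyzer": "ConcatKpiAnalyzer",
  "text": "A/B/C"
}
```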

Is there a better way to achieve this result?

Should I split the index ?

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   profiler-2018.08                    WPbBOvaiRDG21VUDoZoIaA   5   1   44034960            0     43.3gb         21.2gb

Should I reduce the number of shards? Increase the number of nodes?

The query takes between 8 and 45 seconds to complete...

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.