Help optimizing an aggregation request with fielddata set to true

Hello,

I have an index with a field mapped as:

"concatGuid": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            },
            "analyzer": "ConcatKpiAnalyzer",
            "fielddata": true
          },

The analyzer is a custom analyzer built on the built-in path_hierarchy tokenizer.
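
For context, an analyzer of that shape might be defined like this (a sketch — the actual ConcatKpiAnalyzer settings aren't shown in the thread, and the "/" delimiter is an assumption based on the concatGuid values shown later):

```
PUT my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "concat_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "ConcatKpiAnalyzer": {
          "type": "custom",
          "tokenizer": "concat_path_tokenizer"
        }
      }
    }
  }
}
```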

I have 3 clusters, all on Elasticsearch 6.1.1:

  • DEV: 2 nodes with 20 GB of RAM / 4 CPUs
  • Homol: 3 nodes with 20 GB of RAM / 8 CPUs
  • Prod: 7 nodes with 20 GB of RAM / 8 CPUs

On all clusters this index receives data from Logstash with a monthly pattern, IndexName-%{YYYYMM}; it has 5 shards and 1 replica.

On the dev cluster the index is ~500 MB in total, and all requests, search + aggregation, execute in approximately ~150 ms.

On the Homologation cluster I have:

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   profiler-2018.08                    WPbBOvaiRDG21VUDoZoIaA   5   1   44034960            0     43.3gb         21.2gb

The search alone takes between 6 and 45 ms, while with the aggregation it takes 24 seconds...

Here is the beginning of the result:

{
 "took": 24254,
 "timed_out": false,
 "_shards": {
   "total": 40,
   "successful": 40,
   "skipped": 0,
   "failed": 0,
   "failures": []
 },
 "hits": {
   "total": 1179116,
   "max_score": 0,
   "hits": []
 },
 "aggregations": {
   "guid": {
     "doc_count_error_upper_bound": 1105,
     "sum_other_doc_count": 236336,
     "buckets": [
       {
         "key": "d7e69409-5aec-4e8e-ab02-b311782320b7",
         "doc_count": 279720,
         "status": {

My request looks like this:

GET profiler-*/_search
{
  "query": {
    "term": {
      "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": ".*",
        "exclude": ".*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

Here is the cluster health:

{
  "cluster_name": "liqor-sfy",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 429,
  "active_shards": 859,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

How should I optimize my request?

  • Should I add more nodes to the cluster?
  • Merge the index so we have fewer than 5 shards?
  • Roll the index weekly instead of monthly for a smaller data volume?
  • Is there another way to write this request so we don't need fielddata: true?

I don't think you need the "include": ".*" clause. It matches everything.

Obviously, running regexes over a lot of unique strings will be expensive.
If you prepare your content at index time so that it doesn't need heavy filtering at query time, that will help.
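
For example, the first path segment could be indexed as its own keyword field at index time (say level1Guid, a hypothetical field populated by Logstash), so the aggregation runs on a plain keyword with no regex and no fielddata (a sketch, untested):

```
GET profiler-*/_search
{
  "query": {
    "term": { "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522" }
  },
  "size": 0,
  "aggs": {
    "guid": {
      "terms": { "field": "level1Guid" },
      "aggs": {
        "status": { "terms": { "field": "status" } }
      }
    }
  }
}
```

Since every document carries a first path segment, this would still bucket the statuses of all descendants under each top-level guid.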

Thanks for the quick reply. I use it mainly because, as people drill down the path, I can modify the request to look like:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "b80c2146-2bc2-41fc-bac2-141baef82522"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": "d7e69409-5aec-4e8e-ab02-b311782320b7\\/.*",
        "exclude": "d7e69409-5aec-4e8e-ab02-b311782320b7/.*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

As I want to aggregate on the bucket A\B but not have A\B\C\D returned... so I have less data to serialize.

Do you think I should roll the index? Merge the shards? Rewrite all my regex queries?

If you know the exact bucket you are looking for, you should use an exact-match include clause, not a regex. Exact-match clauses are passed in an array, e.g.

"include": ["d7e69409-5aec-4e8e-ab02-b311782320b7" ]

When someone clicks on a job with the guid "d7e69409-5aec-4e8e-ab02-b311782320b7", I need to fetch the buckets on the first level below it.

I.e. when someone clicks on job A, I want to get the buckets A\B, A\E, A\F, but I don't want A\B\C or A\E\G.

So my query is:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579"
    }
  },
  "size": 0, 
  "aggs": {
    "guid": {
      "terms": {
        "field": "concatGuid",
        "include": "4f77f29c-bacb-4134-83a8-edfebe15d694\\/.*",
        "exclude": "4f77f29c-bacb-4134-83a8-edfebe15d694\\/.*\\/.*"
      },
      "aggs": {
        "status": {
          "terms": { "field": "status" }
        }
      }
    }
  }
}

and it returns:

"aggregations": {
    "guid": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "doc_count": 324014,
          "status": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Start",
                "doc_count": 162007
              },
              {
                "key": "Done",
                "doc_count": 162005
              },
              {
                "key": "Warning",
                "doc_count": 2
              }
            ]
          }
        }
      ]
    }
  }
}

If i run the query with

"include": ["4f77f29c-bacb-4134-83a8-edfebe15d694"]

I get:

"buckets": [
        {
          "key": "4f77f29c-bacb-4134-83a8-edfebe15d694",
          "doc_count": 324016,
          "status": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "Start",
                "doc_count": 162008
              },
              {
                "key": "Done",
                "doc_count": 162006
              },
              {
                "key": "Warning",
                "doc_count": 2
              }
            ]
          }
        }
      ]

i.e. the current level and not the one below it.

I may have 0–n levels below, which is why I'm using the regex, and also why I'm using the fielddata field, if I understood it correctly?

So maybe indexing a keyword field called parentID or something would be a way of avoiding the expensive regex to find the children?
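
A sketch of what that mapping addition could look like (the field name is illustrative; the doc type matches the 6.x single-type mapping used by this index):

```
PUT profiler-2018.08/_mapping/doc
{
  "properties": {
    "parentID": { "type": "keyword" }
  }
}
```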

I have a keyword field named parentId; if I do:

GET profiler-2018.08/_search
{
  "query": {
    "term": {
      "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579"
    }
  },
  "size": 0, 
  "aggs": {
    "filtre": {
      "filter": { "term": { "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694" } },
      "aggs": {
        "guid": {
          "terms": { "field": "concatGuid" },
          "aggs": {
            "status": {
              "terms": { "field": "status" }
            }
          }
        }
      }
    }
  }
}

I got

"aggregations": {
    "filtre": {
      "doc_count": 2,
      "guid": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "4f77f29c-bacb-4134-83a8-edfebe15d694",
            "doc_count": 2,
            "status": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Done",
                  "doc_count": 1
                },
                {
                  "key": "Start",
                  "doc_count": 1
                }
              ]
            }
          },
          {
            "key": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
            "doc_count": 2,
            "status": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Done",
                  "doc_count": 1
                },
                {
                  "key": "Start",
                  "doc_count": 1
                }
              ]
            }
          }
        ]
      }
    }
  }

Which is not what I want: for the bucket
"4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b"
it counted only 2 docs instead of all the children (i.e. "doc_count": 324014, as returned by the previous query).

But maybe I'm missing something ?

Based on the docs you've shown, I can only guess that there are 324,012 docs with a concatGuid of 4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b that don't have a parentId of 4f77f29c-bacb-4134-83a8-edfebe15d694?
That should be a query you can run to test (all counts assuming the common filter of the given 8fa... runGuid).
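
That check could look something like this (a sketch using the _count API, with the values copied from the responses above; the term query on concatGuid matches the exact token, which the path_hierarchy tokenizer emits):

```
GET profiler-2018.08/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579" } },
        { "term": { "concatGuid": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b" } }
      ],
      "must_not": {
        "term": { "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694" }
      }
    }
  }
}
```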

The field concatGuid has its terms generated by the path_hierarchy tokenizer.
My doc looks like this:

 {
        "_index": "profiler-2018.08",
        "_type": "doc",
        "_id": "2C7JSGUBGbf2qEcFCd-P",
        "_score": 15.599785,
        "_source": {
          "totalProcessedElementCount": 1,
          "parentId": "4f77f29c-bacb-4134-83a8-edfebe15d694",
          "phaseId": "7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "@version": "1",
          "phaseType": "Integration",
          "log": """d:\Homeware\Applications\Bin\Logs\debug-Watcher.html""",
          "criteria": "",
          "phaseName": "Integration",
          "region": null,
          "concatGuid": "4f77f29c-bacb-4134-83a8-edfebe15d694/7bc8afe1-79fc-4f23-9c81-816d9d35fa3b",
          "process": 35324,
          "type": "profiler",
          "runGuid": "8fa2577e-e916-4711-8a3d-4d138d86d579",
          "database": "SRVPARLIPD08\\MSPARLIPD11.Liqor111",
          "frequency": "Monthly",
          "environment": "TESTS-WW-M-Full",
          "date": "2018-08-17T18:47:32.3776634+02:00",
          "thread": 290,
          "@timestamp": "2018-08-17T16:47:39.343Z",
          "status": "Done",
          "component": "WatcherService",
          "name": "Integration of file Vanille.zip",
          "totalElapsedTimeInSeconds": 13350.2792692
        }
      },

It's a tree of process data; I may have a document telling me that there is an error at the 7th level deep, so I need the aggregation to retrieve this information.

So yes, the document at the 7th level deep doesn't have the same parentId as the one at the 2nd level.

i.e.:
I want to display the status of A\B; B holds the status "Done", but A\B\C\D (two levels down from B) is in Error, so I need to display Error on B.
To achieve this I did the aggregation with the regex to match the correct terms, and I got a bucket telling me (using my previous example) that there are 2 processes with status "Warning", so I need to set my status to Warning.

If I do it using only the parentId, yes, it's faster, but I don't get the "correct" status.

That's why I'm using the path_hierarchy tokenizer, so the terms will also match documents much deeper in the tree.
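
To illustrate: assuming the analyzer wraps path_hierarchy with a "/" delimiter, a value like A/B/C is indexed as the terms A, A/B and A/B/C, which is what lets a deep document fall into its ancestors' buckets. That can be verified with the _analyze API (a sketch):

```
GET profiler-2018.08/_analyze
{
  "analyzer": "ConcatKpiAnalyzer",
  "text": "A/B/C"
}
```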

Is there a better way to achieve this result?

Should I split the index ?

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   profiler-2018.08                    WPbBOvaiRDG21VUDoZoIaA   5   1   44034960            0     43.3gb         21.2gb

Should I reduce the number of shards? Increase the number of nodes?

The query takes between 8 and 45 seconds to complete...

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.