Several difficulties with Elasticsearch DSL


#1

Hello, I would appreciate help with several difficulties that I have encountered:

  1. After applying a terms aggregation and then a bucket_selector, I would like the query to return the number of buckets left. I used stats_bucket together with filter_path to get only the count.
    a) The bucket count is limited to 10000. Is there another technique to get this result without that limitation? I basically need to count the number of values that satisfy a condition computed by an aggregation function.
    b) I used filter_path to get only the count. Is this efficient, i.e. does Elasticsearch know to ignore the bucket data at an early stage when filter_path is used? Or is there some parameter in the aggregation DSL that specifies that buckets should only be counted, with no need to accumulate their contents?
    c) stats_bucket is a plain counter. Is there a way to count the unique values of some field inside the buckets that remain after bucket_selector, i.e. to somehow apply a cardinality aggregation here?
  2. Is there a way to plot the resulting DSL in Kibana as a metric visualization that just displays the resulting counter? I tried all the available dropdowns and options in the visualization screen without any success...
  3. Elasticsearch crashed (out of memory exception) on some queries, and I solved it by raising the JVM memory to an arbitrary higher value. Is there a known rule for how much data requires how much RAM, so that the RAM configuration is less of a guess?

Thanks


#2

1. d) Can this aggregation and filtering be achieved with a Painless script that calculates and filters everything?


(Zachary Tong) #3

There's a special _bucket_count property that can be used in a pipeline agg's buckets_path parameter. It makes the pipeline agg use the number of buckets in another aggregation instead of a metric. You can read about it here (scroll towards the bottom): https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html
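As a minimal sketch of how `_bucket_count` appears in a `buckets_path`, the request body below (written as a Python dict) follows the pattern from the linked documentation: a `bucket_selector` inside a parent aggregation keeps only the parent buckets whose inner terms aggregation has at least one bucket. The field names `date` and `category` and the aggregation names are assumptions for illustration, not part of the original question.

```python
import json

# Hypothetical field names ("date", "category"); adjust to your mapping.
# "calendar_interval" is the 7.x+ parameter name; older versions use "interval".
request_body = {
    "size": 0,
    "aggs": {
        "histo": {
            "date_histogram": {"field": "date", "calendar_interval": "day"},
            "aggs": {
                # Inner multi-bucket aggregation whose buckets get counted
                "categories": {"terms": {"field": "category"}},
                # Drop every day bucket that has no category buckets at all
                "min_bucket_selector": {
                    "bucket_selector": {
                        "buckets_path": {"count": "categories._bucket_count"},
                        "script": {"source": "params.count != 0"},
                    }
                },
            },
        }
    },
}

# Print the JSON body that would be sent to the _search endpoint
print(json.dumps(request_body, indent=2))
```

Note that the count is consumed here by a script that filters parent buckets, so this sketch shows the documented usage rather than a way to emit the count as a standalone result.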

Filter_path is an entirely post-processing step. After the entire response is complete, filter_path just parses the JSON and returns what's requested, throwing away everything else. So it doesn't save anything except network transfer and maybe some work client-side.
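To illustrate why this saves no server-side work, here is a simplified client-side imitation of the same idea: the full response already exists, and the filter only picks one value out of it. The function `apply_filter_path` and the sample response are hypothetical; the real parameter also supports wildcards and multiple paths, which this sketch does not.

```python
def apply_filter_path(response, path):
    """Keep only the value at a dotted path, e.g. 'aggregations.stats.count'.

    Simplified imitation: no wildcards and no comma-separated paths.
    Returns an empty dict when the path does not exist.
    """
    keys = path.split(".")
    value = response
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return {}
        value = value[key]
    # Rebuild the nested structure around the value that was kept
    for key in reversed(keys):
        value = {key: value}
    return value


# A made-up, fully computed response: all buckets were already built
full_response = {
    "took": 12,
    "hits": {"total": 1000, "hits": []},
    "aggregations": {
        "keys": {
            "buckets": [
                {"key": "a", "doc_count": 7},
                {"key": "b", "doc_count": 5},
            ]
        },
        "stats": {"count": 2, "min": 5.0, "max": 7.0},
    },
}

trimmed = apply_filter_path(full_response, "aggregations.stats.count")
print(trimmed)  # {'aggregations': {'stats': {'count': 2}}}
```

The buckets under `keys` were computed and held in memory before being discarded, which is the cost that filtering the output cannot avoid.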

Relatedly, pipeline aggs are post-processing too. They essentially replicate operations that would otherwise happen on the client side. After aggregations are complete and fully merged/reduced, pipeline aggs operate on the resulting list of buckets. So they don't save processing either; they just make the client simpler by moving those operations to the server.

Afraid not, we don't have a cardinality pipeline aggregator at the moment. You'll have to calculate the distinct count yourself on the client side, using a set or map or similar.
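A minimal sketch of that client-side step, assuming each remaining bucket carries a single-value metric sub-aggregation. The sub-aggregation name `max_score` and the sample buckets are hypothetical.

```python
def distinct_metric_count(buckets, metric_name):
    """Count distinct values of one metric across a list of agg buckets."""
    seen = set()
    for bucket in buckets:
        metric = bucket.get(metric_name)
        # Skip buckets that lack the metric or have a null value
        if metric is not None and metric.get("value") is not None:
            seen.add(metric["value"])
    return len(seen)


# Made-up buckets, as they might look after bucket_selector has run
remaining_buckets = [
    {"key": "a", "doc_count": 7, "max_score": {"value": 3.0}},
    {"key": "b", "doc_count": 5, "max_score": {"value": 5.0}},
    {"key": "c", "doc_count": 9, "max_score": {"value": 3.0}},
]

print(distinct_metric_count(remaining_buckets, "max_score"))  # 2
```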

I don't think the regular Kibana Visualize workflow supports pipeline aggregations. Time Series Visual Builder does work with pipeline aggs, but it is tailored for time series data, so it may not be a great fit. There may be a way to do custom visualizations, but I'm not an expert in Kibana; you may have better luck asking that part of the question over in the Kibana section of the forums.

Generally, out-of-memory crashes like this come from aggregations that ask for too many buckets. The temporary memory needed to hold all the buckets can overwhelm the heap and cause stability issues or OOM. It's hard to estimate how much memory will be used because it depends on the cardinality of the aggregation tree. E.g. 10 terms aggregations with 10 values each need far fewer buckets than 5 terms aggregations with 100,000,000 values each.
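A rough back-of-the-envelope sketch of that comparison, under stated assumptions: for nested terms aggregations where every combination of values occurs, the worst-case bucket count is the product of the cardinalities; for sibling aggregations it is the sum. Real data usually produces fewer combinations, and the `size` parameter limits what is returned, so these are upper bounds only. The function names are made up for illustration.

```python
from math import prod


def worst_case_nested_buckets(cardinalities):
    """Upper bound when each aggregation is nested inside the previous one."""
    return prod(cardinalities)


def worst_case_sibling_buckets(cardinalities):
    """Upper bound when the aggregations sit side by side at one level."""
    return sum(cardinalities)


small = [10] * 10          # 10 terms aggregations, 10 values each
large = [100_000_000] * 5  # 5 terms aggregations, 100,000,000 values each

print(worst_case_sibling_buckets(small), worst_case_sibling_buckets(large))
print(worst_case_nested_buckets(small), worst_case_nested_buckets(large))
```

Either way the second tree is larger by many orders of magnitude, which is why the field cardinalities matter more than the number of aggregations.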

You can add more nodes, reduce the size of your aggregations, switch to the composite aggregation, or lower the circuit breaker limit so an exception is thrown sooner. Or some combination of the above.
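As a hedged sketch of the composite aggregation option, the function below pages through all keys in fixed-size chunks and counts on the client the keys whose distinct-value count exceeds a threshold, so no single response has to hold every bucket. The `search` argument is a hypothetical callable that sends a request body and returns the parsed response; the field names `key.keyword` and `value.keyword` are borrowed from the follow-up post in this thread. The cardinality aggregation is approximate, so counts near the threshold may be slightly off.

```python
def count_keys_with_many_values(search, threshold=4, page_size=1000):
    """Count keys having more than `threshold` distinct values.

    `search(body)` is assumed to return the parsed _search response.
    """
    count = 0
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"key": {"terms": {"field": "key.keyword"}}}],
        }
        if after is not None:
            # Resume after the last key of the previous page
            composite["after"] = after
        body = {
            "size": 0,
            "aggs": {
                "keys": {
                    "composite": composite,
                    "aggs": {
                        "distinct_values": {
                            "cardinality": {"field": "value.keyword"}
                        }
                    },
                }
            },
        }
        agg = search(body)["aggregations"]["keys"]
        buckets = agg["buckets"]
        if not buckets:
            return count
        count += sum(
            1 for b in buckets if b["distinct_values"]["value"] > threshold
        )
        after = agg.get("after_key")
        if after is None:
            return count
```

With a real client, `search` could be a small wrapper around whatever search call you already use; only one page of buckets is held in memory at a time, at the price of several round trips.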


#4

@polyfractal
Thanks for the answers

  1. I had already seen the doc you referred to and _bucket_count, but there it is used in a filter and not as an output. Is there a way to return its value in the result? That would solve the counting problem.
  2. If filter_path isn't efficient, is there some way to ignore agg hits, similar to how "size=0" is used to ignore ordinary hits?
  3. I am new here. What's the best way to involve the Kibana people in this? Should I add a "kibana" tag to the question?
  4. Can this aggregation and filtering be achieved with a Painless script that calculates and filters everything? I crafted something, but it returns status 504 (gateway timeout) when executed from the Kibana console and probably never finishes in the Elasticsearch process. Assume the records have a string field called "key" with a corresponding keyword "key.keyword", and a string field called "value" with a corresponding keyword "value.keyword". My goal is to count the number of unique keys that have more than 4 unique values:

GET /index1/_search
{
  "size": 0,
  "aggs": {
    "keys with more than 4 unique values": {
      "scripted_metric": {
        "init_script": "state.uniqueValuesPerKey = new HashMap()",
        "map_script": "if (!doc['key.keyword'].empty && !state.uniqueValuesPerKey.containsKey(doc['key.keyword'])) { state.uniqueValuesPerKey.put(doc['key.keyword'], [doc['value.keyword']: true])}if (!doc['key.keyword'].empty && state.uniqueValuesPerKey.containsKey(doc['key.keyword']) && !state.uniqueValuesPerKey[doc['key.keyword']].containsKey(doc['value.keyword'])) { state.uniqueValuesPerKey[doc['key.keyword']].put(doc['value.keyword'], true)}",
        "combine_script": "int count = 0; for (t in state.uniqueValuesPerKey) { if (t.size() > 4) { count += 1; } } return count",
        "reduce_script": "int totalCount = 0; for (a in states) { totalCount += a; } return totalCount"
      }
    }
  }
}

  5. The Elasticsearch process was stuck at 100% CPU because of that query. What is the correct way to stop it? It is running on Windows. Should I restart the Windows service? The service did not respond to the restart at all, so I had to terminate the process, which is not ideal. Was there another solution? Also, after starting it again I got the feeling that it resumed that job. Is there a way to check on it? UPDATE: It finally went idle. But I would still like to know the best practice for listing and gracefully cancelling problematic queries in such situations.

  6. Can the query I described in (4) be achieved using Elastic X-Pack SQL? https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-syntax-select.html
    Does Elastic X-Pack SQL support subselects?


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.