I have a log of API access events with entries like [ timestamp, id, other_informations ]. We would like to extract some information from it.
We know that we can learn a lot by looking at the number of requests from one id during a specified period of time. The problem is that a Top X gives thousands of normal ids, and an inverted Top X doesn't help either (thousands of ids with a single connection attempt). The id cardinality is in the tens of millions, so a full histogram can't be built.
On the other hand, we know that if we could specify N and M in something like "show me 1000 ids that appear between N and M times in the period of observation", we would have the info we need.
If you're trying to put these IDs on a visualization, you could get part of the way there by using a terms aggregation on ID, and then specifying the min_doc_count in the advanced JSON config. This will limit the terms on that axis to IDs that meet or exceed some threshold.
Once you do that, you can build up a list of IDs you are interested in and create a filter that limits all data to those IDs.
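For reference, the two steps above could look roughly like this when written as raw Elasticsearch request bodies, built here as plain Python dicts. This is only a sketch: the field name `id`, the aggregation name `by_id`, and the helper names are assumptions for illustration, not anything taken from your index.

```python
def ids_at_least(min_count, field="id", size=1000):
    """Request body listing up to `size` IDs seen in at least `min_count` documents."""
    return {
        "size": 0,  # return aggregation buckets only, no individual hits
        "aggs": {
            "by_id": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,
                    # drop buckets with fewer than `min_count` documents
                    "min_doc_count": int(min_count),
                }
            }
        },
    }


def filter_for_ids(ids, field="id"):
    """Request body that limits results to a hand-picked list of IDs."""
    return {"query": {"terms": {field: list(ids)}}}
```

In Kibana itself only the `{"min_doc_count": ...}` fragment would go into the advanced JSON input of the terms aggregation; the full bodies are what you would send to the search API directly.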
It is a somewhat manual process, but I think it is as far as Kibana will be able to take you until filters can be based on the results of queries or aggregations (maybe something like https://github.com/elastic/kibana/issues/16702).
Since there is no "max_doc_count", I have to combine it with an ascending sort to get my information, and Kibana says this is deprecated.
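One possible way to approximate an upper bound outside of Kibana is a bucket_selector pipeline aggregation nested under the terms aggregation. The sketch below builds such a request body as a Python dict; the field name `id` and the names `by_id` and `at_most_m` are assumptions. Note the limitation: the selector only filters the buckets the terms aggregation has already returned (the top `size` by count), so with tens of millions of ids it can still miss most of the ids in the N..M range.

```python
def ids_between(n, m, field="id", size=1000):
    """Request body keeping terms buckets whose document count is in [n, m]."""
    return {
        "size": 0,  # aggregation results only
        "aggs": {
            "by_id": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,
                    "min_doc_count": int(n),  # lower bound N
                },
                "aggs": {
                    "at_most_m": {  # hypothetical name for the upper-bound filter
                        "bucket_selector": {
                            # expose each bucket's document count to the script
                            "buckets_path": {"count": "_count"},
                            # keep the bucket only if its count is <= M
                            "script": f"params.count <= {int(m)}",
                        }
                    }
                },
            }
        },
    }
```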
My next step is entity centric indexing, but it requires scripting skills which I don't have in the context of Elasticsearch. I'll open a separate topic when needed.