Hi,
Some context: I'm using Elasticsearch and Filebeat to store documents. Each document has 6 fields: one is the timestamp and the other 5 are keywords. Two of the keyword fields are IDs (id_1 and id_2). The IDs have many distinct values; for example, in a single day the cardinality of id_1 is around 4 million.
I want to apply machine learning techniques to the stored data. More precisely, in this first iteration I want to cluster the information (for example, using k-means). Before trying to apply the algorithm, I'm exploring the dataset. I'm interested in the cardinality of id_2 per id_1. To do this, I'm using a terms aggregation with a cardinality sub-aggregation. However, as I said, I have around 4 million id_1 values, and the limit of the terms aggregation is around 65,000 buckets. The structure of the aggregation is the following:
"aggs": {
"terms_id_1": {
"terms": {
"field": "id_1",
"size": 10000
},
"aggs": {
"card_id_2": {
"cardinality": {
"field": "id_2"
}
},
"id_2_bucket_sort": {
"bucket_sort": {
"sort": [
{"card_id_2": {"order": "desc"}}
]
}
}
}
}
}
I have two questions:
- Is it possible to perform this aggregation over all 4 million id_1 values? I've tried using the partition parameter of the terms aggregation (see the sketch after this list), but I'm not sure it is suitable when the complete period has to be considered.
- Is Elasticsearch suitable for use in a machine learning context (supervised or unsupervised), or should I consider another tool? I've researched this topic and the opinions I found are mixed.
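For reference, the partitioned variant I tried looked roughly like the sketch below. It hashes the id_1 terms into num_partitions groups and processes one partition per request; the value of 400 (to get roughly 10,000 terms per partition out of ~4 million) is just a placeholder I chose for illustration.

"aggs": {
  "terms_id_1": {
    "terms": {
      "field": "id_1",
      "include": {
        "partition": 0,
        "num_partitions": 400
      },
      "size": 10000
    },
    "aggs": {
      "card_id_2": {
        "cardinality": {
          "field": "id_2"
        }
      }
    }
  }
}

The client then loops partition from 0 to num_partitions - 1, one search per partition, so any global ordering by card_id_2 has to be reassembled client-side.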
Best regards,