Elasticsearch for Data Science

Hi,
Some context. I'm using Elasticsearch and Filebeat to store documents. Each document has six fields: one timestamp field and five keyword fields. Two of the keyword fields are IDs (id_1 and id_2). The IDs take many distinct values; for example, in a single day the cardinality of id_1 is around 4 million.

I want to apply machine learning techniques to the stored data. More precisely, in this first iteration I want to cluster the data (for example with k-means). Before applying the algorithm I'm exploring the dataset, and I'm interested in the cardinality of id_2 per id_1. To compute this I'm using a terms aggregation with a cardinality sub-aggregation. However, as mentioned above, I have around 4 million id_1 values, while the terms aggregation is limited to roughly 65,000 buckets. The structure of the aggregation is the following:

"aggs": {
  "terms_id_1": {
    "terms": {
      "field": "id_1",
      "size": 10000
    },
    "aggs": {
      "card_id_2": {
        "cardinality": {
          "field": "id_2"
        }
      },
      "id_2_bucket_sort": {
        "bucket_sort": {
          "sort": [
            {"card_id_2": {"order": "desc"}}
          ]
        }
      }
    }
  }
}
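For reference, one possible way to cover all ~4 million id_1 values is to page through them with a composite aggregation, which returns an after_key that is passed back as "after" in the next request. This is only a sketch (field names reused from above; the size is illustrative), and note that pipeline aggregations such as bucket_sort are not allowed under a composite aggregation, so sorting by cardinality would have to happen client-side:

```json
"aggs": {
  "ids": {
    "composite": {
      "size": 10000,
      "sources": [
        { "id_1": { "terms": { "field": "id_1" } } }
      ]
    },
    "aggs": {
      "card_id_2": {
        "cardinality": { "field": "id_2" }
      }
    }
  }
}
```

Each response includes an after_key object; repeating the request with `"after": <after_key>` inside the composite block walks through the remaining id_1 buckets page by page.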

I have two questions:

  1. Is it possible to perform this aggregation across all id_1 values? I've tried the partitioning feature (include/partition) of the terms aggregation before, but I don't think it is suitable for covering the complete period.
  2. Is Elasticsearch suitable for machine learning (supervised or unsupervised), or should I consider another tool? I've researched this topic and the opinions are mixed.
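As background for the clustering step mentioned above, here is a minimal, dependency-free sketch of the k-means idea (Lloyd's algorithm) on invented toy feature vectors, e.g. one pair per id_1 such as (cardinality of id_2, document count). The data, function name, and parameters are illustrative assumptions, not part of the original setup:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid recomputation for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids from the data
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Invented toy vectors: two well-separated groups of three points each.
data = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1),
        (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(data, k=2)
```

In practice you would export the per-id_1 aggregation results from Elasticsearch and use a library implementation (for example scikit-learn's KMeans) rather than hand-rolled code; the sketch is only meant to show what the algorithm does with the exported features.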

Best regards,
