Hi,
Some context: I'm using Elasticsearch and Filebeat to store documents. Each document has 6 fields: one is the timestamp and the other 5 are keywords. Two of the keyword fields are IDs (id_1 and id_2). The IDs have many distinct values; for example, in a single day the cardinality of id_1 is around 4 million.
I want to apply machine learning techniques to the stored data. More precisely, in this first iteration I want to cluster the information (for example, using k-means). Before trying to apply the algorithm, I'm exploring the dataset. I'm interested in the cardinality of id_2 per id_1. To do this, I'm using a terms aggregation with a cardinality sub-aggregation. However, as I said, I have around 4 million id_1 values, and the limit of the terms aggregation is around 65,000 buckets. The structure of the aggregation is the following:
"aggs": {
"terms_id_1": {
"terms": {
"field": "id_1",
"size": 10000
},
"aggs": {
"card_id_2": {
"cardinality": {
"field": "id_2"
}
},
"id_2_bucket_sort": {
"bucket_sort": {
"sort": [
{"card_id_2": {"order": "desc"}}
]
}
}
}
}
}
I have two questions:
- Is it possible to perform this aggregation over all 4 million id_1 values? I've tried using the partition parameter of the terms aggregation (see the sketch after this list), but I'm not sure it is suitable when the complete period has to be considered.
- Is Elasticsearch suitable for use in a machine learning context (supervised or unsupervised), or should I consider another tool? I've researched this topic and the opinions I found are mixed.
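For reference, the partitioned variant I tried looked roughly like the sketch below. It hashes the id_1 terms into num_partitions groups and processes one partition per request; the value of 400 (to get roughly 10,000 terms per partition out of ~4 million) is just a placeholder I chose for illustration.

"aggs": {
  "terms_id_1": {
    "terms": {
      "field": "id_1",
      "include": {
        "partition": 0,
        "num_partitions": 400
      },
      "size": 10000
    },
    "aggs": {
      "card_id_2": {
        "cardinality": {
          "field": "id_2"
        }
      }
    }
  }
}

The client then loops partition from 0 to num_partitions - 1, one search per partition, so any global ordering by card_id_2 has to be reassembled client-side.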
Best regards,