I have a large ES index (7.7) with a couple of million documents.
I want to pull a day's worth of data and, using their UUID id values, generate a hash (ideally MD5 or SHA1).
I can do this easily using SQL:
SELECT MD5(CAST(ARRAY_AGG(m.id ORDER BY id) AS TEXT)) FROM message m WHERE m.created >= '2020-12-09' AND m.created < '2020-12-10';
(Returns: 'e54b3fcdb7e9b9fbfef7c5e7785a9eg0')
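For reference, the same digest can be reproduced outside the database. This is a minimal sketch assuming Postgres-style text rendering of the aggregated array (`{a,b,c}`); the second UUID is a placeholder I made up:

```shell
# Hypothetical sorted id list (the second UUID is a placeholder),
# matching the ORDER BY id in the SQL query.
ids="4ff783b2-fe77-4c17-8000-370f47fb656c,9c2d1a00-0000-4000-8000-000000000000"

# Postgres renders CAST(ARRAY_AGG(...) AS TEXT) as "{a,b,c}",
# so wrap the list in braces before hashing. printf adds no
# trailing newline, which matters for matching the DB's digest.
printf '{%s}' "$ids" | md5sum | cut -d' ' -f1
```

As long as the id order and the exact byte representation match what the database hashed, the digests will agree.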
But concatenating a single column across documents into one long string and then returning a hash of that value is proving difficult with queries or Painless scripting.
Mapping (simplified):
curl -X PUT http://localhost:9200/test-docker/_mapping -H "Content-Type: application/json" -d '{"properties":{"id":{"type":"text"},"created":{"type":"date_nanos"},"application":{"type":"keyword"},"content":{"type":"text"}}}'
The closest I have come is the following (note: I haven't hashed the result or filtered by date range yet):
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
},
"script_fields": {
"test": {
"script": {
"lang": "painless",
"source": "String agg = \"\";\nfor (int i = 0; i < doc[\u0027_id\u0027].length; i++) {\n  agg += doc[\u0027_id\u0027][i];\n  if (i < doc[\u0027_id\u0027].length - 1) {\n    agg += \",\";\n  }\n}\nreturn agg;"
}
}
},
"sort" : [
{ "created" : {"order" : "asc"}}
]
}'
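Since script_fields can never cross document boundaries, one server-side route would be a scripted_metric aggregation that collects the ids per shard and joins them in the reduce phase. This is an untested sketch against my simplified mapping; note that accessing doc['_id'] in scripts is deprecated in 7.x (it may warn, or need the id copied into its own keyword field), and I don't believe MessageDigest is in the default Painless allowlist, so the final MD5 would still have to be computed client-side over the returned string:

```shell
curl -X GET "localhost:9200/test-docker/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "range": { "created": { "gte": "2020-12-09", "lt": "2020-12-10" } }
  },
  "aggs": {
    "id_concat": {
      "scripted_metric": {
        "init_script": "state.ids = []",
        "map_script": "state.ids.add(doc[\u0027_id\u0027].value)",
        "combine_script": "return state.ids",
        "reduce_script": "def all = []; for (s in states) { all.addAll(s) } Collections.sort(all); return String.join(\u0027,\u0027, all);"
      }
    }
  }
}'
```

Sorting in the reduce script (rather than relying on shard order) mirrors the ORDER BY id in the SQL version, so the resulting string should hash to the same value for the same set of ids.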
But this still returns every document individually, each with just its own single UUID value in the result (script_fields are evaluated once per document, so the script never sees more than one _id):
"hits" : [
{
"_index" : "mobile-report-docker",
"_type" : "_doc",
"_id" : "4ff783b2-fe77-4c17-8000-370f47fb656c",
"_score" : null,
"fields" : {
"test" : [
"4ff783b2-fe77-4c17-8000-370f47fb656c"
]
},
"sort" : [
1607506728626093585
]
},
{
... etc