Can I combine a single field's search results and get a hash?

I have a large ES index (7.7) that has a couple of million documents.
I want to be able to pull a day's worth of data and, using their UUID id values, generate a single hash (MD5 or SHA1 ideally).

I can do this easily using SQL:

SELECT MD5(CAST(ARRAY_AGG(m.id ORDER BY id) AS TEXT)) FROM message m WHERE m.created >= '2020-12-09' AND m.created < '2020-12-10';

(Returns: 'e54b3fcdb7e9b9fbfef7c5e7785a9eg0')

But combining or aggregating a single field's values into one long string and then returning a hash of that value is proving difficult with queries or Painless scripting.

Mapping (simplified):

curl -X PUT http://localhost:9200/test-docker/_mapping -H "Content-Type: application/json" -d '{"properties":{"id":{"type":"text"},"created":{"type":"date_nanos"},"application":{"type":"keyword"},"content":{"type":"text"}}}'

The closest I have come is the following (note: I haven't hashed or filtered by date range yet):

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match_all": {}
    },
    "script_fields": {
        "test": {
            "script": {
            	"lang": "painless",
            	"source": "String agg = \"\";\nfor (int i = 0; i < doc[\u0027_id\u0027].length; i++) {\nagg += doc[\u0027_id\u0027][i];if (i >= doc[\u0027_id\u0027].length) {\nagg += \",\";\n}\n}\nreturn agg;"
            }
        }
    },
    "sort" : [
        { "created" : {"order" : "asc"}}
    ]
}'

But this still returns every document individually, with just its own UUID in the scripted field:

"hits" : [
	{
		"_index" : "mobile-report-docker",
		"_type" : "_doc",
		"_id" : "4ff783b2-fe77-4c17-8000-370f47fb656c",
		"_score" : null,
		"fields" : {
		  "test" : [
		    "4ff783b2-fe77-4c17-8000-370f47fb656c"
		  ]
		},
		"sort" : [
		  1607506728626093585
		]
	},
	{
... etc

That is not very optimal; you could end up with a string that is just too long to handle in the script. Instead of building one giant string, try updating a hash for every record as the script iterates over the documents, feeding each element into the hash as you go. There are no crypto implementations in Painless as far as I know, but you could pull one in from the internet.
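
If you only need the concatenation done server-side, a scripted_metric aggregation can collect the IDs of all matching documents and join them in the reduce phase. Below is a minimal sketch, assuming the test-docker index from your mapping, the same date range as your SQL example, and that reading doc['_id'] in a script is acceptable for you (it works in 7.x but triggers a fielddata deprecation warning); the final hash would then be computed client-side rather than in Painless.

# Collect the _id of every document in the date range, sort the IDs in the
# reduce phase (mirroring ORDER BY id in the SQL), and return one
# comma-joined string to hash on the client.
curl -X GET "localhost:9200/test-docker/_search?pretty" -H 'Content-Type: application/json' -d'
{
    "size": 0,
    "query": {
        "range": {
            "created": { "gte": "2020-12-09", "lt": "2020-12-10" }
        }
    },
    "aggs": {
        "id_concat": {
            "scripted_metric": {
                "init_script": "state.ids = []",
                "map_script": "state.ids.add(doc[\"_id\"].value)",
                "combine_script": "return state.ids",
                "reduce_script": "List all = []; for (def s : states) { all.addAll(s); } Collections.sort(all); StringBuilder sb = new StringBuilder(); for (def id : all) { sb.append(id); sb.append(\",\"); } return sb.toString();"
            }
        }
    }
}'

Because the reduce script sorts the IDs, the joined string is deterministic, so piping it through something like md5sum on the client gives a stable digest. Be aware that this buffers every ID for the day in memory on the shards and the coordinating node, so it may not scale well to very large date ranges.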
