I have a large ES index (7.7) with a couple of million documents.
I want to pull a day's worth of data and, using their UUID id values, generate a hash (ideally MD5 or SHA1).
I can do this easily using SQL:
SELECT MD5(CAST(ARRAY_AGG(m.id ORDER BY id) AS TEXT)) FROM message m WHERE m.created >= '2020-12-09' AND m.created < '2020-12-10';
(Returns: 'e54b3fcdb7e9b9fbfef7c5e7785a9eg0')
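For reference, the same digest can be reproduced outside the database. This is a minimal sketch assuming Postgres-style text rendering of the aggregated array (`{a,b,c}`); the second UUID is a placeholder I made up:

```shell
# Hypothetical sorted id list (the second UUID is a placeholder),
# matching the ORDER BY id in the SQL query.
ids="4ff783b2-fe77-4c17-8000-370f47fb656c,9c2d1a00-0000-4000-8000-000000000000"

# Postgres renders CAST(ARRAY_AGG(...) AS TEXT) as "{a,b,c}",
# so wrap the list in braces before hashing. printf adds no
# trailing newline, which matters for matching the DB's digest.
printf '{%s}' "$ids" | md5sum | cut -d' ' -f1
```

As long as the id order and the exact byte representation match what the database hashed, the digests will agree.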
But concatenating a single column across documents into one long string and then returning a hash of that value is proving difficult with queries or Painless scripting.
Mapping (simplified):
curl -X PUT http://localhost:9200/test-docker/_mapping -H "Content-Type: application/json" -d '{"properties":{"id":{"type":"text"},"created":{"type":"date_nanos"},"application":{"type":"keyword"},"content":{"type":"text"}}}'
The closest I have come is the following (note: I haven't hashed the result or filtered by date range yet):
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
},
"script_fields": {
"test": {
"script": {
"lang": "painless",
"source": "String agg = \"\";\nfor (int i = 0; i < doc[\u0027_id\u0027].length; i++) {\n  agg += doc[\u0027_id\u0027][i];\n  if (i < doc[\u0027_id\u0027].length - 1) {\n    agg += \",\";\n  }\n}\nreturn agg;"
}
}
},
"sort" : [
{ "created" : {"order" : "asc"}}
]
}'
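Since script_fields can never cross document boundaries, one server-side route would be a scripted_metric aggregation that collects the ids per shard and joins them in the reduce phase. This is an untested sketch against my simplified mapping; note that accessing doc['_id'] in scripts is deprecated in 7.x (it may warn, or need the id copied into its own keyword field), and I don't believe MessageDigest is in the default Painless allowlist, so the final MD5 would still have to be computed client-side over the returned string:

```shell
curl -X GET "localhost:9200/test-docker/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "range": { "created": { "gte": "2020-12-09", "lt": "2020-12-10" } }
  },
  "aggs": {
    "id_concat": {
      "scripted_metric": {
        "init_script": "state.ids = []",
        "map_script": "state.ids.add(doc[\u0027_id\u0027].value)",
        "combine_script": "return state.ids",
        "reduce_script": "def all = []; for (s in states) { all.addAll(s) } Collections.sort(all); return String.join(\u0027,\u0027, all);"
      }
    }
  }
}'
```

Sorting in the reduce script (rather than relying on shard order) mirrors the ORDER BY id in the SQL version, so the resulting string should hash to the same value for the same set of ids.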
But this still returns every document individually, each with just its own single UUID value in the result (script_fields are evaluated once per document, so the script never sees more than one _id):
"hits" : [
{
"_index" : "mobile-report-docker",
"_type" : "_doc",
"_id" : "4ff783b2-fe77-4c17-8000-370f47fb656c",
"_score" : null,
"fields" : {
"test" : [
"4ff783b2-fe77-4c17-8000-370f47fb656c"
]
},
"sort" : [
1607506728626093585
]
},
{
... etc