Hello,
I'm totally new to Elasticsearch and am looking for the best approach for my case: I need to search a relatively large (~500 MB) file containing two lists of SHA-512 hashes (~4,000,000) to find whether a given hash is on a list, and on which one. The result of my first approach is well below expectations (~5 s to search for a hash on a given list), so I'm even wondering whether Elasticsearch is the wrong tool for this problem and I should go with an RDBMS.
It works fine but is slow (a few seconds). Is there anything I can improve to make searching more efficient? I tried different queries (terms, match), but there was no difference in time.
@robert774, I think you are uploading all the hashes into just one document? That is likely to cause issues, since Elasticsearch reads out the whole document when returning the response. Even with source excludes, I think this is going to be a problem.
I can think of two immediate options:
1) Index each hash into its own document, and maybe add a field indicating the list, or use two different indices.
2) Keep the current setup, but also set size to 0 in the request. You can tell whether you hit anything from the total hits returned.
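The second option can be sketched as a request body. This is only a minimal illustration: the field name "hashes" and the helper names are assumptions, not taken from the original setup, and the code only builds and reads plain dicts rather than talking to a cluster.

```python
import json


def build_membership_query(hash_value, field="hashes"):
    """Build a search body that asks only for the hit count, not the documents.

    The field name "hashes" is a hypothetical placeholder.
    """
    return {
        "size": 0,  # return no documents, so the large document is never read out
        "query": {"term": {field: hash_value}},
    }


def hit_count(response):
    """Read the total hit count from a search response given as a dict."""
    total = response["hits"]["total"]
    # Elasticsearch 7+ reports {"value": n, "relation": ...};
    # older versions report a bare integer.
    return total["value"] if isinstance(total, dict) else total


if __name__ == "__main__":
    # "abc123" is a placeholder, not a real SHA-512 hash.
    print(json.dumps(build_membership_query("abc123"), indent=2))
```

A hit count greater than zero would mean the hash is present; which list it is on would still depend on how the lists are stored (for example, one field or one index per list).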
I've tried option 2) with size=0 and it really did the job: the response comes back in 2 ms.
Thank you very much for your support, it solved my problem.
I'll also check option 1, which looks more native to Elasticsearch. I just want to confirm that I understand it correctly: do I have to parse the file and put each hash under its own index?
e.g.
index=1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
"hash":"1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
"list":"list1"
}
index=2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
"hash":"2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
"list":"list2"
}
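For comparison, the earlier suggestion was to index each hash into its own document, not into its own index. A minimal sketch of that reading, assuming a single index with the hypothetical name "hashes" and the bulk API's newline-delimited JSON format, could look like this. Using the hash as the document id is an assumption as well; leaving "_id" out lets Elasticsearch generate one. Older Elasticsearch versions may additionally require a type.

```python
import json


def bulk_lines(hashes, list_name, index="hashes"):
    """Yield bulk-API lines: one action line and one document line per hash."""
    for h in hashes:
        # Assumption: the hash doubles as the document id.
        yield json.dumps({"index": {"_index": index, "_id": h}})
        yield json.dumps({"hash": h, "list": list_name})


def bulk_body(hashes, list_name, index="hashes"):
    """Join the lines into a request body; the bulk API expects a trailing newline."""
    return "\n".join(bulk_lines(hashes, list_name, index)) + "\n"


if __name__ == "__main__":
    # "aa" and "bb" are placeholders, not real SHA-512 hashes.
    print(bulk_body(["aa", "bb"], "list1"), end="")
```

With this layout, a lookup is a term query on "hash", and the "list" field of the matching document says which list the hash came from.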