How to improve poor performance for searching

robert774 · December 6, 2019, 10:43am

Hello,
I'm totally new to Elasticsearch and am looking for the best approach for my case: I need to search a relatively large (~500 MB) file containing two lists of SHA-512 hashes (~4,000,000) to find out whether a hash is on a list and, if so, on which one. The result of my first approach is far below expectations (~5 s to look up a hash on a given list), so I even wonder whether Elasticsearch is the wrong solution for my problem and I should go with an RDBMS instead.

sample file structure:

{
    "list1": [
        "1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
        "32a381ecb65cee78f44a4c0e282af46bd8f1909522801e087d396e61ed998896d1a65bb074f73c9e22b886bfb71286d16cedebe8b3ea19823c46f23e7566afd2"
    ],
    "list2": [
        "2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
        "42a381ecb65cee78f44a4c0e282af46bd8f1909522801e087d396e61ed998896d1a65bb074f73c9e22b886bfb71286d16cedebe8b3ea19823c46f23e7566afd2"
    ]
}

what I have done:

1. put mapping for myindex:

{
  "mappings": {
    "properties": {
      "list1": { "type": "keyword" },
      "list2": { "type": "keyword" }
    }
  }
}

2. upload file with curl:

curl -X POST http://myhost:9201/myindex/_doc -H "Content-Type: application/json" --data-binary @samplefile.txt

3. search for hash:

http://myhost:9201/myindex/_search?q=list1:1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52&_source_excludes=*

It's working fine but slow (a few seconds). Is there anything I can improve to make searching more efficient? I tried different queries (terms, match) but saw no difference in time.

HenningAndersen · December 6, 2019, 11:18am

@robert774, I think you are uploading all the hashes to just one document? That is likely to lead to issues, since Elasticsearch reads out the whole document when returning the response. Even with source excludes I think this is going to cause issues.

I can think of two immediate options:

1. Index each hash into its own document, maybe add a list indication field or use two different indices.
2. Do like now, but also specify the size as 0 in the request. You will be able to see if you hit anything from the total hits returned.
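
A minimal sketch of the second option, assuming the host, index and field names from the original post. The helper names build_exists_url and total_hits are made up for illustration, and the sample response is hand-written only to show the shape being inspected. With size=0 no document is returned, so only the total hit count needs to be read:

```python
import json
from urllib.parse import urlencode


def build_exists_url(base_url, index, field, value):
    """Return a URI-search URL that asks only for the hit count (size=0)."""
    params = urlencode({"q": f"{field}:{value}", "size": 0})
    return f"{base_url}/{index}/_search?{params}"


def total_hits(response):
    """Read hits.total, which is an object in 7.x and a plain number in older versions."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


if __name__ == "__main__":
    sha = ("1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e3"
           "2ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52")
    print(build_exists_url("http://myhost:9201", "myindex", "list1", sha))

    # Hand-written sample response (an assumption, not real server output)
    sample = json.loads('{"hits": {"total": {"value": 1, "relation": "eq"}, "hits": []}}')
    print("found" if total_hits(sample) > 0 else "not found")
```

Running the same check against list2 tells you which of the two lists contains the hash.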

robert774 · December 6, 2019, 12:45pm

Hello @HenningAndersen,

I tried option 2 (size=0) and it really did the job: the response now takes 2 ms.
Many thanks for your support, it solved my problem.

I'll also check option 1, which looks more native to Elasticsearch. I just want to confirm that I understand it correctly: do I have to parse the file and put each hash under its own index?
e.g.

index=1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
	"hash":"1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
	"list":"list1"
}

index=2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
	"hash":"2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
	"list":"list2"
}

and query with

http://myhost:9201/_search?q=hash:1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52

or just omit the "hash" field in the JSON body, since it duplicates the index name, and then query using just the index:

http://myhost:9201/1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52/_search

HenningAndersen · December 6, 2019, 1:25pm

Hi @robert774,

You should not create an index per hash. An index can contain many documents, so you should instead add all hashes to a single index (or two, one per list).

You could do something like this:

POST list1/_doc
{
  "hash": "1ee81....."
}

and then be able to search using:

GET list1/_search
{
  "query": {
    "match": {
      "hash": {
        "query": "1ee81...."
      }
    }
  }
}

Like in your original setup, you should make the hash field type keyword, since you have no need for analysis on that field.
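
Indexing roughly 4,000,000 hashes one request at a time would be slow, so a common approach is the _bulk API, which takes newline-delimited JSON: an action line followed by a document line for each document. The sketch below converts the two-list file structure from the original post into such a body. The index name "hashes", the "list" field and the helper names are assumptions for illustration, and the action line omits a type, which assumes a 7.x cluster:

```python
import json


def bulk_lines(lists, index="hashes"):
    """Yield _bulk NDJSON lines: an action line followed by a document line per hash."""
    # "hashes" is a hypothetical index name; its "hash" field should be mapped as keyword
    action = json.dumps({"index": {"_index": index}})
    for list_name, hashes in lists.items():
        for h in hashes:
            yield action
            yield json.dumps({"hash": h, "list": list_name})


def bulk_body(lists, index="hashes"):
    """Join the lines into one request body; _bulk requires a trailing newline."""
    return "\n".join(bulk_lines(lists, index)) + "\n"


if __name__ == "__main__":
    # Tiny stand-in for the real file, which has the same {"list1": [...], "list2": [...]} shape
    sample = {"list1": ["aaa", "bbb"], "list2": ["ccc"]}
    print(bulk_body(sample), end="")
```

The resulting body would be sent to the _bulk endpoint with Content-Type: application/x-ndjson. For the full file it is probably better to send it in batches of a few thousand documents instead of one huge request.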

