How to improve poor search performance

Hello,
I'm totally new to Elasticsearch and am looking for the best approach for my case: I need to search a relatively large (~500 MB) file containing two lists of SHA-512 hashes (~4,000,000), and find out whether a hash is on a list and, if so, on which one. The result of my first approach is far below expectations (~5 s to search for a hash on a given list), to the point that I wonder whether Elasticsearch is the wrong tool for my problem and I should go with an RDBMS instead.

sample file structure:

{
    "list1": [ "1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",        "32a381ecb65cee78f44a4c0e282af46bd8f1909522801e087d396e61ed998896d1a65bb074f73c9e22b886bfb71286d16cedebe8b3ea19823c46f23e7566afd2"
],
    "list2": [ "2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",        "42a381ecb65cee78f44a4c0e282af46bd8f1909522801e087d396e61ed998896d1a65bb074f73c9e22b886bfb71286d16cedebe8b3ea19823c46f23e7566afd2"
]
}

what I have done:

1. Put the mapping for myindex:

{"mappings": {"properties": {
   "list1":    {
      "type": "keyword"
   },
   "list2":    {
      "type": "keyword"
      }
}}}
2. Upload the file with curl:
curl -X POST http://myhost:9201/myindex/_doc  -H "Content-Type: application/json" --data-binary @samplefile.txt
3. Search for a hash:
http://myhost:9201/myindex/_search?q=list1:1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52&_source_excludes=*

It works, but it is slow (a few seconds). Is there anything I can change to make searching faster? I tried different queries (terms, match), but there was no difference in time.

@robert774, I think you are uploading all the hashes into just one document? That is likely to lead to issues, since Elasticsearch reads out the document when returning the response. Even with the source excluded, I think this is going to cause issues.

I can think of two immediate options:

  1. Index each hash as its own document, and either add a field indicating which list it belongs to or use two different indices.
  2. Keep the current setup, but also set size to 0 on the request. You can tell whether anything matched from the total hits returned.
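To illustrate option 2, here is a minimal Python sketch. It assumes the host, index, and field names from the original post; the helper names are made up for the example. It only builds the request URL and reads `hits.total` from a parsed response, so sending the request is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode


def build_exists_url(base_url, index, field, value):
    """Build a URI search with size=0, so no documents are returned, only counts."""
    params = urlencode({"q": f"{field}:{value}", "size": 0})
    return f"{base_url}/{index}/_search?{params}"


def total_hits(response):
    """Read hits.total from a parsed search response.

    In Elasticsearch 7+ it is an object like {"value": 1, "relation": "eq"};
    in earlier versions it is a plain integer.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def hash_is_listed(response):
    """True if at least one document matched the query."""
    return total_hits(response) > 0


if __name__ == "__main__":
    # "abc123" stands in for a real 128-character SHA-512 hex string.
    print(build_exists_url("http://myhost:9201", "myindex", "list1", "abc123"))
```

Running the same query once against list1 and once against list2 tells you which list, if any, contains the hash.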

hello @HenningAndersen

I tried option 2 (size=0) and it really did the job: the response now comes back in 2 ms.
Thank you very much for your support, it solved my problem.

I'll also check option 1, which looks more native to Elasticsearch. I just want to confirm that I understand it correctly: do I have to parse the file and put each hash under its own index?
e.g.

index=1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
	"hash":"1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
	"list":"list1"
}

index=2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52
id="" (not explicitly defined so it will be random)
{
	"hash":"2ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52",
	"list":"list2"
}

and query with

http://myhost:9201/_search?q=hash:1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52

or just omit the "hash" field from the JSON body, since it duplicates the index name, and then query using just the index:

http://myhost:9201/1ee8177a981d52827acf78c649a3328bbdb3aaabf705ac5aece11507f8f7b3e32ef5eae38cd9d08cb53cdf6f72707920910cf454ca0b8466245367af880cdf52/_search

Hi @robert774,

You should not create an index per hash. An index can contain many documents, so you should instead add all hashes to a single index (or two, one per list).

You could do something like this:

POST list1/_doc
{
  "hash": "1ee81....."
}

and then be able to search using:
GET list1/_search
{
  "query": {
    "match": {
      "hash": {
        "query": "1ee81...."
      }
    }
  }
}

Like in your original setup, you should make the hash field type keyword, since you have no need for analysis on that field.
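With around four million hashes, sending one POST per hash is probably not practical, so the `_bulk` API is the usual way to load them. Below is a minimal Python sketch that turns the original file structure into `_bulk` request bodies, one index per list. The function names and the batch size of 5000 are assumptions for illustration, not tuned values.

```python
import json

# Mapping for each per-list index: "hash" as keyword, so it is not analyzed.
MAPPING = {"mappings": {"properties": {"hash": {"type": "keyword"}}}}


def bulk_bodies(lists, batch_size=5000):
    """Yield _bulk request bodies (NDJSON) with at most batch_size documents each.

    `lists` has the shape of the original file: {"list1": [...], "list2": [...]}.
    Each document is an action line followed by a source line, and the body
    must end with a newline.
    """
    batch = []
    for list_name, hashes in lists.items():
        for h in hashes:
            batch.append(json.dumps({"index": {"_index": list_name}}))
            batch.append(json.dumps({"hash": h}))
            if len(batch) >= 2 * batch_size:
                yield "\n".join(batch) + "\n"
                batch = []
    if batch:
        yield "\n".join(batch) + "\n"


def term_query(hash_value):
    """Exact-match lookup body for GET <index>/_search; size 0 returns only the hit count."""
    return {"size": 0, "query": {"term": {"hash": hash_value}}}


if __name__ == "__main__":
    # Tiny stand-in for json.load(open("samplefile.txt")).
    sample = {"list1": ["aaa", "bbb"], "list2": ["ccc"]}
    for body in bulk_bodies(sample, batch_size=2):
        print(body)
```

Each body would be sent as `POST /_bulk` with `Content-Type: application/x-ndjson`, after creating list1 and list2 with the mapping above. Searching both indices at once (`GET list1,list2/_search`) and looking at `_index` in the hits is one way to see which list matched, but that needs a size greater than 0.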
