How to increase efficiency of search queries in Elasticsearch

70 MB in total?

What gives

GET /_cat/indices?v

This is how it looks right now. I am fetching documents from the second index.

Please don't post images of text as they are hardly readable and not searchable.

Instead paste the text and format it with </> icon. Check the preview window.

Can you share the code you are using to extract all the data?

I am using Python.

First I make my connection:

es = ES(host=host, port=port, timeout=100)

Then I search for the first 10000 documents:

scroll_data = es.search(
    index=index,
    scroll='1m',
    size=10000,
    body={
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "language": "3"
                        }
                    },
                    {
                        "match": {
                            "statusid": "5"
                        }
                    },
                    {
                        "range": {
                            "cp": {
                                "gte": "1"
                            }
                        }
                    },
                    {"wildcard": {"list": "*17*"}}
                ]
            }
        }
    })

Then I use scroll for the rest:

sid = scroll_data['_scroll_id']
# Remaining hits = reported total minus the hits already returned in the first page
scroll_size = scroll_data['hits']['total']['value'] - len(scroll_data['hits']['hits'])
final_scroll_data = []
final_scroll_data.extend(scroll_data['hits']['hits'])

while scroll_size > 0:
    # Fetch the next page and keep the scroll context alive for another minute
    each_scroll = es.scroll(scroll_id=sid, scroll='1m')
    sid = each_scroll['_scroll_id']
    scroll_size = scroll_size - len(each_scroll['hits']['hits'])
    final_scroll_data.extend(each_scroll['hits']['hits'])
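A minimal sketch of the same scroll loop, written so that it stops when a page comes back empty rather than counting down from the reported total, and so that it releases the scroll context when done. The function name `scroll_all` and its parameters are hypothetical; it only assumes a client object exposing `search`, `scroll` and `clear_scroll` the way the Python client used above does.

```python
def scroll_all(client, index, query, page_size=10000, keep_alive="1m"):
    """Collect every hit for `query`, scrolling until a page comes back empty."""
    page = client.search(index=index, scroll=keep_alive, size=page_size, body=query)
    scroll_id = page["_scroll_id"]
    hits = []
    try:
        while page["hits"]["hits"]:
            hits.extend(page["hits"]["hits"])
            # Ask for the next page and refresh the scroll keep-alive.
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # Free the server-side scroll context instead of waiting for it to expire.
        client.clear_scroll(scroll_id=scroll_id)
    return hits
```

It could be called as `scroll_all(es, index, body)` with the same query body as above. This does not make the transfer itself faster; it only makes the loop independent of how the total hit count is reported.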

You are using a wildcard query with wildcards at both ends, which is the most inefficient query you can run in Elasticsearch. What does CPU usage look like while the query is running? What is the specification of the host?


Yeah. This is weird:

{ "wildcard": { "list" : "*17*" }}

Look at the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html

Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance.
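Besides being slow, a `*17*` pattern matches any value that contains the characters 17, so ids such as 117 or 170 would match too. The sketch below illustrates this in plain Python against a comma-separated string, and shows the shape of an exact `term` clause that could replace the wildcard. It assumes the list were indexed as separate values in a field here called `list_ids`, which is a hypothetical name and would require changing how the documents are indexed.

```python
from fnmatch import fnmatchcase


def wildcard_matches(value, pattern="*17*"):
    """Approximate how the wildcard pattern treats the raw string."""
    return fnmatchcase(value, pattern)


def contains_id(value, wanted="17"):
    """Exact membership test on a comma-separated list of ids."""
    return wanted in value.split(",")


def term_clause(field, wanted):
    """Exact-match clause; assumes `field` holds one value per id (hypothetical mapping)."""
    return {"term": {field: wanted}}


for raw in ("17,603,554", "117,603", "28,170"):
    print(raw, wildcard_matches(raw), contains_id(raw))

print(term_clause("list_ids", "17"))
```

Only the first string really contains id 17, but the wildcard pattern accepts all three.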

What gives:

GET your_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "language" : "3"
          }
        },
        {
          "match": {
            "statusid" : "5"
          }
        },
        {
          "range": {
            "cp": {
              "gte": "1"
            }
          }
        },
            { "wildcard": { "list" : "*17*" }}
      ]
    }
  }
}

@dadoonet, @Christian_Dahlqvist Thanks for the suggestion.

But with or without the wildcard, it still takes a long time. It depends entirely on the data.

I have now set up two nodes on two different SSD machines, and there is no performance improvement.

Can you share the output of the query I asked to run?

Given the small data size I would expect the data to be cached no matter what type of disk you have. What is CPU utilisation looking like? Do you have swap enabled? Are there other processes running on the host that could be interfering?

Output of the query:

{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "sc-surveys2",
        "_type" : "surveys",
        "_id" : "140277170216",
        "_score" : 4.0,
        "_source" : {
          "@version" : "1",
          "qualificationanswerid" : 216,
          "description" : "Bio-Tech",
          "surveystatusid" : 5,
          "epc" : 0,
          "qualificationid" : 70,
          "qualificationanswerdesc" : "Bio-Tech",
          "id" : "140277170216",
          "supplierlist" : "17,603,554,28,27,623,581,307,101,126,30",
          "surveyid" : 1402771,
          "cpi" : 2.75,
          "languageid" : 3,
          "@timestamp" : "2019-06-13T09:35:59.449Z",
          "qualificationname" : "STANDARD_INDUSTRY_PERSONAL"
        }
      }
    ]
  }
}

What are the first lines of the response, up to hits, if you change the size to:

 "size": 10000

When I run it with

"size": 10000

it gives a result like this:

{
  "took" : 344,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "sc-surveys2",
        "_type" : "surveys",
        "_id" : "13943646059",

Just curious - the "took" represents the time in milliseconds right? So it is 344 milliseconds to retrieve 10,000 documents as opposed to the 9 seconds mentioned, or am I missing something?

Exactly. That's what I wanted to see.
So the rest of the time is most likely spent on the network, I'd say.
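One way to check this from the client side is to compare the wall-clock time of a request with the `took` value in its response. This is a rough sketch; the helper name `timed_search` is hypothetical, and it takes any callable that returns a search response as a dict.

```python
import time


def timed_search(run_search):
    """Run a search callable and compare wall-clock time with the reported `took`."""
    start = time.perf_counter()
    response = run_search()
    wall_ms = (time.perf_counter() - start) * 1000.0
    took_ms = response["took"]
    return {
        "wall_ms": wall_ms,
        "took_ms": took_ms,
        # Time not accounted for by `took`: response serialization,
        # network transfer and client-side parsing.
        "outside_es_ms": wall_ms - took_ms,
    }
```

With the client from earlier in the thread it might be used as `timed_search(lambda: es.search(index=index, size=10000, body=body))`, where `body` stands for the query shown above. If `outside_es_ms` dominates, the time is being spent outside query execution.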


Okay, thanks. Is there any way I can overcome this situation?

What if you run the extraction on the same machine Elasticsearch is running on?


When I run the extraction on the same machine, it is quick compared to when I run it on another machine. But in my scenario I want to run the extraction from a different machine.

So we are now pretty sure that you have a network problem.
I don't think that we can fix anything on the Elasticsearch side, but at least you should investigate what kind of network connection you have between your client and the Elasticsearch server.
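The network itself has to be investigated separately, but if the link turns out to be slow, sending fewer bytes per hit may help. The sketch below uses the sample document from this thread to estimate how much smaller each hit gets when only some fields are requested through `_source` filtering. Which fields the extraction actually needs is an assumption here (`surveyid` and `cpi` are picked only for illustration), and the helper names are hypothetical.

```python
import json

# _source of the sample hit posted earlier in the thread.
sample_source = {
    "@version": "1",
    "qualificationanswerid": 216,
    "description": "Bio-Tech",
    "surveystatusid": 5,
    "epc": 0,
    "qualificationid": 70,
    "qualificationanswerdesc": "Bio-Tech",
    "id": "140277170216",
    "supplierlist": "17,603,554,28,27,623,581,307,101,126,30",
    "surveyid": 1402771,
    "cpi": 2.75,
    "languageid": 3,
    "@timestamp": "2019-06-13T09:35:59.449Z",
    "qualificationname": "STANDARD_INDUSTRY_PERSONAL",
}


def with_source_filter(body, fields):
    """Return a copy of the search body that requests only `fields` in `_source`."""
    filtered = dict(body)
    filtered["_source"] = list(fields)
    return filtered


def payload_bytes(source):
    """Size of one document's _source when serialized as JSON."""
    return len(json.dumps(source).encode("utf-8"))


wanted = ["surveyid", "cpi"]  # assumption: the fields the extraction needs
trimmed = {key: sample_source[key] for key in wanted}

print("full _source bytes:", payload_bytes(sample_source))
print("trimmed _source bytes:", payload_bytes(trimmed))
print(with_source_filter({"query": {"match_all": {}}}, wanted))
```

The Python client can also be created with `http_compress=True`, which may reduce transfer time further when bandwidth is the bottleneck.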