70mb in total?
What does this return?
GET /_cat/indices?v
This is how it looks right now, and I am fetching documents from the second index.
Please don't post images of text as they are hardly readable and not searchable.
Instead, paste the text and format it with the </> icon. Check the preview window.
Can you share the code you are using to extract all the data?
I am using Python.
First I make my connection:
es = ES(host=host, port=port, timeout=100)
Then I search for the first 10000 documents:
scroll_data = es.search(
    index = index,
    scroll = '1m',
    size = 10000,
    body={
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "language" : "3"
                        }
                    },
                    {
                        "match": {
                            "statusid" : "5"
                        }
                    },
                    {
                        "range": {
                            "cp": {
                                "gte": "1"
                            }
                        }
                    },
                    { "wildcard": { "list" : "*17*" }}
                ]
            }
        }
    })
Then I use scroll for the rest:
sid = scroll_data['_scroll_id']
scroll_size = scroll_data['hits']['total']['value'] - size
final_scroll_data = []
final_scroll_data.extend(scroll_data['hits']['hits'])
while (scroll_size > 0):
    each_scroll = es.scroll(scroll_id = sid, scroll = '1m')
    sid = each_scroll['_scroll_id']
    scroll_size = scroll_size - len(each_scroll['hits']['hits'])
    final_scroll_data.extend(each_scroll['hits']['hits'])
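A note on the loop above: in Elasticsearch 7.x, `hits.total.value` is capped at 10000 by default (with `"relation": "gte"`), so a remaining-count computed from it can reach zero before all documents are fetched. A minimal sketch of an alternative, which simply scrolls until a page comes back empty, is shown here. The function name `scroll_all` and its parameters are hypothetical, and it assumes a 7.x-style Python client whose `search` accepts `body=`.

```python
def scroll_all(es, index, query, page_size=10000, keep_alive="1m"):
    """Collect every hit for `query` by paging with the scroll API.

    `es` is assumed to be an elasticsearch-py 7.x style client.
    """
    page = es.search(
        index=index,
        scroll=keep_alive,
        size=page_size,
        body={"query": query},
    )
    scroll_id = page["_scroll_id"]
    hits = []
    try:
        # Stop when a page is empty instead of trusting hits.total.value,
        # which may be a lower bound rather than an exact count.
        while page["hits"]["hits"]:
            hits.extend(page["hits"]["hits"])
            page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # Release the scroll context on the server once we are done.
        es.clear_scroll(scroll_id=scroll_id)
    return hits
```

It would be called as `scroll_all(es, index, {"bool": {"must": [...]}})` with the same `bool` query as above.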
You are using a wildcard query with wildcards at both ends, which is the most inefficient query you can run in Elasticsearch. What does CPU usage look like while the query is running? What is the specification of the host?
Yeah. This is weird:
{ "wildcard": { "list" : "*17*" }}
Look at the documentation: Wildcard query | Elasticsearch Guide [8.11] | Elastic
Avoid beginning patterns with * or ?. This can increase the iterations needed to find matching terms and slow search performance.
What does this return:
GET your_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "language" : "3"
          }
        },
        {
          "match": {
            "statusid" : "5"
          }
        },
        {
          "range": {
            "cp": {
              "gte": "1"
            }
          }
        },
        { "wildcard": { "list" : "*17*" }}
      ]
    }
  }
}
@dadoonet, @Christian_Dahlqvist Thanks for the suggestion.
But with or without the wildcard it takes the same time. It totally depends on the data.
Right now I have set up two nodes on two different SSD machines and there is no performance improvement.
Can you share the output of the query I asked to run?
Given the small data size I would expect the data to be cached no matter what type of disk you have. What is CPU utilisation looking like? Do you have swap enabled? Are there other processes running on the host that could be interfering?
Output of the query:
{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "sc-surveys2",
        "_type" : "surveys",
        "_id" : "140277170216",
        "_score" : 4.0,
        "_source" : {
          "@version" : "1",
          "qualificationanswerid" : 216,
          "description" : "Bio-Tech",
          "surveystatusid" : 5,
          "epc" : 0,
          "qualificationid" : 70,
          "qualificationanswerdesc" : "Bio-Tech",
          "id" : "140277170216",
          "supplierlist" : "17,603,554,28,27,623,581,307,101,126,30",
          "surveyid" : 1402771,
          "cpi" : 2.75,
          "languageid" : 3,
          "@timestamp" : "2019-06-13T09:35:59.449Z",
          "qualificationname" : "STANDARD_INDUSTRY_PERSONAL"
        }
      }
    ]
  }
}
What are the first lines, up to hits, if you change the size to
"size": 10000
When I run it with
"size": 10000
it gives a result like:
{
  "took" : 344,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 4.0,
    "hits" : [
      {
        "_index" : "sc-surveys2",
        "_type" : "surveys",
        "_id" : "13943646059",
Just curious - the "took" represents the time in milliseconds right? So it is 344 milliseconds to retrieve 10,000 documents as opposed to the 9 seconds mentioned, or am I missing something?
Exactly. That's what I wanted to see.
So the rest of time is most likely spent on the network I'd say.
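One way to check this from the client side is to compare the `took` value reported by Elasticsearch with the wall-clock time the client spends waiting for the same request. The sketch below is only an illustration: `timed_search` is a hypothetical helper, and the difference it reports lumps together network transfer, response serialization, and JSON parsing in the client.

```python
import time


def timed_search(es, **search_kwargs):
    """Run es.search(...) and compare server-side and client-side timings.

    `es` is assumed to be an elasticsearch-py style client whose search()
    returns a dict containing a "took" field in milliseconds.
    """
    start = time.perf_counter()
    response = es.search(**search_kwargs)
    wall_ms = (time.perf_counter() - start) * 1000

    took_ms = response["took"]  # time measured by Elasticsearch itself
    return {
        "took_ms": took_ms,
        "wall_ms": wall_ms,
        # Everything outside query execution: transfer, serialization, parsing.
        "overhead_ms": wall_ms - took_ms,
    }
```

Running it once from the Elasticsearch host and once from the remote client, with the same `index`, `size`, and `body`, should show whether `overhead_ms` grows mainly when the request crosses the network.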
Okay, thanks. So is there any way I can overcome this situation?
What if you run the extraction on the same machine Elasticsearch is running on?
When I run the extraction on the same machine it is quick compared to when I run it on another machine. But in my scenario I want to run the extraction from a different machine.
So we are now pretty sure that you have a network problem.
I don't think that we can fix anything on the Elasticsearch side, but at least you should investigate what kind of network connection you have between your client and the Elasticsearch server...