Python Elasticsearch query not returning the expected results when running subsequent calls


(G Kerekes) #1

Hello,

I am querying an Elasticsearch index from Python. Issue 1 is that when I
change my query and rerun it, my objects in Python don't get refreshed
according to my modified query. Issue 2 is that even if I see that I got
some hits, no data comes through at all (e.g. I see I've got 85k hits, but
when I put them in a dictionary, it is blank).

from elasticsearch import Elasticsearch

es = Elasticsearch("host:port", timeout=600, max_retries=10, revival_delay=0)

origall = es.search('esdata' ,'primary',
{"query":
{"bool":
{"must_not":
[{
"term": {"file": "original"}
}]
}
}
,"size" : "0"}
)

total_o = origall['hits']['total']

At this stage for total_o I get 110k, which is correct. Then I rerun my
query after changing the size=0 to size=20, and if I want to have a look at
these 20 hits, I get nothing for this:

orig = origall['hits']['hits']
print(orig)

Then I go back to my original query and change the must_not to must. In
this way I should get 85k hits, but after rerunning it I still get 110k in
total_o.

It is quite random when it works and when it doesn't. Sometimes I get my
expected 85k hits, but then this gets stuck, and when I change my query back
to get the 110k, it is still 85k. Also, sometimes I get data in my
orig = origall['hits']['hits'], but then, let's say, I change the size in my
query to 0 and rerun it, and origall['hits']['hits'] still gives me
back the data.
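
One way to make this kind of rerun less error-prone is to build the search body from parameters instead of editing a literal dict by hand. The following is only a minimal sketch: `build_body` is a hypothetical helper (not part of the elasticsearch library), and the index, type, and field names are the ones from this post.

```python
def build_body(term_value, clause="must", size=0):
    """Return a fresh search body; `clause` is "must" or "must_not"."""
    if clause not in ("must", "must_not"):
        raise ValueError("clause must be 'must' or 'must_not'")
    return {
        "query": {"bool": {clause: [{"term": {"file": term_value}}]}},
        "size": int(size),  # an integer rather than the string "0"
    }


# One body for counting, one for fetching 20 documents.
count_body = build_body("original", clause="must_not", size=0)
sample_body = build_body("original", clause="must_not", size=20)

# With a live cluster the bodies would then be passed by keyword, e.g.:
# origall = es.search(index='esdata', doc_type='primary', body=sample_body)
```

Each call returns a new dict, and assigning each result to its own variable makes it easier to see which response belongs to which query.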

I use Anaconda, but I also tried PyCharm and the default Python IDLE;
these behave the same. I tried creating separate ES connections for each of
my queries, but that doesn't help. I played around with the cache, but no luck.

I'm running it on a 64 bit, Windows 7 machine.

Any idea what I'm doing wrong? Many thanks,

Geza

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/adf4f92a-59f3-4189-ab87-8a2c13de7022%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Honza Král) #2

Hi Geza,

I don't understand what you mean by re-running, can you post the complete code?

When you do a search with size: 20, can you just print the result of
the search method and see if that data is there?

As a side note, it looks like you are trying to filter out some data.
While this works with a query, you will get much better performance
by using a filtered query with a filter instead of a plain query.
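
As a rough illustration of that suggestion, the sketch below shows the same term written as a plain query and as a filtered query. It assumes the Elasticsearch 0.90/1.x query DSL current at the time of this thread; the `filtered` query was deprecated in 2.0 and removed in 5.0, where a `bool` query with a `filter` clause plays that role. The field and value are taken from the original post.

```python
term = {"term": {"file": "original"}}

# Query form, as in the original post: the clause takes part in scoring.
query_body = {"query": {"bool": {"must_not": [term]}}, "size": 20}

# Filtered form (Elasticsearch 0.90/1.x): match everything, then filter.
# Filters are not scored and can be cached.
filtered_body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {"bool": {"must_not": [term]}},
        }
    },
    "size": 20,
}

# Later versions (2.0 and up): a bool query with a filter clause.
bool_filter_body = {
    "query": {"bool": {"filter": [{"bool": {"must_not": [term]}}]}},
    "size": 20,
}
```

Any of these dicts would be passed as the `body` argument of `es.search`.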

Honza

On Mon, Jan 13, 2014 at 10:38 AM, G Kerekes kerekesg@gmail.com wrote:



(G Kerekes) #3

Hi Honza,

This is my "full" code:

from elasticsearch import Elasticsearch
import json
import pandas as pd
import numpy as np
import os

# create the connection to the ES

es = Elasticsearch("host:port", timeout=600, max_retries=10, revival_delay=0)

############################################################
####### READ IN THE ORIGINAL SURVEY DATA ###################
############################################################

origall = es.search('survey_data' ,'primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original"}
}]
}
}
,"size" : "0"}
)

total_o = origall['hits']['total']

origall_o = es.search('tns_survey_data','primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original_amit2"}
}]
}
}
,"size" : 20

                }

)

# force it to data frame

orig_dict = origall_o['hits']['hits']

############################################################
####### READ IN THE NEW SURVEY DATA ########################
############################################################

# get the documents

newall = es.search('survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : "0"
}
)

total_n = newall['hits']['total']

newall_n = es.search('tns_survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : 20
}
)

# force it to data frame

new_dict = newall_n['hits']['hits']

print(origall_o)
print(newall_n)

print orig_dict

print new_dict

And when I run it I get this:

print(origall_o)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print(newall_n)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print orig_dict
[]

print new_dict
[]

And what I would expect is:
origall_o total is correct (110k hits)
newall_n total should be 84k, not sure why it has the same 110k as for the
origall_o

And for the orig_dict and new_dict I would expect to see those 20 documents
that I query.
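
To compare the reported total with the number of documents actually returned, a small check like the one below may help. It is a sketch only: `summarize` is a hypothetical helper, and the sample dict is the response printed above with the u'' prefixes dropped. Newer Elasticsearch versions report the total as a dict with a "value" key, which the helper also accepts.

```python
def summarize(response):
    """Return (total_hits, returned_docs) for a search response dict."""
    hits = response["hits"]
    total = hits["total"]
    if isinstance(total, dict):  # newer versions: {"value": ..., "relation": ...}
        total = total["value"]
    return total, len(hits["hits"])


# The response printed above for origall_o.
origall_o = {
    "hits": {"hits": [], "total": 110950, "max_score": 0.7038795},
    "_shards": {"successful": 3, "failed": 0, "total": 3},
    "took": 15,
    "timed_out": False,
}

print(summarize(origall_o))  # (110950, 0)
```

A result of (110950, 0) is what a request with size 0 would produce: a total, but no documents in the hits list.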

Many thanks for your help.

Geza

On Monday, January 13, 2014 12:16:53 PM UTC, Honza Král wrote:



(Honza Král) #4

I am sorry, I don't see why it should match at all - you are searching
for different things in different indices.

On Mon, Jan 13, 2014 at 3:08 PM, G Kerekes kerekesg@gmail.com wrote:



(G Kerekes) #5

Just noticed some typos in my code, please see the fixed one below (the
queried index and filter terms were not consistent)

On Monday, January 13, 2014 2:08:25 PM UTC, G Kerekes wrote:

Hi Honza,

This is my "full" code:

from elasticsearch import Elasticsearch
import json
import pandas as pd
import numpy as np
import os

# create the connection to the ES

es = Elasticsearch("host:port", timeout=600, max_retries=10, revival_delay=0)

############################################################
####### READ IN THE ORIGINAL SURVEY DATA ###################
############################################################

origall = es.search('survey_data' ,'primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original"}
}]
}
}
,"size" : "0"}
)

total_o = origall['hits']['total']

origall_o = es.search('survey_data','primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original"}
}]
}
}
,"size" : 20

                }

)

# force it to data frame

orig_dict = origall_o['hits']['hits']

############################################################
####### READ IN THE NEW SURVEY DATA ########################
############################################################

# get the documents

newall = es.search('survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : "0"
}
)

total_n = newall['hits']['total']

newall_n = es.search('survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : 20
}
)

# force it to data frame

new_dict = newall_n['hits']['hits']

print(origall_o)
print(newall_n)

print orig_dict

print new_dict

And when I run it I get this:

print(origall_o)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print(newall_n)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print orig_dict
[]

print new_dict
[]

And what I would expect is:
origall_o total is correct (110k hits)
newall_n total should be 84k, not sure why it has the same 110k as for the
origall_o

And for the orig_dict and new_dict I would expect to see those 20
documents that I query.
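
For the "force it to data frame" step, once the hits list is non-empty, one possible approach is to flatten each hit into a plain dict first. This is a sketch under assumptions: `hits_to_rows` is a hypothetical helper, and the example hits are made up for illustration; they are not data from the index discussed here.

```python
def hits_to_rows(hits):
    """Flatten search hits into one plain dict per document."""
    rows = []
    for hit in hits:
        row = dict(hit.get("_source", {}))  # copy the stored fields
        row["_id"] = hit.get("_id")         # keep the document id alongside
        rows.append(row)
    return rows


# Hypothetical hits, made up for illustration only.
example_hits = [
    {"_id": "1", "_source": {"file": "original", "answer": 3}},
    {"_id": "2", "_source": {"file": "original", "answer": 5}},
]
rows = hits_to_rows(example_hits)

# With pandas available, the rows can then be loaded directly:
# df = pd.DataFrame(rows)
```

With an empty hits list, as in the output above, the helper returns an empty list and the resulting frame would be empty too.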

Many thanks for your help.

Geza



(G Kerekes) #6

Sorry Honza, I tried to somewhat anonymize my code, but I was not
consistent. Basically, I always query the same index, and my filter terms
are also consistent (original = original_amit2 and tns_survey_data =
survey_data).

2014/1/13 G Kerekes kerekesg@gmail.com



(Honza Král) #7

I can't replicate your problem; it all works for me. Could you please
isolate a working example that reproduces the behavior? Thanks

from elasticsearch import Elasticsearch
es = Elasticsearch()
es.index(index='i', doc_type='t', id=42, body={'hello': 'world'})
es.index(index='i', doc_type='t', id=47, body={'hello': 'universe'})
es.indices.refresh()
es.search(index='i', doc_type='t', body={"query": {"match_all": {}}, "size": 0})
es.search(index='i', doc_type='t', body={"query": {"match_all": {}}, "size": 1})

works just fine for me

On Mon, Jan 13, 2014 at 3:23 PM, G Kerekes kerekesg@gmail.com wrote:

Sorry Honza, I tried to somewhat anonimize my code but I was not consistent.
Basically I always query the same index, and my filter terms are also
consistent (original = original_amit2 and tns_survey_data = survey_data).

2014/1/13 G Kerekes kerekesg@gmail.com

Just noticed some typos in my code, please see the fixed one below (the
queried index and filter terms were not consistent)

On Monday, January 13, 2014 2:08:25 PM UTC, G Kerekes wrote:

Hi Honza,

This is my "full" code:

from elasticsearch import Elasticsearch
import json
import pandas as pd
import numpy as np
import os

create the connection to the ES

es = Elasticsearch("host:port", timeout=600, max_retries=10,
revival_delay=0)

############################################################
####### READ IN THE ORIGINAL SURVEY DATA ###################
##############################
##############################

origall = es.search('survey_data' ,'primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original"}
}]
}
}
,"size" : "0"}
)

total_o = origall['hits']['total']

origall_o = es.search('survey_data','primary',
body = {"query":
{"bool":
{"must":
[{
"term": {"file": "original"}
}]
}
}
,"size" : 20

                }

)

force it to data frame

orig_dict = origall_o['hits']['hits']

############################################################
####### READ IN THE NEW SURVEY DATA ########################
############################################################

get the documents

newall = es.search('survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : "0"
}
)

total_n = newall['hits']['total']

newall_n = es.search('survey_data','primary',
{"query":
{
"bool":
{
"should":[
{
"term":{
"file":"destinationqc22"
}
},
{
"term":{
"file":"destinationqc33"
}
},
{
"term":{
"file":"destinationqc44"
}
}
]
}
}
,"size" : 20
}
)

force it to data frame

new_dict = newall_n['hits']['hits']

print(origall_o)
print(newall_n)

print orig_dict

print new_dict

And then I run it I get this:

print(origall_o)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print(newall_n)
{u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
u'timed_out': False}

print orig_dict
[]

print new_dict
[]

And what I would expect is:
origall_o total is correct (110k hits)
newall_n total should be 84k, not sure why it has the same 110k as for
the origall_o

And for the orig_dict and new_dict I would expect to see those 20
documents that I query.

Many thanks for your help.

Geza
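
The expectation described above (a total count plus up to 20 returned documents) can be checked directly against the response dictionary. The sketch below is only an illustration: `summarize_response` is a hypothetical helper, not part of the elasticsearch client, and the sample dictionary is the output printed above. It also assumes that newer Elasticsearch versions (7.x onward) report the total as a dict with a `value` key, whereas the version in this thread returns a plain integer.

```python
def summarize_response(response):
    """Return (total, returned) for a search response dictionary.

    Hypothetical helper for illustration; not part of the client library.
    """
    hits = response["hits"]
    total = hits["total"]
    # Newer Elasticsearch versions (7.x onward) report the total as a dict
    # such as {"value": 110950, "relation": "eq"}; older ones use an int.
    if isinstance(total, dict):
        total = total["value"]
    return total, len(hits["hits"])


# Response copied from the output printed in this thread.
sample = {
    "hits": {"hits": [], "total": 110950, "max_score": 0.7038795},
    "_shards": {"successful": 3, "failed": 0, "total": 3},
    "took": 15,
    "timed_out": False,
}

total, returned = summarize_response(sample)
print("total=%d returned=%d" % (total, returned))
```

A non-zero total with zero returned documents is what a request with size 0 would produce, so comparing the two numbers for each call shows quickly whether the size 20 request is the one whose response is being inspected.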

On Monday, January 13, 2014 12:16:53 PM UTC, Honza Král wrote:

Hi Geza,

I don't understand what you mean by re-running, can you post the
complete code?

When you do a search with size: 20, can you just print the result of
the search method and see if that data is there?

As a side note, it looks like you are trying to filter out some data.
While this works with a query, you will get much better performance
when using a filtered query with a filter instead of a query.

Honza
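
The filtered query mentioned above could look roughly like the sketch below. It is a sketch under assumptions, not a definitive form: `filtered_term_body` is a hypothetical helper name, and the field and value are taken from the code in this thread. It also assumes an Elasticsearch version that still supports the `filtered` query; that query was deprecated in 2.0 and removed in 5.0, where a `bool` query with a `filter` clause serves the same purpose.

```python
import json


def filtered_term_body(field, value, size=20):
    """Build a search body that applies a term filter via a filtered query.

    Hypothetical helper for illustration; only valid on Elasticsearch
    versions that still accept the "filtered" query.
    """
    return {
        "query": {
            "filtered": {
                # match_all leaves scoring out of the picture; the filter
                # alone decides which documents are returned.
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        },
        "size": size,
    }


body = filtered_term_body("file", "original", size=20)
print(json.dumps(body, indent=2, sort_keys=True))
```

The resulting dictionary would be passed as the `body` argument of `es.search`, in place of the `bool` query used in the original code.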




(G Kerekes) #8

I have been trying to, but it's difficult. I ran it on 4 different
machines in the office: it works fine on 2, but not on mine and 1 other.
I looked at the HTTP calls my machine makes while running the code, and
it seems that no calls are going out to ES. I'm not sure what causes it,
but it seems like a Python/machine issue rather than an ES one (when I
query ES from the Sense plugin in Chrome I always get the correct hits).
But thanks for trying anyway.

Geza
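
One way to check from inside Python whether requests are actually being sent is to turn on the client's logging. The sketch below is only an assumption-laden illustration: it relies on the installed elasticsearch-py version using the standard `logging` loggers named `elasticsearch` and `elasticsearch.trace`, which may not hold for every release, and `enable_es_logging` is a hypothetical helper name.

```python
import logging
import sys


def enable_es_logging(stream=sys.stderr):
    """Attach a DEBUG-level stream handler to the client's loggers.

    Hypothetical helper; the logger names are an assumption about the
    installed elasticsearch-py version.
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    configured = []
    for name in ("elasticsearch", "elasticsearch.trace"):
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(handler)
        # Avoid duplicate lines through parent or root handlers.
        logger.propagate = False
        configured.append(logger)
    return configured


loggers = enable_es_logging()
```

If the helper is called before the searches run and nothing is logged for a call, that would support the observation that no request leaves the machine for that call.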

On Monday, January 13, 2014 2:35:57 PM UTC, Honza Král wrote:

I can't replicate your problem, for me it all works. Could you please
isolate a working example that reproduces your behavior? Thanks

from elasticsearch import Elasticsearch
es = Elasticsearch()
es.index(index='i', doc_type='t', id=42, body={'hello': 'world'})
es.index(index='i', doc_type='t', id=47, body={'hello': 'universe'})
es.indices.refresh()
es.search(index='i', doc_type='t', body={"query": {"match_all": {}},
"size": 0})
es.search(index='i', doc_type='t', body={"query": {"match_all": {}},
"size": 1})

works just fine for me




(system) #9