pyes question: search performance


(Abhishek Pratap) #1

Hi Guys

Another strange behavior in terms of performance. If I time the
following code I get good performance: about 100k search results in
2 seconds.

conn = ES(['128.55.54.149:9200','128.55.54.149:9201'],timeout=20)
q1 = TermQuery("tax_name",query.lower().strip())
results = conn.search(query=q1)

However, if I try to retrieve the top search result from the result object,
performance drops by a factor of about 100.

conn = ES(['128.55.54.149:9200','128.55.54.149:9201'],timeout=20)
q1 = TermQuery("tax_name",query.lower().strip())
results = conn.search(query=q1)
# getting the top hit
if results:
    return results[0]

I would expect that once the search is done and the results are returned in a
result object, post-processing would add negligible overhead, but that is not
what I am seeing. Am I doing something wrong?

thanks!
-Abhi

--


(Dan Fairs) #2

iirc, pyes fetches results lazily. That is, it won't actually execute the search until you start doing anything with 'results'. If you dig a bit deeper, you'll probably find that your search isn't actually being executed at all in your first example.
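
To illustrate the lazy behaviour described above, here is a minimal sketch in plain Python. `LazyResults` and `fake_search` are hypothetical names made up for this illustration; this is not the pyes implementation, only the general pattern of deferring the query until the results are first touched.

```python
class LazyResults:
    """Hypothetical stand-in for a lazily evaluated result set (not the pyes class)."""

    def __init__(self, run_query):
        self._run_query = run_query   # callable that performs the real search
        self._hits = None             # nothing has been fetched yet

    def _load(self):
        # Execute the search only once, on first access.
        if self._hits is None:
            self._hits = self._run_query()
        return self._hits

    def __len__(self):
        return len(self._load())

    def __getitem__(self, index):
        return self._load()[index]


calls = []

def fake_search():
    # Stands in for the HTTP round trip to Elasticsearch.
    calls.append("executed")
    return [{"tax_name": "cellvibrio"}]

results = LazyResults(fake_search)
calls_before_access = len(calls)   # building the object did no search work
top_hit = results[0]               # first access triggers the search
calls_after_access = len(calls)
```

Under this pattern, timing only the call that builds the result object would measure object construction rather than the query, and a check such as `if results:` would already trigger the fetch through `__len__`.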

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | www.fezconsulting.com

--


(Abhishek Pratap) #3

Thanks Dan. In that case, the performance I am getting is
much lower than I would expect. Currently I am getting 1000
search hits in 2-3 seconds. Can I do anything to improve this? Also,
what would be an optimal number one could get with an ES implementation?

With the tweaks suggested by Radu in an earlier thread I was able to index
12-15K records per second, and I expected to get 100k results per second
when searching.

-Abhi

--


(Abhishek Pratap) #4

Apologies for pushing this once more. Can anyone help me figure out why my
search queries on ES are slow? I am able to get 1000 results in 2-3
seconds (details in the first post of this thread). I would expect
performance to be at least 50-100 times faster than this. I hope that's
a realistic expectation.

best,
-Abhi

--


(Anton2) #5

Be sure to use the latest versions of pyes and requests, or the development version of pyes from GitHub (which goes back to using urllib3).
We had trouble with a certain combination of pyes and requests a few months ago. Requests was fetching data from the network one byte at a time, with abysmal performance…

--


(Abhishek Pratap) #6

Hi Anton

I have upgraded the requests module. I am now on requests version 0.13.8 and
pyes version 0.19.

I am still able to make only about 1000 searches in 2-3 seconds.

-Abhi

--


(Abhishek Pratap) #7

Hi guys

Sorry, I have to push this again. I am still not able to get
optimal search performance from ES.

My index contains 1.5 million records and I am able to make 800-1000
searches in 2 seconds using pyes. We have been trying to optimize ES
for our production work for a while now. Any help will be
appreciated.

Best,
-Abhi

--


(Abhishek Pratap) #8

And just in case anyone is interested, this is how I am testing search
performance:

import time

loop_start = time.clock()
q1 = TermQuery("tax_name", "cellvibrio")
for x in xrange(1000000):
    # report progress every 1000 searches
    if x % 1000 == 0 and x > 0:
        loop_check_point = time.clock()
        print 'took %s secs to search %d records' % (loop_check_point - loop_start, x)

    results = conn.search(query=q1)
    if results:
        # walk through every hit
        for r in results:
            pass
        print len(results)
    else:
        pass
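
One caveat about timing loops like this: on Unix, Python 2's `time.clock()` reports processor time rather than elapsed time, so time the client spends waiting on the network is largely not counted and the reported figures may understate how long the searches really take. The sketch below, written for Python 3, measures both clocks side by side. `fake_search` is a hypothetical stand-in for the real search call, and its sleep only models network waiting.

```python
import time

def fake_search():
    # Hypothetical stand-in for conn.search(...); the sleep models time
    # spent waiting on the network, which uses almost no CPU.
    time.sleep(0.001)
    return ["hit"]

def time_searches(search, n):
    """Return (wall_seconds, cpu_seconds) spent running `search` n times."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # CPU time used by this process
    for _ in range(n):
        for _hit in search():
            pass
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

wall, cpu = time_searches(fake_search, 50)
print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```

If the two numbers differ widely for the real workload, the client is mostly waiting rather than computing, and the wall-clock figure is the one that reflects search latency.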

-Abhi

--


(Abhishek Pratap) #9

Guys, I am stuck and need some guidance in order to move forward with ES.

The main bottleneck is the number of search queries I am able to make (1000 queries
in 2-3 seconds). Can this be scaled up?

I have also asked this on Stack Overflow but didn't get any response.

Thanks!
-Abhi

--


(David Pilato) #10

Just wondering if your test is correct. I mean: what do you want to test? Whether ES can deal with 1000 search requests?
If that's your question, you should parallelize your tests. In Java, you would create more test threads. I don't know how to do it with Python and pyes.

It seems that your test makes 1000 calls, one by one. It is as if you ran 1000 sequential curl http://www.google.com requests and measured how long they take...
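
As a rough illustration of parallelizing such a test in Python, the sketch below uses the standard library's `concurrent.futures` thread pool (Python 3). `fake_search` is a hypothetical stand-in that just sleeps for about 2 ms; whether a single pyes connection can safely be shared between threads is not something this sketch establishes, so a real test may need one connection per thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_search(term):
    # Hypothetical stand-in for one blocking search request (about 2 ms of waiting).
    time.sleep(0.002)
    return [term]

def run_parallel(terms, workers):
    """Issue one search per term from a pool of `workers` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_search, terms))
    return results, time.perf_counter() - start

terms = ["cellvibrio"] * 200
serial_results, serial_secs = run_parallel(terms, workers=1)
parallel_results, parallel_secs = run_parallel(terms, workers=8)
print("1 worker: %.3f s, 8 workers: %.3f s" % (serial_secs, parallel_secs))
```

Because each request spends most of its time waiting, several requests in flight at once can raise total throughput even though the latency of each individual request stays the same.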

My 2 cents.

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--


(Clinton Gormley) #11

Hi Abhi

On Thu, 2012-08-23 at 14:25 -0700, Abhishek Pratap wrote:

Guys I am stuck and need some guidance in order to move fwd with ES
and use it.

Mainly the bottle neck is #search queries I am able to make ( 1000
queries in 2-3 seconds). Can this be scaled up ?

You haven't provided any info about what hardware you are running this
on. You're currently getting a search result every 2ms. 500 queries per
second may be a good result.
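
The arithmetic behind these figures can be checked with a few lines of Python. `per_query_stats` is a hypothetical helper name used only for this illustration.

```python
def per_query_stats(num_queries, seconds):
    """Return (milliseconds per query, queries per second)."""
    ms_per_query = 1000.0 * seconds / num_queries
    queries_per_second = num_queries / seconds
    return ms_per_query, queries_per_second

fast = per_query_stats(1000, 2.0)   # 1000 queries in 2 seconds
slow = per_query_stats(1000, 3.0)   # 1000 queries in 3 seconds
```

So 1000 queries in 2-3 seconds corresponds to 2-3 ms per query, or roughly 333-500 queries per second. A single client that waits for each response before sending the next request cannot exceed one query per round-trip time, whatever the server could handle.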

We don't know. We don't know what queries you're running, how you're
running your queries, what your data looks like, how much RAM you have,
what CPU you have, what disks, etc

As such, it's very difficult to tell you whether you can get better
results with your current setup.

clint

--


(Abhishek Pratap) #12

Hi Clint and David

Thanks for your reply. I didn't think about parallelizing the code, as I was
interested in seeing the best I could do with one thread and then maybe
using multiple threads.

Here are the answers to the questions Clint asked.

ES is running on a debian5-2 OS, x86_64, 16 cores, with 132 GB of RAM. The
filesystem being used is IBM GPFS.

The data I indexed has the following structure:
{
"tax_id" : 45,
"taxa_name" : "Mycocosm"
}

About 1.5 million such records were indexed into ES.

The query I am making is basically
{"taxa_name":"mycocosm"}

Once I am able to get good performance (I am not sure how good is good in
terms of ES), the plan is to insert 500 million such records and query them.

Thanks!
-Abhi

--


(Clinton Gormley) #13

Hi Abhishek

Thanks for your reply. I didn't think about parallelizing the code, as I
was interested in seeing the best I could do with one thread and then
maybe using multiple threads.

Right - running in parallel will very likely get you better throughput.

ES is running on a debian5-2 OS, x86_64 16 cores with 132 Gb of ram.
The filesystem being used is IBM GPFS.

You don't mention how much of that RAM is dedicated to the ES heap, or
whether anything else is running on that box.

You really want elasticsearch to be the only thing consuming resources,
so don't share it with other code. Re heap settings, you should make
the heap about 60-70% of total RAM, so that there is plenty of space
left for kernel file system caches.

Also, (you may already be doing this) you want to make sure that none of
that memory is ever being swapped out (see bootstrap.mlockall and ulimit
-l)

The data I indexed had the following structure
{
"tax_id" : 45,
"taxa_name" : "Mycocosm"
}

The query I am making is basically
{"taxa_name":"mycocosm"}

Do you need full text search on 'mycocosm'? eg do you need to find the
most relevant match, or find that text in text like "Mycocosms are FUN!"

Or do you just need to use the taxa_name as a filter (the equivalent of
WHERE taxa_name = 'mycocosm')

If the latter, then consider making taxa_name 'not_analyzed' and using a
term FILTER to search for it. Filters don't have the scoring phase of
queries, and their results can be cached, which means they perform
better.
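
A minimal sketch of what that could look like, written as Python dictionaries serialized to JSON. The bodies follow the Elasticsearch request format of that era as best understood here (a `string` field with `"index": "not_analyzed"`, and a `term` filter wrapped in `constant_score`); newer releases replaced this with a `keyword` field type, so check the documentation for the version in use. The type name `taxon` is a hypothetical placeholder.

```python
import json

# Mapping fragment: keep taxa_name as a single, untokenized term.
# "taxon" is a made-up type name for this illustration.
mapping = {
    "taxon": {
        "properties": {
            "taxa_name": {"type": "string", "index": "not_analyzed"}
        }
    }
}

# Search body: a term filter wrapped in constant_score, so matching
# documents are returned without a relevance-scoring phase.
search_body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"taxa_name": "Mycocosm"}}
        }
    }
}

mapping_json = json.dumps(mapping, indent=2)
search_json = json.dumps(search_body, indent=2)
print(search_json)
```

Note that a not_analyzed field is matched exactly, including case, so the term has to be written the way it was indexed ("Mycocosm" rather than "mycocosm").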

Once I am able to get good performance(I am not sure how good is good
in terms of ES) the plan is to insert 500 million such records and
query them.

Definitely try this in parallel, and from a different box - not the same
node where es is running

clint

--

