ES write performance

The error doesn't show when I specify more than one node, which is good
:), but if I put in only one it gives a division-by-zero error.

But now another issue popped up, this time on ES itself. I defined the
second node to be data-only, and its cluster name is the same as the master
node's. If I try to connect directly to the second node on 9501, or even
9201, I get a timeout.
How can I get a second node to work in the cluster?

On Tuesday, 29 October 2013 16:46:59 UTC, Honza Král wrote:

Oops, there was a bug in the sniffing code, now fixed in master:

https://github.com/elasticsearch/elasticsearch-py/commit/04afc03cdd6122bd8a7081f2956419866c0bcfa1

Can you please try again with master?

Thanks!

On Tue, Oct 29, 2013 at 5:29 PM, Mauro Farracha <farr...@gmail.com> wrote:

It's not related to Thrift; using HTTP shows the same behaviour.

On Tuesday, 29 October 2013 16:27:32 UTC, Mauro Farracha wrote:

Hi Honza,

Yep, I understand the issue around dependencies. The least you could do is
probably add this sort of information to the documentation.

Regarding the mapping issue, you were right: adding the index type
solved the problem.

Since elasticsearch-py uses connection pooling with round-robin by
default, I was wondering if I could get more improvement with two nodes
up, since I would distribute the load between two servers. But with
ThriftConnection it throws an error, and I don't understand why it happens,
since I'm pretty sure I'm passing the right configuration:

connection_pool.py", line 60, in select
self.rr %= len(connections)
ZeroDivisionError: integer division or modulo by zero
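
For context, line 60 there is a round-robin pick over the list of live
connections, so the modulo by zero means the pool ended up empty. A minimal
sketch (my reconstruction, not the client's actual code) of how that
produces this exact error:

class RoundRobinPool:
    # stand-in for the client's connection pool
    def __init__(self, connections):
        self.connections = connections
        self.rr = -1

    def select(self):
        connections = self.connections  # live connections; empty if sniffing found none
        self.rr += 1
        self.rr %= len(connections)  # len(...) == 0 -> ZeroDivisionError
        return connections[self.rr]

RoundRobinPool([]).select()  # ZeroDivisionError: integer division or modulo by zero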

Scenarios:

  • two nodes, sniff_* properties => ZeroDivisionError
  • one node, sniff_* properties => ZeroDivisionError (so it's an issue
    with the sniff properties?)
  • one node, no sniff_* properties => no problems
  • two nodes, no sniff_* properties => timeout connecting to ES.

Am I right in understanding that round-robin is used on each request? So I
would end up sending one bulk action to node1 and the second to node2?

Thanks

On Tuesday, 29 October 2013 16:01:17 UTC, Honza Král wrote:

On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha <farr...@gmail.com> wrote:

Thanks Honza.

I was importing the class, but not the right way :) I was missing the
connection part in the from import.

I thought that was the case; I updated the code so that it should work in
the future.

I changed the serializer (used ujson, as Jörg mentioned) and got an
improvement from 2.66MB/s to 4.7MB/s.

Ah, good to know, I will give it a try. I wanted to avoid additional
dependencies, but if it makes sense I will happily switch the client to
ujson. Have you also tried just passing in a big string?

Then I configured ThriftConnection and the write performance increased
to 6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

We still have to deserialize the response, which curl doesn't need to do,
so it will always have an advantage over us, I'm afraid; it shouldn't be
this big, though.

Have two questions:

  • using elasticsearch-py, the index mapping is not the same on the
    server side as when using pyes. Am I missing something? With pyes all the
    properties were there, but with elasticsearch-py only the type appears on
    the server side, and the properties are not the ones I specified. The
    server log shows "update_mapping [accesslogs] (dynamic)", which doesn't
    happen with pyes. I'm sure I'm missing some property/config.

    How I invoke it (the mapping is in the second post):
    self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type, body={'properties': self.doc_mapping})

body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
doc_type=self.doc_type,
body={self.doc_type: {'properties': self.doc_mapping}})
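
For completeness, a self-contained sketch of the corrected call; the index,
type and field names here are made up for illustration, not taken from the
original mapping:

from elasticsearch import Elasticsearch

client = Elasticsearch()
doc_type = 'accesslog'  # hypothetical doc type
doc_mapping = {'properties': {'ip': {'type': 'ip'},
                              'path': {'type': 'string'}}}
client.indices.put_mapping(
    index='accesslogs',
    doc_type=doc_type,
    body={doc_type: doc_mapping},  # body keyed by the doc_type, as above
)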

  • Also, can you guys share what's your performance on a single local
    node?

I haven't done any tests like this; it varies so much with different
HW/configuration/environment that there is little value in absolute
numbers. The only thing that matters is the relative speed of the python
clients, curl etc.

As I mentioned in my first post, these are my non-default
configurations; maybe there is still room for improvement? Not to mention,
of course, that these same settings were responsible for the 12MB/s with curl.

indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false

These look reasonable, though I am no expert. Also, when using SSDs you
might benefit from switching the kernel IO scheduler to noop:
https://speakerdeck.com/elasticsearch/life-after-ec2

On Tuesday, 29 October 2013 14:32:12 UTC, Honza Král wrote:

you need to import it before you intend to use it:

from elasticsearch.connection import ThriftConnection
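
A hedged end-to-end sketch, assuming the transport-thrift plugin is
installed and listening on its default port (9500):

from elasticsearch import Elasticsearch
from elasticsearch.connection import ThriftConnection

# also requires the `thrift` python package
client = Elasticsearch(hosts=['localhost:9500'],
                       connection_class=ThriftConnection)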

On Tue, Oct 29, 2013 at 3:02 PM, Mauro Farracha <farr...@gmail.com> wrote:

Ahhhh, you are the source! :)

As I mentioned in the first post, I also wrote a python script using
elasticsearch-py, and the performance was equal to pyes, but I couldn't get
it working with Thrift. The documentation available to me was not detailed
enough for me to understand how to fully use all the features, and the
Connection/Transport classes were a little confusing.

Maybe you could help me out... the error was:
self.client = Elasticsearch(hosts=self.elasticsearch_conn,
connection_class=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined

I have the ES thrift plugin installed (it works with pyes), I have the thrift
python module installed, and I import the class. I don't know what I'm missing.

On Tuesday, 29 October 2013 13:16:47 UTC, Honza Král wrote:

I am not familiar with pyes; I did, however, write elasticsearch-py
and made sure you can bypass the serialization by doing it yourself. If
needed you can even supply your own serializer: just create an instance
that has .dumps() and .loads() methods and behaves the same as
elasticsearch.serializer.JSONSerializer. You can then pass it to the
Elasticsearch class as an argument (serializer=my_faster_serializer).
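
A minimal sketch of such a serializer, assuming ujson is installed (the
class name is made up):

import ujson
from elasticsearch.serializer import JSONSerializer

class UJSONSerializer(JSONSerializer):
    def loads(self, s):
        return ujson.loads(s)

    def dumps(self, data):
        if isinstance(data, str):  # pass pre-serialized payloads through untouched
            return data
        return ujson.dumps(data)

# then: client = Elasticsearch(serializer=UJSONSerializer())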

On Tue, Oct 29, 2013 at 2:07 PM, Mauro Farracha <farr...@gmail.com> wrote:

Hi Honza,

Ok, that could be a problem. I'm passing a python dictionary to the
pyes driver. If I send a JSON-formatted string instead, could I skip the
serialization? Are you familiar with the pyes driver?

I saw this method signature, but I don't know what the "header" is,
and can the document be one full string with several documents?

index_raw_bulk(header, document)
(http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk)

Function helper for fast inserting
Parameters:

  • header – a string with the bulk header, must be ended with
    a newline
  • document – a json document string, must be ended with a
    newline
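
A hedged usage sketch (the index and type names are illustrative, and `es`
is assumed to be a pyes.ES instance):

header = '{"index": {"_index": "accesslogs", "_type": "accesslog"}}\n'
document = '{"ip": "10.0.0.1", "path": "/index.html"}\n'
es.index_raw_bulk(header, document)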

On Tuesday, 29 October 2013 12:55:55 UTC, Honza Král wrote:

Hi,

And what was the bottleneck? Had the python process maxed out the CPU, or
was it waiting for the network? You can try serializing the documents
yourself and passing json strings to the client's bulk() method to make
sure that's not the bottleneck (you can pass in a list of strings or just
one big string and we will just pass it along).

The python client does more than curl: it serializes data and parses
output, which is at least 2 cpu-intensive operations that need to happen.
One of them you can eliminate.
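
A sketch of the pre-serialized approach, assuming `docs` is an iterable of
dicts and the index/type names are made up:

import json

lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "accesslogs", "_type": "accesslog"}}))
    lines.append(json.dumps(doc))
client.bulk(body="\n".join(lines) + "\n")  # one big string, no client-side serialization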

On Tue, Oct 29, 2013 at 1:23 PM, joerg...@gmail.com <joerg...@gmail.com> wrote:

Also worth mentioning: the number of shards and replicas, which affects
indexing performance a lot.

Jörg

On Tuesday, 29 October 2013 17:18:47 UTC, Honza Král wrote:

When I just start two nodes with default settings it works just fine for
me, whether I specify one or more nodes in the client. I can also easily
connect to their ports, and the sniffing works for me. I cannot replicate
any of your issues other than the first one, which I fixed. Can you please
try to replicate it and describe how to do it?

Thanks!

On Tue, Oct 29, 2013 at 6:29 PM, Mauro Farracha <farracha@gmail.com> wrote:

Ok.

I have two local nodes with the same cluster-name configuration (copy&paste
to a new folder); the only change was that on the second one (port 9501) I
set master to false, since 9500 was the master.

This is how I connect:

self.elasticsearch_conn = ['localhost:9501'] #,'localhost:9500'

self.client = Elasticsearch(hosts=self.elasticsearch_conn,
connection_class=ThriftConnection, serializer=CJSerializer(),
sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)

Scenarios:

  • If I use two nodes, 9500 and 9501, it "works"
  • If I use 9500, it works (no exception, and it indexes properly)
  • If I use 9501, it gives me the division-by-zero exception if using the
    sniff properties, otherwise a timeout when trying to index/bulk
    • The problem seems to be that the second node is not available (I
      don't know why). If I run a query with curl against 9201 it also times out

I'm pretty sure I'm missing a config option somewhere in the ES config
file, besides the changes I mentioned above.

On Tuesday, 29 October 2013 17:33:11 UTC, Honza Král wrote:

Are you sure the second node has the thrift plugin installed? It looks like
it doesn't.

Also, when you are in a configuration that works, can you try .info() a
couple of times and observe the node name to see if it's changing? It looks
like one of the nodes doesn't have thrift, so you are only talking to one.
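
Something along these lines (hedged; in 0.90 the root endpoint that info()
wraps reports the responding node's name in a 'name' field):

for _ in range(4):
    print(client.info()['name'])  # with round-robin, the name should alternate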

On Tue, Oct 29, 2013 at 6:49 PM, Mauro Farracha <farracha@gmail.com> wrote:

Yes, it's installed. The problem is that when initialising, the second node
doesn't discover the master node. Even port 9201 gives me a timeout.

This is the startup logging on the second node:

[2013-10-29 17:44:47,437][INFO ][node ] version[0.90.5], pid[12297], build[c8714e8/2013-09-17T12:50:20Z]
[2013-10-29 17:44:47,437][INFO ][node ] initializing ...
[2013-10-29 17:44:47,447][INFO ][plugins ] loaded [transport-thrift], sites
[2013-10-29 17:44:49,331][INFO ][node ] initialized
[2013-10-29 17:44:49,331][INFO ][node ] starting ...
[2013-10-29 17:44:49,350][INFO ][thrift ] bound on port [9501]
[2013-10-29 17:44:49,442][INFO ][transport ] bound_address {inet[/0:0:0:0:0:0:0:0%0:9301]}, publish_address {inet[/172.21.71.88:9301]}
[2013-10-29 17:45:19,483][WARN ][discovery ] waited for 30s and no initial state was set by the discovery
[2013-10-29 17:45:19,484][INFO ][discovery ] mdf_elasticsearch/0reUSkwsQDuGpSOaf8tiwA
[2013-10-29 17:45:19,489][INFO ][http ] bound_address {inet[/0:0:0:0:0:0:0:0%0:9201]}, publish_address {inet[/172.21.71.88:9201]}
[2013-10-29 17:45:19,489][INFO ][node ] started

When I try a query on 9201, it returns:

{
  "error": "MasterNotDiscoveredException[waited for [30s]]",
  "status": 503
}

On Tuesday, 29 October 2013, Honza Král wrote:

ah, yep, so the problem is in discovery, not the client. That's a relief :)

Have you disabled multicast discovery? Have you tried providing a list of
seed nodes?
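
For reference, a sketch of the relevant 0.90-era elasticsearch.yml settings
for that; the hosts and ports here are illustrative:

# elasticsearch.yml on each node
cluster.name: mdf_elasticsearch
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300", "127.0.0.1:9301"]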


Yes, multicast doesn't work as expected! Defining a list of nodes did the
trick.

Having two processes, each one pointing to a different node, doesn't cut it!
The throughput of each is almost half of what a single process alone achieves.

Well, it has been a nice thread - I learned a lot of new stuff.


Honza,

How can I bypass json serialization using elasticsearch-py?

From the documentation we have:
Parameters:

  • body – The operation definition and data (action-data pairs)
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide one
  • consistency – Explicit write consistency setting for the operation
  • refresh – Refresh the index after performing the operation
  • replication – Explicitly set the replication type (default: sync)

Can you provide an example of a small body that would bypass
serialization?
What options can I pass on consistency and replication and how do they
affect performance?


to bypass serialization, just pass in the body as a single string that's
already serialized, or as a list of such strings.

consistency will improve performance for potentially risky operations, but
only if you have replicas. Same for replication: if you select async it
should be a bit faster - but only for writing to replicas, and at the price
of potential instability in case of problems.
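
A minimal sketch of that approach, reusing ujson and the accesslogs index
mentioned earlier in the thread (the doc type and field names are made up):

import ujson

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=['localhost:9200'])

docs = [{'host': '10.0.0.1', 'bytes': 512},
        {'host': '10.0.0.2', 'bytes': 1024}]

# Serialize once ourselves: an action line plus a document line per doc,
# each newline-terminated, joined into one big string.
lines = []
for doc in docs:
    lines.append(ujson.dumps({'index': {}}))
    lines.append(ujson.dumps(doc))
body = '\n'.join(lines) + '\n'

# A string body is passed through to Elasticsearch untouched; replication
# is the query parameter from the docs above (async = faster but riskier).
client.bulk(body=body, index='accesslogs', doc_type='log',
            replication='async')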

On Tue, Oct 29, 2013 at 7:34 PM, Mauro Farracha farracha@gmail.com wrote:

Honza,

How can I bypass json serialization using elasticsearch-py?

From the documentation we have:
Parameters:

  • body – The operation definition and data (action-data pairs)
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide one
  • consistency – Explicit write consistency setting for the operation
  • refresh – Refresh the index after performing the operation
  • replication – Explicitly set the replication type (efault: sync)

Can you provide an example of a small body where it would pass
serialization?
What options can I pass on consistency and replication and how do they
affect performance?

On Tuesday, 29 October 2013 18:04:58 UTC, Mauro Farracha wrote:

Yes, multicast doesn't work as expect! defining a list of nodes did the
trick.

Having two processes each one pointing to a different node doesn't cut
it! The performance on each is almost half if only one was used.

Well, It has been a nice thread, learned lot of new stuff.

On Tuesday, 29 October 2013 17:54:16 UTC, Honza Král wrote:

ah, yep so the problem is in discovery, not the client. That's a relief
:slight_smile:

Have you disabled multicast discovery? Have you tried providing a list
of seed nodes?

On Tue, Oct 29, 2013 at 6:49 PM, Mauro Farracha farr...@gmail.comwrote:

Yes, It's installed. The problem is when initialising the second node
It doesn't discover the master node. Even the port 9201 gives me timeout.

This is the startup logging on the second node:
[2013-10-29 17:44:47,437][INFO ][node ]
version[0.90.5], pid[12297], build[c8714e8/2013-09-17T12:50:20Z]
[2013-10-29 17:44:47,437][INFO ][node ]
initializing ...
[2013-10-29 17:44:47,447][INFO ][plugins ] loaded
[transport-thrift], sites []
[2013-10-29 17:44:49,331][INFO ][node ] initialized
[2013-10-29 17:44:49,331][INFO ][node ] starting ...
[2013-10-29 17:44:49,350][INFO ][thrift ] bound on
port [9501]
[2013-10-29 17:44:49,442][INFO ][transport ]
bound_address {inet[/0:0:0:0:0:0:0:0%0:9301]
}, publish_address
{inet[/172.21.71.88:9301]}
[2013-10-29 17:45:19,483][WARN ][discovery ] waited for
30s and no initial state was set by the discovery
[2013-10-29 17:45:19,484][INFO ][discovery ]
mdf_elasticsearch/0reUSkwsQDuGpSOaf8tiwA
[2013-10-29 17:45:19,489][INFO ][http ]
bound_address {inet[/0:0:0:0:0:0:0:0%0:9201]
}, publish_address
{inet[/172.21.71.88:9201]}
[2013-10-29 17:45:19,489][INFO ][node ] started

When I try using a query on 9201, It returns
{
"error": "MasterNotDiscoveredException[**waited for [30s]]",
"status": 503
}

On Tuesday, 29 October 2013 17:33:11 UTC, Honza Král wrote:

are you sure the second node has the thrift plugin installed? Looks
like it doesn't.

Also when you are in a configuration that works, can you try .info() a
couple of times and observer the node name to see if it's changing? looks
like one of the nodes doesn't have thrift so you are only talking to one.

On Tue, Oct 29, 2013 at 6:29 PM, Mauro Farracha farr...@gmail.comwrote:

Ok.

I have two local nodes with the same cluster name configuration
(copy&paste to a new folder), the only change was that on the second (port:
9501) I set master to false, since 9500 was the master one.

This is how I connect:

self.elasticsearch_conn = ['localhost:9501'] #,'localhost:9500'

self.client = Elasticsearch(hosts=self.elasticsearch_conn,
connection_class=ThriftConnection, serializer=CJSerializer(),
sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)

Scenarios:

  • If I use two nodes, 9500 and 9501, It "works"
  • If I use 9500, It works (no exception and index properly)
  • If I use 9501, It gives me division by zero exception if using
    sniff properties, otherwise timeout when trying to index/bulk
    • The problem seems to be that second node is not available
      (don't know why). If I try to run a query on curl for 9201 It will also
      timeout

I'm pretty sure that I'm probably missing a config option somewhere
on ES config file besides the changes I mentioned above.

On Tuesday, 29 October 2013 17:18:47 UTC, Honza Král wrote:

when I just start two nodes with default settings it works just fine
for me, if I specify one or more nodes in the client. I can also easily
connect to their ports and the sniffing works for me. I cannot replicate
any of your issues other than the first one I fixed. Can you please try and
replicate it and describe how to do it?

Thanks!

On Tue, Oct 29, 2013 at 6:09 PM, Mauro Farracha farr...@gmail.comwrote:

The error doesn't show when I specified more than one node which is
good :), but if I put only one it will give a division by zero error.

But now It popped up another issue this time one ES itself. I
defined the second node to be only data and the cluster name is the same as
the master node. If I try to connect directly to the second node on 9501 or
even 9201 it gives me timeout.
How can I get a second node to work in cluster?

On Tuesday, 29 October 2013 16:46:59 UTC, Honza Král wrote:

Oops, there was a bug in the sniffing code, now fixed in master:

https://github.com/**elasticsear******ch/elasticsearch-**
py/commit/04afc03cdd6122bd8a7081f2956419866****c0bcfa1https://github.com/elasticsearch/elasticsearch-py/commit/04afc03cdd6122bd8a7081f2956419866c0bcfa1

Can you please try again with master?

Thanks!

On Tue, Oct 29, 2013 at 5:29 PM, Mauro Farracha <farr...@gmail.com

wrote:

It's not related to Thrift, using http also shares this behaviour.

On Tuesday, 29 October 2013 16:27:32 UTC, Mauro Farracha wrote:

Hi Honza,

Yep, I understand the issue around dependencies. The minimum you
can do is probably add this sort of information in documentation.

Regarding the mapping issue, you were right, adding the index
type solved the problem.

Since elasticsearch-py uses connection pooling with round-robin
by default, I was wondering if I could get more improvement if I had two
nodes up, since I would distribute the load between two servers, but using
ThriftConnection it throws an error which I don't understand why It happens
since Im pretty sure that Im passing the right configuration:

connection_pool.py", line 60, in select
self.rr %= len(connections)
ZeroDivisionError: integer division or modulo by zero

Scenarios:

  • two node, sniff_* properties => zerodivisionerror
  • one node, sniff_* properties => zerodivisionerror (so it's an
    issue with sniff properties?)
  • one node, no sniff_* properties => no problems
  • two node, no sniff_* properties => timeout connecting to ES.

I'm understanding that round-robin is used on each request,
right? So I would end up sending one bulk action to node1 and the second
would go to node2?

Thanks

On Tuesday, 29 October 2013 16:01:17 UTC, Honza Král wrote:

On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Thanks Honza.

I was importing the class. But not the right way :slight_smile: missing
the connection part in from.

I thought it was the case, I update the code so that it should
work in the future.

I changed the serializer (used ujson as Jorg mentioned) and I
got an improvement from 2.66MB/s to 4.7MB/s.

ah, good to know, I will give it a try. I wanted to avoid
additional dependencies, but if it makes sense I will happily switch the
client to ujson. have you also tried just passing in a big string?

Then I configured ThriftConnection and the write performance
increased to 6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

we still have to deserialize the response which curl doesn't
need to do so it will always have an advantage on us I am afraid, it
shouldn't be this big though.

Have two questions:

  • using elasticsearch-py the index mapping is not the same on
    the server side as when using pyes. Am I missing something? With pyes all
    the properties were there, but using elasticsearch-py, only type appears on
    the server side and are not the ones I specified. On the server log, It
    shows "update_mapping [accesslogs] (dynamic)" which doesn't happen with
    pyes. I'm sure I'm missing some property/config.

    how I invoke: (the mapping is on the second post)
    self.client.indices.put_mappin******
    g(index=self.doc_*collection, doc_type=self.doc_type,body={'
    *********properties':self.doc_mapping})

body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
doc_type=self.doc_type,body={s
elf.doc_type:
{'properties':self.doc
mapping
****}})

  • Also, can you guys share what's your performance on a single
    local node?

I haven't done any tests like this, it varies so much with
different HW/configuration/environment that there is little value in
absolute numbers, only thing that matters is the relative speed of python
clients, curl etc.

As I mention on my first post, these are my non-default
configurations, maybe there is still room for improvement? Not to mention
of course, that these same settings were responsible for the 12MB/s on curl.

indices.memory.index_buffer_size: 50%
indices.memory.min_index_**buffe
r_size: 300mb
index.translog.flush_threshold
******: 30000
index.store.type: mmapfs
index.merge.policy.use_**compoun**********d_file: false

these look reasonable though I am no expert. Also when using
SSDs you might benefit from switching the kernel IO scheduler to noop:
https://speakerdeck.com/**elasti********csearch/life-after-ec2https://speakerdeck.com/elasticsearch/life-after-ec2

On Tuesday, 29 October 2013 14:32:12 UTC, Honza Král wrote:

you need to import it before you intend to use it:

from elasticsearch.connection import ThriftConnection

On Tue, Oct 29, 2013 at 3:02 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Ahhhh you are the source! :slight_smile:

As I mentioned on the first post, I wrote a python script
using elasticsearch-py also and the performance was equals to pyes, but I
couldn't get it working with Thrift. The documentation available for me was
not detailed enough so I could understand how to fully use all the features
and was a little bit confusing the Connection/Transport classes.

Maybe you could help me out... the error was:
self.client = Elasticsearch(hosts=self.elast********
icsearch_conn,connection_class********=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined

I have ES thrift plugin installed (works on pyes), I have
the thrift python module installed and I import the class. Don't know what
I'm missing.

On Tuesday, 29 October 2013 13:16:47 UTC, Honza Král wrote:

I am not familiar with pyes, I did however write
elasticsearch-py and made sure you can bypass the serialization by doing it
yourself. If needed you can even supply your own serializer - just create
an instance that has .dumps() and loads() methods and behaves the same as
elasticsearch.serializer.JSONSerializer. you
can then pass it to the Elasticsearch class as an argument
(serializer=my_faster_serializ
er)

On Tue, Oct 29, 2013 at 2:07 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Hi Honza,

Ok, that could be a problem. I'm passing a python
dictionary to pyes driver. If I send a "string" json format I could pass
the serialization? Are you familiar with pyes driver?

I saw this method signature, but don't know what's the
"header", and the document can it be one full string with several documents?

index_raw_bulk(header, document)http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk

Function helper for fast inserting
Parameters:

  • header – a string with the bulk header, must be ended with a newline
  • document – a json document string, must be ended with a newline
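Reading between the lines of those docs, a sketch of how the call is presumably meant to look (index/type names are illustrative; both arguments are newline-terminated json strings):

import json
import pyes

conn = pyes.ES('localhost:9200')
header = json.dumps({'index': {'_index': 'accesslogs', '_type': 'logline'}}) + '\n'
doc = json.dumps({'ip': '1.2.3.4', 'path': '/a'}) + '\n'
conn.index_raw_bulk(header, doc)
conn.flush_bulk(forced=True)  # push the buffered bulk out immediately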

On Tuesday, 29 October 2013 12:55:55 UTC, Honza Král wrote:

Hi,

and what was the bottleneck? Has the python process
maxed out the CPU or was it waiting for network? You can try serializing
the documents yourself and passing json strings to the client's bulk()
method to make sure that's not the bottleneck (you can pass in a list of
strings or just one big string and we will just pass it along).

The python client does more than curl - it serializes
data and parses output, that's at least 2 cpu intensive operations that
need to happen. One of them you can eliminate.
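For example, a minimal sketch of handing bulk() a pre-serialized body (index and type names are illustrative):

import json

from elasticsearch import Elasticsearch

client = Elasticsearch()
docs = [{'ip': '1.2.3.4', 'path': '/a'}, {'ip': '5.6.7.8', 'path': '/b'}]
# build the newline-delimited bulk body ourselves: one action line and
# one document line per doc, each terminated by a newline
body = ''.join('{"index": {}}\n' + json.dumps(d) + '\n' for d in docs)
client.bulk(body=body, index='accesslogs', doc_type='logline')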

On Tue, Oct 29, 2013 at 1:23 PM, joerg...@gmail.com <
joerg...@gmail.com> wrote:

Also worth mentioning: the number of shards and replicas, which
affect indexing performance a lot.

Jörg


Also, I couldn't figure out how to activate elasticsearch-py loggers. How
can I set them?

On Tuesday, 29 October 2013 18:34:47 UTC, Mauro Farracha wrote:

Honza,

How can I bypass json serialization using elasticsearch-py?

From the documentation we have:
Parameters:

  • body – The operation definition and data (action-data pairs)
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide one
  • consistency – Explicit write consistency setting for the operation
  • refresh – Refresh the index after performing the operation
  • replication – Explicitly set the replication type (default: sync)

Can you provide an example of a small body that would bypass
serialization?
What options can I pass on consistency and replication and how do they
affect performance?

On Tuesday, 29 October 2013 18:04:58 UTC, Mauro Farracha wrote:

Yes, multicast doesn't work as expected! Defining a list of nodes did the
trick.

Having two processes, each pointing to a different node, doesn't cut
it! The performance of each is almost half of what a single one achieved.

Well, it has been a nice thread, I learned a lot of new stuff.

On Tuesday, 29 October 2013 17:54:16 UTC, Honza Král wrote:

ah, yep so the problem is in discovery, not the client. That's a relief
:slight_smile:

Have you disabled multicast discovery? Have you tried providing a list
of seed nodes?
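For reference, a minimal sketch of what that might look like in the second node's elasticsearch.yml on 0.90 (the cluster name comes from the log below; the port value is illustrative):

cluster.name: mdf_elasticsearch
node.master: false
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["localhost:9300"]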

On Tue, Oct 29, 2013 at 6:49 PM, Mauro Farracha farr...@gmail.comwrote:

Yes, it's installed. The problem is that when initialising the second node,
it doesn't discover the master node. Even port 9201 gives me a timeout.

This is the startup logging on the second node:
[2013-10-29 17:44:47,437][INFO ][node ] version[0.90.5], pid[12297], build[c8714e8/2013-09-17T12:50:20Z]
[2013-10-29 17:44:47,437][INFO ][node ] initializing ...
[2013-10-29 17:44:47,447][INFO ][plugins ] loaded [transport-thrift], sites []
[2013-10-29 17:44:49,331][INFO ][node ] initialized
[2013-10-29 17:44:49,331][INFO ][node ] starting ...
[2013-10-29 17:44:49,350][INFO ][thrift ] bound on port [9501]
[2013-10-29 17:44:49,442][INFO ][transport ] bound_address {inet[/0:0:0:0:0:0:0:0%0:9301]}, publish_address {inet[/172.21.71.88:9301]}
[2013-10-29 17:45:19,483][WARN ][discovery ] waited for 30s and no initial state was set by the discovery
[2013-10-29 17:45:19,484][INFO ][discovery ] mdf_elasticsearch/0reUSkwsQDuGpSOaf8tiwA
[2013-10-29 17:45:19,489][INFO ][http ] bound_address {inet[/0:0:0:0:0:0:0:0%0:9201]}, publish_address {inet[/172.21.71.88:9201]}
[2013-10-29 17:45:19,489][INFO ][node ] started

When I try running a query on 9201, it returns
{
"error": "MasterNotDiscoveredException[waited for [30s]]",
"status": 503
}

On Tuesday, 29 October 2013 17:33:11 UTC, Honza Král wrote:

are you sure the second node has the thrift plugin installed? Looks
like it doesn't.

Also when you are in a configuration that works, can you try .info() a
couple of times and observe the node name to see if it's changing? Looks
like one of the nodes doesn't have thrift so you are only talking to one.

On Tue, Oct 29, 2013 at 6:29 PM, Mauro Farracha farr...@gmail.comwrote:

Ok.

I have two local nodes with the same cluster name configuration
(copy&paste to a new folder), the only change was that on the second (port:
9501) I set master to false, since 9500 was the master one.

This is how I connect:

self.elasticsearch_conn = ['localhost:9501'] #,'localhost:9500'

self.client = Elasticsearch(hosts=self.elasticsearch_conn,
connection_class=ThriftConnection, serializer=CJSerializer(),
sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)

Scenarios:

  • If I use two nodes, 9500 and 9501, It "works"
  • If I use 9500, It works (no exception and index properly)
  • If I use 9501, It gives me division by zero exception if using
    sniff properties, otherwise timeout when trying to index/bulk
    • The problem seems to be that second node is not available
      (don't know why). If I try to run a query on curl for 9201 It will also
      timeout

I'm probably missing a config option somewhere in the ES config
file besides the changes I mentioned above.

On Tuesday, 29 October 2013 17:18:47 UTC, Honza Král wrote:

when I just start two nodes with default settings it works just fine
for me, if I specify one or more nodes in the client. I can also easily
connect to their ports and the sniffing works for me. I cannot replicate
any of your issues other than the first one I fixed. Can you please try and
replicate it and describe how to do it?

Thanks!


elasticsearch-py uses the standard python logging library - just configure it
in any way you wish and define some handlers for the loggers, for example:

from logging.config import dictConfig

CONF = {
    'version': 1,
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
        }
    },
    'loggers': {
        'elasticsearch': {'handlers': ['console'], 'level': 'DEBUG'},
        'elasticsearch.trace': {'handlers': ['console'], 'level': 'DEBUG'},
    },
}
dictConfig(CONF)


Thanks for the logging information, I think I can manage from here.

Is there any possibility of bypassing the "{index:{}}" before each
document? I would like to just send a list of documents instead of an
interleaved list with "{index:{}}" in the middle. If one wanted to change
from index to delete or update, it could be specified once for the whole
bulk operation. I don't see why ES needs the index operation before each
document.


from elasticsearch.helpers import bulk_index
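In context: a minimal sketch of that helper, which writes the interleaved "{index: {}}" action lines for you (index/type names are illustrative; extra keyword arguments are forwarded to bulk()):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk_index

client = Elasticsearch()
docs = [{'ip': '1.2.3.4', 'path': '/a'},
        {'ip': '5.6.7.8', 'path': '/b'}]
# plain dicts become index actions; the helper generates the action
# line for each document and chunks the stream into bulk requests
bulk_index(client, docs, index='accesslogs', doc_type='logline')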

On Tue, Oct 29, 2013 at 8:24 PM, Mauro Farracha farracha@gmail.com wrote:

Thanks for the logging information, I think I can manage from here.

Is there any possibility of bypassing the "{index:{}}" before each
document? I would like just to send a list of documents instead an
interleaved list with {index:{}}" in the middle. If one wanted to change
from index to delete or update, It would specify it in the bulk operation.
I don't see the need why ES has to specify the index operation before each
document.

On Tuesday, 29 October 2013 19:14:10 UTC, Honza Král wrote:

elasticsearch-py uses standard python logging library - just configure it
in any way you wish and define some handlers for the loggers, for example:

from logging.config import dictConfig

CONF = {
'handlers': {
'console': {
'class': 'logging.StreamHandler',
}
},
'loggers': {
'elasticsearch': {'handlers': ['console'], 'level': 'DEBUG'},
'elasticsearch.trace': {'handlers': ['console'], 'level': 'DEBUG'}
},
'version': 1
}
dictConfig(CONF)

On Tue, Oct 29, 2013 at 7:43 PM, Mauro Farracha farr...@gmail.comwrote:

Also, I couldn't figure out how to activate elasticsearch-py loggers.
How can I set them?

On Tuesday, 29 October 2013 18:34:47 UTC, Mauro Farracha wrote:

Honza,

How can I bypass json serialization using elasticsearch-py?

From the documentation we have:
Parameters:

  • body – The operation definition and data (action-data pairs)
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide
    one
  • consistency – Explicit write consistency setting for the
    operation
  • refresh – Refresh the index after performing the operation
  • replication – Explicitly set the replication type (efault: sync)

Can you provide an example of a small body where it would pass
serialization?
What options can I pass on consistency and replication and how do they
affect performance?

On Tuesday, 29 October 2013 18:04:58 UTC, Mauro Farracha wrote:

Yes, multicast doesn't work as expect! defining a list of nodes did
the trick.

Having two processes each one pointing to a different node doesn't cut
it! The performance on each is almost half if only one was used.

Well, It has been a nice thread, learned lot of new stuff.

On Tuesday, 29 October 2013 17:54:16 UTC, Honza Král wrote:

ah, yep so the problem is in discovery, not the client. That's a
relief :slight_smile:

Have you disabled multicast discovery? Have you tried providing a
list of seed nodes?

On Tue, Oct 29, 2013 at 6:49 PM, Mauro Farracha farr...@gmail.comwrote:

Yes, It's installed. The problem is when initialising the second
node It doesn't discover the master node. Even the port 9201 gives me
timeout.

This is the startup logging on the second node:
[2013-10-29 17:44:47,437][INFO ][node ]
version[0.90.5], pid[12297], build[c8714e8/2013-09-17T12:50:20Z]
[2013-10-29 17:44:47,437][INFO ][node ]
initializing ...
[2013-10-29 17:44:47,447][INFO ][plugins ] loaded
[transport-thrift], sites
[2013-10-29 17:44:49,331][INFO ][node ]
initialized
[2013-10-29 17:44:49,331][INFO ][node ] starting
...
[2013-10-29 17:44:49,350][INFO ][thrift ] bound on
port [9501]
[2013-10-29 17:44:49,442][INFO ][transport ]
bound_address {inet[/0:0:0:0:0:0:0:0%0:9301]}, publish_address
{inet[/172.21.71.88:9301]}
[2013-10-29 17:45:19,483][WARN ][discovery ] waited
for 30s and no initial state was set by the discovery
[2013-10-29 17:45:19,484][INFO ][discovery ]
mdf_elasticsearch/0reUSkwsQDuGpSOaf8tiwA
[2013-10-29 17:45:19,489][INFO ][http ]
bound_address {inet[/0:0:0:0:0:0:0:0%0:9201]
}, publish_address
{inet[/172.21.71.88:9201]}
[2013-10-29 17:45:19,489][INFO ][node ] started

When I try using a query on 9201, It returns
{
"error": "MasterNotDiscoveredException[****waited for [30s]]",
"status": 503
}

On Tuesday, 29 October 2013 17:33:11 UTC, Honza Král wrote:

are you sure the second node has the thrift plugin installed? Looks
like it doesn't.

Also, when you are in a configuration that works, can you try
.info() a couple of times and observe the node name to see if it's
changing? It looks like one of the nodes doesn't have thrift, so you are
only talking to one.

On Tue, Oct 29, 2013 at 6:29 PM, Mauro Farracha farr...@gmail.com wrote:

Ok.

I have two local nodes with the same cluster name configuration
(copy&paste to a new folder); the only change was that on the second
(port 9501) I set master to false, since 9500 was the master one.

This is how I connect:

self.elasticsearch_conn = ['localhost:9501']  # ,'localhost:9500'

self.client = Elasticsearch(hosts=self.elasticsearch_conn,
                            connection_class=ThriftConnection,
                            serializer=CJSerializer(), sniff_on_start=True,
                            sniff_on_connection_fail=True, sniffer_timeout=60)

Scenarios:

  • If I use two nodes, 9500 and 9501, it "works"
  • If I use 9500, it works (no exception, indexes properly)
  • If I use 9501, it gives me the division by zero exception when using
    the sniff properties, otherwise a timeout when trying to index/bulk
    • The problem seems to be that the second node is not available (I
      don't know why). If I run a query against 9201 with curl, it also
      times out

I'm probably missing a config option somewhere in the ES config
file besides the changes I mentioned above.

On Tuesday, 29 October 2013 17:18:47 UTC, Honza Král wrote:

when I just start two nodes with default settings it works just
fine for me, whether I specify one or more nodes in the client. I can also
easily connect to their ports and the sniffing works for me. I cannot
replicate any of your issues other than the first one I fixed. Can you
please try to replicate it and describe how to do it?

Thanks!

On Tue, Oct 29, 2013 at 6:09 PM, Mauro Farracha <
farr...@gmail.com> wrote:

The error doesn't show when I specify more than one node, which
is good :), but if I put only one it gives a division by zero error.

But now another issue popped up, this time on ES itself. I
defined the second node to be data-only and the cluster name is the same
as the master node's. If I try to connect directly to the second node on
9501 or even 9201 it gives me a timeout.
How can I get a second node to work in the cluster?

On Tuesday, 29 October 2013 16:46:59 UTC, Honza Král wrote:

Oops, there was a bug in the sniffing code, now fixed in master:

https://github.com/elasticsearch/elasticsearch-py/commit/04afc03cdd6122bd8a7081f2956419866c0bcfa1

Can you please try again with master?

Thanks!

On Tue, Oct 29, 2013 at 5:29 PM, Mauro Farracha <
farr...@gmail.com> wrote:

It's not related to Thrift; the same behaviour also occurs over
HTTP.

On Tuesday, 29 October 2013 16:27:32 UTC, Mauro Farracha wrote:

Hi Honza,

Yep, I understand the issue around dependencies. The least you
can do is probably add this sort of information to the documentation.

Regarding the mapping issue, you were right, adding the index
type solved the problem.

Since elasticsearch-py uses connection pooling with
round-robin by default, I was wondering if I could get more improvement
with two nodes up, since I would distribute the load between two servers;
but using ThriftConnection it throws an error which I don't understand,
since I'm pretty sure that I'm passing the right configuration:

connection_pool.py", line 60, in select
self.rr %= len(connections)
ZeroDivisionError: integer division or modulo by zero
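
For context, the selector in that traceback does roughly this (a
simplified sketch, not the library's exact code):

class RoundRobinSelector:
    def __init__(self):
        self.rr = -1

    def select(self, connections):
        self.rr += 1
        self.rr %= len(connections)  # blows up when the pool has no live connections
        return connections[self.rr]

So the ZeroDivisionError just means the connection pool ended up empty,
i.e. sniffing left no usable nodes.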

Scenarios:

  • two nodes, sniff_* properties => ZeroDivisionError
  • one node, sniff_* properties => ZeroDivisionError (so it's an
    issue with the sniff properties?)
  • one node, no sniff_* properties => no problems
  • two nodes, no sniff_* properties => timeout connecting to ES.

My understanding is that round-robin is used on each request,
right? So I would end up sending one bulk action to node1 and the second
would go to node2?

Thanks

On Tuesday, 29 October 2013 16:01:17 UTC, Honza Král wrote:

On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Thanks Honza.

I was importing the class, but not the right way :slight_smile: I was
missing the connection part in the from statement.

I thought that was the case; I updated the code so that it should work in
the future.

I changed the serializer (used ujson as Jorg mentioned) and
I got an improvement from 2.66MB/s to 4.7MB/s.

ah, good to know, I will give it a try. I wanted to avoid
additional dependencies, but if it makes sense I will happily switch the
client to ujson. have you also tried just passing in a big string?

Then I configured ThriftConnection and the write
performance increased to 6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

we still have to deserialize the response, which curl doesn't
need to do, so it will always have an advantage over us I'm afraid; it
shouldn't be this big, though.

I have two questions:

  • using elasticsearch-py the index mapping is not the same
    on the server side as when using pyes. Am I missing something? With pyes
    all the properties were there, but with elasticsearch-py only the type
    appears on the server side, and the properties are not the ones I
    specified. The server log shows "update_mapping [accesslogs] (dynamic)",
    which doesn't happen with pyes. I'm sure I'm missing some
    property/config.

    How I invoke it (the mapping is in the second post):
    self.client.indices.put_mapping(index=self.doc_collection,
        doc_type=self.doc_type, body={'properties': self.doc_mapping})

body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type,
    body={self.doc_type: {'properties': self.doc_mapping}})

  • Also, can you guys share your performance on a single local
    node?

I haven't done any tests like this; it varies so much with
different HW/configuration/environment that there is little value in
absolute numbers. The only thing that matters is the relative speed of
the python clients, curl etc.

As I mentioned in my first post, these are my non-default
configurations; maybe there is still room for improvement? Not to
mention, of course, that these same settings were responsible for the
12MB/s with curl.

indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false

these look reasonable though I am no expert. Also, when using
SSDs you might benefit from switching the kernel IO scheduler to noop:
https://speakerdeck.com/elasticsearch/life-after-ec2

On Tuesday, 29 October 2013 14:32:12 UTC, Honza Král wrote:

you need to import it before you intend to use it:

from elasticsearch.connection import ThriftConnection
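
A minimal working setup would then look like this (the host and port are
assumptions; the thrift plugin binds to 9500 by default):

from elasticsearch import Elasticsearch
from elasticsearch.connection import ThriftConnection

es = Elasticsearch(hosts=['localhost:9500'],
                   connection_class=ThriftConnection)
print(es.info())  # sanity check that the thrift transport answers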

On Tue, Oct 29, 2013 at 3:02 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Ahhhh you are the source! :slight_smile:

As I mentioned in the first post, I also wrote a python script
using elasticsearch-py and the performance was equal to pyes, but I
couldn't get it working with Thrift. The documentation available was not
detailed enough for me to understand how to fully use all the features,
and the Connection/Transport classes were a little confusing.

Maybe you could help me out... the error was:
self.client = Elasticsearch(hosts=self.elasticsearch_conn,
                             connection_class=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined

I have the ES thrift plugin installed (it works with pyes), I
have the thrift python module installed, and I import the class. I don't
know what I'm missing.

On Tuesday, 29 October 2013 13:16:47 UTC, Honza Král
wrote:

I am not familiar with pyes; I did however write
elasticsearch-py and made sure you can bypass the serialization by doing
it yourself. If needed you can even supply your own serializer: just
create an instance that has .dumps() and .loads() methods and behaves the
same as elasticsearch.serializer.JSONSerializer.
You can then pass it to the Elasticsearch class as an argument
(serializer=my_faster_serializer).
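
A sketch of such a serializer backed by ujson (assuming ujson is
installed; presumably this is what the CJSerializer mentioned earlier in
the thread does):

import ujson

from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class UJSONSerializer(JSONSerializer):
    def loads(self, s):
        return ujson.loads(s)

    def dumps(self, data):
        # pass pre-serialized (string) bodies through untouched
        if isinstance(data, str):
            return data
        return ujson.dumps(data)

es = Elasticsearch(serializer=UJSONSerializer())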

On Tue, Oct 29, 2013 at 2:07 PM, Mauro Farracha <
farr...@gmail.com> wrote:

Hi Honza,

Ok, that could be a problem. I'm passing a python
dictionary to the pyes driver. If I send the JSON as a string, could I
bypass the serialization? Are you familiar with the pyes driver?

I saw this method signature, but I don't know what the
"header" is, and can the document be one full string with several
documents?

index_raw_bulk(header, document)
http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk

Function helper for fast inserting
Parameters:

  • header – a string with the bulk header, must end with a
    newline
  • document – a json document string, must end with a
    newline
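
Going by that signature, usage would presumably look like this (the
index/type names are made up):

header = '{"index": {"_index": "accesslogs", "_type": "logline"}}\n'
document = '{"status": 200, "path": "/index.html"}\n'
es.index_raw_bulk(header, document)  # es being a pyes.ES instance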

On Tuesday, 29 October 2013 12:55:55 UTC, Honza Král
wrote:

Hi,

and what was the bottleneck? Has the python process
maxed out the CPU or was it waiting for the network? You can try
serializing the documents yourself and passing json strings to the
client's bulk() method to make sure that's not the bottleneck (you can
pass in a list of strings or just one big string and we will just pass it
along).

The python client does more than curl: it serializes
data and parses output; that's at least 2 cpu-intensive operations that
need to happen. One of them you can eliminate.

On Tue, Oct 29, 2013 at 1:23 PM, joerg...@gmail.com <
joerg...@gmail.com> wrote:

Also worth mentioning: the number of shards and replicas,
which strongly affect indexing performance.
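
For an initial bulk load something like this is common (example values;
replicas can be raised again after the load):

index.number_of_shards: 5
index.number_of_replicas: 0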

Jörg

Each bulk operation line is split out from the stream and passed on to the
node where its shard resides. It would be expensive to reconstruct, for
each document, the operation the user wants to execute.
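
That is, every action line travels with its document, e.g. (hypothetical
index/type):

{"index": {"_index": "accesslogs", "_type": "logline"}}
{"status": 200, "bytes": 1234}
{"index": {"_index": "accesslogs", "_type": "logline"}}
{"status": 404, "bytes": 0}

Each action/document pair can then be routed to its shard independently.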

Jörg
