ES write performance

Hi,

I wrote a Python script using elasticsearch-py and another using pyes,
configured my bulk size to be 5000 records (tested with more without
improvement), one node only, no refresh interval, no replicas, Thrift
protocol, and the node runs on an SSD. The maximum insert performance I
could achieve was 10,400 records per second. If I split my input source
into four, the write performance degrades to 6,500/sec.
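
(For reference, the loop is roughly the following sketch - simplified,
with a stand-in input source; the index/type names are placeholders:)

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["localhost:9200"])

def read_docs():
    # stand-in for the real input source
    for i in range(332400):
        yield {"line": i, "url": "/index.html", "http_status": 200}

# send the documents in bulks of 5000 records each
for ok, result in streaming_bulk(es, read_docs(), index="accesslogs",
                                 doc_type="accesslog", chunk_size=5000):
    if not ok:
        print(result)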

Interesting... if I put all the documents in a text file (332,400 total
documents, ~100MB) and run the bulk insert from the curl command, I end up
with a total time of ~8 seconds, which means ~41,000/sec.
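
(For reference, that test is the standard bulk endpoint - something along
the lines of the following, with documents.txt standing in for the dump of
action/source line pairs:)

curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.txt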

Non-default ES config changes:
indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false

Is this the best write performance I can get from Python? How can I
improve it?

Some info is unknown about your setup: What about refresh interval during
bulk? What about concurrency in the bulk requests? What mapping do you use?
Do you measure the time after refresh, or optimize? Do you index over
network interface, from remote or local host? Do you use compression? How
much heap and RAM is available? How many CPU cores are running, at which
speed? How fast are sustainable writes by the SSD, is it using 3Gbit/s or
6Gbit/s interface (SATA) or even PCIe?

It is not easy to compare thrift and HTTP, since the protocols are very
different.

Python's standard JSON codec is known to be slow, but it can be replaced.

Also, I recommend describing indexing performance in MB/sec, not docs/sec,
since doc sizes may vary. Note that 100MB in 8 sec means ~12.5 MB/sec,
which could be a limit of your single-node hardware.

Jörg

Also worth mentioning: the number of shards and replicas, which strongly
affect indexing performance.

Jörg

Hi,

and what was the bottleneck? Had the Python process maxed out the CPU, or
was it waiting for the network? You can try serializing the documents
yourself and passing JSON strings to the client's bulk() method to make
sure that's not the bottleneck (you can pass in a list of strings or just
one big string and we will just pass it along).

The python client does more than curl - it serializes data and parses
output; that's at least two CPU-intensive operations that need to happen.
One of them you can eliminate.
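
Something like this rough sketch, where docs stands in for your own data:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch()
docs = [{"url": "/index.html", "http_status": 200}]  # placeholder data

# build the bulk body ourselves; the client passes the string through
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "accesslogs",
                                       "_type": "accesslog"}}))
    lines.append(json.dumps(doc))

es.bulk(body="\n".join(lines) + "\n")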

Hi Jörg,

First of all, thanks for your attention.

Not all info is unknown :)

  • Refresh interval was set to -1

  • Replica set to 0

  • Shards: default (5)

  • Concurrency: single process; multiple processes were introduced in a
    second test just to see if there was an improvement, but it was
    actually worse

  • Mapping:
    {'worker': {'index': 'not_analyzed', 'type': 'string', 'include_in_all': 'true'},
     'filename': {'index': 'no', 'type': 'string', 'include_in_all': 'false'},
     'extractionts': {'index': 'no', 'type': 'date', 'include_in_all': 'false',
                      'format': 'yyyy-MM-dd HH:mm'},
     'line': {'index': 'no', 'type': 'integer', 'include_in_all': 'false'},
     'timest': {'index': 'not_analyzed', 'type': 'date', 'include_in_all': 'false',
                'format': 'dd/MMM/yyyy:HH:mm:ss'},
     'server': {'index': 'not_analyzed', 'type': 'ip', 'include_in_all': 'false'},
     'http_method': {'index': 'not_analyzed', 'type': 'string', 'include_in_all': 'true'},
     'url': {'index': 'not_analyzed', 'type': 'string', 'include_in_all': 'true'},
     'http_status': {'index': 'not_analyzed', 'type': 'integer', 'include_in_all': 'true'}}

  • I measure the time after flushing at the end of sending all documents,
    and only then do I re-configure the refresh_interval back to 1s and
    replicas back to 1

  • I don't run optimize; I'm just focused on inserting data as fast as
    I can

  • Single local node

  • No compression

  • Memory: 8GB RAM, 3GB heap (Xms=Xmx)

  • Quad-core: 8 CPUs at 2.2GHz

  • SATA 3Gbit/s; a disk benchmark gave me 217MB/s writes on average

  • Average write throughput in MB/s is 2.66MB/s!! Far off from the curl
    performance, which was 12MB/s

In the meantime, I'll look at how I can change the default Python JSON
encoder. But this "encoding" happens in the python-es driver, right?!

Hi Honza,

OK, that could be the problem. I'm passing a Python dictionary to the pyes
driver. If I send the JSON as a string, can I skip the serialization?
Are you familiar with the pyes driver?

I saw this method signature, but I don't know what the "header" is, and
can the document be one full string with several documents?

index_raw_bulk(header, document)
http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk

Function helper for fast inserting
Parameters:

  • header – a string with the bulk header, must end with a newline
  • document – a JSON document string, must end with a newline

I am not familiar with pyes; I did, however, write elasticsearch-py and
made sure you can bypass the serialization by doing it yourself. If needed
you can even supply your own serializer - just create an instance that has
.dumps() and .loads() methods and behaves the same as
elasticsearch.serializer.JSONSerializer. You can then pass it to the
Elasticsearch class as an argument (serializer=my_faster_serializer).
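
A minimal sketch of such a serializer, assuming the ujson package:

import ujson
from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class UJSONSerializer(JSONSerializer):
    def dumps(self, data):
        # pass strings through untouched, like the default serializer
        if isinstance(data, (str, bytes)):
            return data
        return ujson.dumps(data)

    def loads(self, s):
        return ujson.loads(s)

es = Elasticsearch(serializer=UJSONSerializer())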

I suggest you use the official Python client
https://github.com/elasticsearch/elasticsearch-py instead of pyes, because
it has a much cleaner bulk API.

If you are sure your CPU is burnt by JSON serialization, it may be worth
experimenting with a faster JSON codec like ujson
https://pypi.python.org/pypi/ujson/ and replacing the standard json module
in
https://github.com/elasticsearch/elasticsearch-py/blob/master/elasticsearch/serializer.py

Jörg

Ahhhh you are the source! :)

As I mentioned in the first post, I also wrote a Python script using
elasticsearch-py and the performance was equal to pyes, but I couldn't get
it working with Thrift. The documentation available wasn't detailed enough
for me to understand how to fully use all the features, and the
Connection/Transport classes were a little confusing.

Maybe you could help me out... the error was:
self.client =
Elasticsearch(hosts=self.elasticsearch_conn,connection_class=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined

I have the ES thrift plugin installed (it works with pyes), I have the
thrift Python module installed, and I import the class. I don't know what
I'm missing.

you need to import it before you use it:

from elasticsearch.connection import ThriftConnection
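
A fuller sketch (9500 here is an assumption - the thrift plugin's usual
default port; adjust to your setup):

from elasticsearch import Elasticsearch
from elasticsearch.connection import ThriftConnection

es = Elasticsearch(hosts=[{"host": "localhost", "port": 9500}],
                   connection_class=ThriftConnection)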

Thanks Honza.

I was importing the class, but not the right way :) I was missing the
connection part in the from.

I changed the serializer (used ujson as Jörg mentioned) and got an
improvement from 2.66MB/s to 4.7MB/s.

Then I configured ThriftConnection and the write performance increased to
6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

I have two questions:

  • using elasticsearch-py, the index mapping on the server side is not
    the same as when using pyes. Am I missing something? With pyes all the
    properties were there, but with elasticsearch-py only the type appears
    on the server side, and the types are not the ones I specified. The
    server log shows "update_mapping [accesslogs] (dynamic)", which
    doesn't happen with pyes. I'm sure I'm missing some property/config.

    How I invoke it (the mapping is in my second post):
    self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type, body={'properties': self.doc_mapping})

    Server-side result:
    "properties": {
      "extractionts": { "type": "string" },
      "filename": { "type": "string" },
      "http_method": { "type": "string" },
      "http_status": { "type": "string" },
      "line": { "type": "long" },
      "server": { "type": "string" },
      "timest": { "type": "string" },
      "url": { "type": "string" },
      "worker": { "type": "string" }
    }

  • Also, can you guys share your write performance on a single local node?

As I mentioned in my first post, these are my non-default configurations;
maybe there is still room for improvement? Not to mention, of course, that
these same settings were responsible for the 12MB/s with curl.

indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false

At least for HTTP, if not also for Thrift (unless it's already included),
I would suggest gzip compression on the wire, but I'm not sure how the
Python client can enable this.

Jörg

On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha farracha@gmail.com wrote:

Thanks Honza.

I was importing the class, but not the right way :) I was missing the
connection part in the from.

I thought that was the case; I updated the code so that it will work in
the future.

I changed the serializer (used ujson as Jörg mentioned) and got an
improvement from 2.66MB/s to 4.7MB/s.

ah, good to know, I will give it a try. I wanted to avoid additional
dependencies, but if it makes sense I will happily switch the client to
ujson. Have you also tried just passing in a big string?

Then I configured ThriftConnection and the write performance increased to
6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

we still have to deserialize the response, which curl doesn't need to do,
so it will always have an advantage over us I'm afraid; it shouldn't be
this big, though.

Have two questions:

  • using elasticsearch-py the index mapping is not the same on the server
    side as when using pyes. Am I missing something? With pyes all the
    properties were there, but using elasticsearch-py, only type appears on the
    server side and are not the ones I specified. On the server log, It shows
    "update_mapping [accesslogs] (dynamic)" which doesn't happen with pyes. I'm
    sure I'm missing some property/config.

    how I invoke: (the mapping is on the second post)
    self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type,body={'properties':self.doc_mapping})

body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
doc_type=self.doc_type, body={self.doc_type:
{'properties': self.doc_mapping}})

  • Also, can you guys share your write performance on a single local node?

I haven't done any tests like this; it varies so much with different
HW/configuration/environment that there is little value in absolute
numbers. The only thing that matters is the relative speed of the python
clients, curl, etc.

As I mentioned in my first post, these are my non-default configurations;
maybe there is still room for improvement? Not to mention, of course, that
these same settings were responsible for the 12MB/s with curl.

indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false

these look reasonable, though I am no expert. Also, when using SSDs you
might benefit from switching the kernel IO scheduler to noop:
https://speakerdeck.com/elasticsearch/life-after-ec2

Hmmm... I'll investigate this lead, but compression adds CPU processing,
and ES would probably need to decompress?

It's a space/time tradeoff: compressed data needs a fraction of the
network transport resources and saves memory overhead. If that gains more
than the CPU spends on compressing/decompressing, it's a win. The larger
the data, the more compression wins.

Jörg

Yep, makes sense when we have a network in the middle.

Hi Honza,

Yep, I understand the issue around dependencies. The minimum you could do
is probably add this sort of information to the documentation.

Regarding the mapping issue, you were right; adding the index type solved
the problem.

Since elasticsearch-py uses connection pooling with round-robin by
default, I was wondering if I could get more improvement with two nodes
up, since I would distribute the load between two servers. But using
ThriftConnection it throws an error which I don't understand, since I'm
pretty sure I'm passing the right configuration:

connection_pool.py", line 60, in select
self.rr %= len(connections)
ZeroDivisionError: integer division or modulo by zero

Scenarios:

  • two node, sniff_* properties => zerodivisionerror
  • one node, sniff_* properties => zerodivisionerror (so it's an issue with
    sniff properties?)
  • one node, no sniff_* properties => no problems
  • two node, no sniff_* properties => timeout connecting to ES.
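
For reference, a rough sketch of the two-node setup being described, with
placeholder host names (shown with the default HTTP connection):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=["node1:9200", "node2:9200"],  # placeholder hosts
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,  # re-sniff the node list every 60 seconds
)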

Am I right that round-robin is used on each request? So I would end up
sending one bulk action to node1 and the second to node2?

Thanks

It's not related to Thrift; HTTP shows the same behaviour.

Regarding the IO scheduler: I'm on macOS, so I don't think it applies
here. Great presentation on shard recovery performance, though! One to
keep in mind for the future.

On Tuesday, 29 October 2013 16:29:03 UTC, Mauro Farracha wrote:

It's not related to Thrift, using http also shares this behaviour.

On Tuesday, 29 October 2013 16:27:32 UTC, Mauro Farracha wrote:

Hi Honza,

Yep, I understand the issue around dependencies. The minimum you can do
is probably add this sort of information in documentation.

Regarding the mapping issue, you were right, adding the index type solved
the problem.

Since elasticsearch-py uses connection pooling with round-robin by
default, I was wondering if I could get more improvement if I had two nodes
up, since I would distribute the load between two servers, but using
ThriftConnection it throws an error which I don't understand why It happens
since Im pretty sure that Im passing the right configuration:

connection_pool.py", line 60, in select
self.rr %= len(connections)
ZeroDivisionError: integer division or modulo by zero

Scenarios:

  • two node, sniff_* properties => zerodivisionerror
  • one node, sniff_* properties => zerodivisionerror (so it's an issue
    with sniff properties?)
  • one node, no sniff_* properties => no problems
  • two node, no sniff_* properties => timeout connecting to ES.

I'm understanding that round-robin is used on each request, right? So I
would end up sending one bulk action to node1 and the second would go to
node2?

Thanks

On Tuesday, 29 October 2013 16:01:17 UTC, Honza Král wrote:

On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha farr...@gmail.comwrote:

Thanks Honza.

I was importing the class. But not the right way :slight_smile: missing the
connection part in from.

I thought it was the case, I update the code so that it should work in
the future.

I changed the serializer (used ujson as Jorg mentioned) and I got an
improvement from 2.66MB/s to 4.7MB/s.

ah, good to know, I will give it a try. I wanted to avoid additional
dependencies, but if it makes sense I will happily switch the client to
ujson. have you also tried just passing in a big string?

Then I configured ThriftConnection and the write performance increased
to 6.2MB/s.

Not bad, but still far off from the 12MB/s from curl.

we still have to deserialize the response which curl doesn't need to do
so it will always have an advantage on us I am afraid, it shouldn't be this
big though.

Have two questions:

  • using elasticsearch-py the index mapping is not the same on the
    server side as when using pyes. Am I missing something? With pyes all the
    properties were there, but using elasticsearch-py, only type appears on the
    server side and are not the ones I specified. On the server log, It shows
    "update_mapping [accesslogs] (dynamic)" which doesn't happen with pyes. I'm
    sure I'm missing some property/config.

    how I invoke: (the mapping is on the second post)
    self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type,body={'properties':self.doc_mapping})

body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
doc_type=self.doc_type,body={self._doc_type:
{'properties':self.doc_mapping}})

  • Also, can you guys share what's your performance on a single local
    node?

I haven't done any tests like this, it varies so much with different
HW/configuration/environment that there is little value in absolute
numbers, only thing that matters is the relative speed of python clients,
curl etc.

As I mention on my first post, these are my non-default configurations,
maybe there is still room for improvement? Not to mention of course, that
these same settings were responsible for the 12MB/s on curl.

indices.memory.index_buffer_**size: 50%
indices.memory.min_index_**buffer_size: 300mb
index.translog.flush_**threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_**compound_file: false

these look reasonable though I am no expert. Also when using SSDs you
might benefit from switching the kernel IO scheduler to noop:
https://speakerdeck.com/elasticsearch/life-after-ec2

On Tuesday, 29 October 2013 14:32:12 UTC, Honza Král wrote:

you need to import it before you intend to use it:

from elasticsearch.connection import ThriftConnection

On Tue, Oct 29, 2013 at 3:02 PM, Mauro Farracha farr...@gmail.com wrote:

Ahhhh, you are the source! :)

As I mentioned in the first post, I also wrote a python script using
elasticsearch-py and the performance was equal to pyes, but I couldn't get
it working with Thrift. The documentation available was not detailed
enough for me to understand how to fully use all the features, and the
Connection/Transport classes were a bit confusing.

Maybe you could help me out... the error was:
self.client = Elasticsearch(hosts=self.elasticsearch_conn,
    connection_class=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined

I have the ES thrift plugin installed (it works with pyes), I have the
thrift python module installed, and I import the class. I don't know what
I'm missing.

On Tuesday, 29 October 2013 13:16:47 UTC, Honza Král wrote:

I am not familiar with pyes; I did, however, write elasticsearch-py and
made sure you can bypass the serialization by doing it yourself. If needed
you can even supply your own serializer - just create an instance that has
.dumps() and .loads() methods and behaves the same as
elasticsearch.serializer.JSONSerializer. You can then pass it to the
Elasticsearch class as an argument (serializer=my_faster_serializer).
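
For instance, a minimal sketch of a ujson-backed serializer (assuming you
have the ujson package installed; it keeps the default serializer's
behaviour for strings that are already serialized):

import ujson
from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class UJSONSerializer(JSONSerializer):
    def dumps(self, data):
        if isinstance(data, str):
            return data  # already serialized, pass it through untouched
        return ujson.dumps(data)

    def loads(self, s):
        return ujson.loads(s)

client = Elasticsearch(serializer=UJSONSerializer())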

On Tue, Oct 29, 2013 at 2:07 PM, Mauro Farracha farr...@gmail.com wrote:

Hi Honza,

Ok, that could be the problem. I'm passing a python dictionary to the pyes
driver. If I send the JSON as a string, can I skip the serialization? Are
you familiar with the pyes driver?

I saw this method signature, but I don't know what the "header" is, and
can the document be one full string with several documents?

index_raw_bulk(header, document)
(http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk)

A function helper for fast inserting.
Parameters:

  • header – a string with the bulk header; must end with a newline
  • document – a JSON document string; must end with a newline
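
From the docstring I would guess the usage is something like this (my own
sketch; index, type and field names are invented):

# one header line plus one document line, both newline-terminated
header = '{"index": {"_index": "accesslogs", "_type": "log"}}\n'
document = '{"ip": "10.0.0.1", "status": 200}\n'
conn.index_raw_bulk(header, document)  # conn: a pyes.ES instance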


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Oops, there was a bug in the sniffing code, now fixed in master:

Can you please try again with master?

Thanks!

On Tue, Oct 29, 2013 at 5:29 PM, Mauro Farracha farracha@gmail.com wrote:

It's not related to Thrift; using http shows the same behaviour.


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.