On Tue, Oct 29, 2013 at 4:02 PM, Mauro Farracha farracha@gmail.com wrote:
Thanks Honza.
I was importing the class, but not the right way: I was missing the connection
part in the from statement.
I thought that was the case; I have updated the code so that it should work in
the future.
I changed the serializer (using ujson, as Jörg mentioned) and got an
improvement from 2.66MB/s to 4.7MB/s.
Ah, good to know, I will give it a try. I wanted to avoid additional
dependencies, but if it makes sense I will happily switch the client to
ujson. Have you also tried just passing in a big string?
Then I configured ThriftConnection and the write performance increased to
6.2MB/s.
Not bad, but still far from the 12MB/s I get with curl.
We still have to deserialize the response, which curl doesn't need to do, so I
am afraid it will always have an advantage over us; the gap shouldn't be this
big, though.
I have two questions:
-
Using elasticsearch-py, the index mapping on the server side is not the same
as when using pyes. Am I missing something? With pyes all the properties were
there, but with elasticsearch-py only the types show up on the server side,
and they are not the ones I specified. The server log shows
"update_mapping [accesslogs] (dynamic)", which doesn't happen with pyes. I'm
sure I'm missing some property/config.
This is how I invoke it (the mapping is in the second post):
self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type, body={'properties': self.doc_mapping})
The body should also include the doc_type, so:
self.client.indices.put_mapping(index=self.doc_collection,
    doc_type=self.doc_type,
    body={self.doc_type: {'properties': self.doc_mapping}})
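For illustration, a rough sketch of the whole call with a made-up doc_type and
made-up fields (not your actual mapping):

mapping_body = {
    "accesslog": {  # the doc_type level wraps the properties
        "properties": {
            "status": {"type": "integer"},
            "path": {"type": "string", "index": "not_analyzed"},
        }
    }
}
client.indices.put_mapping(index="accesslogs", doc_type="accesslog",
                           body=mapping_body)

With the doc_type level present in the body the server should apply your
properties instead of falling back to dynamic mapping.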
- Also, can you guys share what your performance is on a single local node?
I haven't done any tests like this; it varies so much with different
HW/configuration/environment that there is little value in absolute numbers.
The only thing that matters is the relative speed of the python clients,
curl, etc.
As I mentioned in my first post, these are my non-default configurations;
maybe there is still room for improvement? Not to mention, of course, that
these same settings were in place for the 12MB/s with curl.
indices.memory.index_buffer_size: 50%
indices.memory.min_index_buffer_size: 300mb
index.translog.flush_threshold: 30000
index.store.type: mmapfs
index.merge.policy.use_compound_file: false
These look reasonable, though I am no expert. Also, when using SSDs you might
benefit from switching the kernel IO scheduler to noop.
On Tuesday, 29 October 2013 14:32:12 UTC, Honza Král wrote:
You need to import it before you intend to use it:
from elasticsearch.connection import ThriftConnection
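For example, a minimal sketch of the whole setup (assuming the thrift
transport plugin is running on its default port 9500; adjust the host to your
environment):

from elasticsearch import Elasticsearch
from elasticsearch.connection import ThriftConnection

# connect through thrift instead of HTTP
es = Elasticsearch(hosts=[{"host": "localhost", "port": 9500}],
                   connection_class=ThriftConnection)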
On Tue, Oct 29, 2013 at 3:02 PM, Mauro Farracha farr...@gmail.com wrote:
Ahhhh, you are the source!
As I mentioned in the first post, I also wrote a python script using
elasticsearch-py and the performance was equal to pyes, but I couldn't get it
working with Thrift. The documentation available to me was not detailed enough
for me to understand how to fully use all the features, and the
Connection/Transport classes were a little bit confusing.
Maybe you could help me out... the error was:
self.client = Elasticsearch(hosts=self.elasticsearch_conn,
    connection_class=ThriftConnection)
NameError: global name 'ThriftConnection' is not defined
I have the ES thrift plugin installed (it works with pyes), I have the thrift
python module installed, and I import the class. I don't know what I'm missing.
On Tuesday, 29 October 2013 13:16:47 UTC, Honza Král wrote:
I am not familiar with pyes; I did, however, write elasticsearch-py and made
sure you can bypass the serialization by doing it yourself. If needed you can
even supply your own serializer - just create an instance that has .dumps()
and .loads() methods and behaves the same as
elasticsearch.serializer.JSONSerializer. You can then pass it to the
Elasticsearch class as an argument (serializer=my_faster_serializer).
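A rough sketch of such a serializer, assuming the ujson package and with
made-up names:

import ujson

from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class UJSONSerializer(JSONSerializer):
    # behaves like the default JSONSerializer but delegates to ujson for speed
    def dumps(self, data):
        # pass strings through untouched, like the default serializer does
        if isinstance(data, (str, bytes)):
            return data
        return ujson.dumps(data)

    def loads(self, s):
        return ujson.loads(s)

es = Elasticsearch(serializer=UJSONSerializer())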
On Tue, Oct 29, 2013 at 2:07 PM, Mauro Farracha farr...@gmail.com wrote:
Hi Honza,
Ok, that could be a problem. I'm passing a python dictionary to the pyes
driver. If I send a JSON string instead, can I bypass the serialization?
Are you familiar with the pyes driver?
I saw this method signature, but I don't know what the "header" is, and can
the document be one full string with several documents?
index_raw_bulk(header, document)
http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ES.index_raw_bulk
Function helper for fast inserting
Parameters:
- header – a string with the bulk header; must end with a newline
- document – a json document string; must end with a newline
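My best guess at the usage from the docs is something like this (index/type
names made up), but I'm not sure:

from pyes import ES

conn = ES("127.0.0.1:9200")

# both strings have to end with a newline, as the bulk API expects
header = '{"index": {"_index": "accesslogs", "_type": "log"}}\n'
document = '{"status": 200, "path": "/index.html"}\n'

conn.index_raw_bulk(header, document)
conn.flush_bulk(forced=True)  # assuming the bulk buffer must be flushed explicitly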
On Tuesday, 29 October 2013 12:55:55 UTC, Honza Král wrote:
Hi,
And what was the bottleneck? Had the python process maxed out the CPU, or was
it waiting for the network? You can try serializing the documents yourself and
passing json strings to the client's bulk() method to make sure that's not the
bottleneck (you can pass in a list of strings or just one big string and we
will just pass it along).
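For example, a minimal sketch of what I mean (index/type names made up; swap
json for ujson if you like):

import json

from elasticsearch import Elasticsearch

es = Elasticsearch()

# build the bulk payload yourself: action line + source line, newline separated
lines = []
for doc in ({"status": 200}, {"status": 404}):
    lines.append(json.dumps({"index": {}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"

# the client passes the string along as-is instead of re-serializing it
es.bulk(body=body, index="accesslogs", doc_type="log")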
The python client does more than curl - it serializes the data and parses the
output, so there are at least 2 CPU-intensive operations that need to happen.
One of them you can eliminate.
On Tue, Oct 29, 2013 at 1:23 PM, joerg...@gmail.com <joerg...@gmail.com> wrote:
Also worth mentioning: the number of shards and replicas, which affects
indexing performance a lot.
Jörg
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.