Slow inserts on bigger documents


(Eugene-2) #1

Hey guys,

I'm struggling with a strange ElasticSearch insert problem (tried both
0.10 and git master). On small documents (~800 chars) the
insert takes 0.015s (http://tinypaste.com/6f186), but on bigger
documents (~4000 chars) the insert time jumps to over 2 seconds:
http://tinypaste.com/7422fb

I have one local ES instance and I've tried different numbers of
shards and different store options. I've also tried tweaking 'direct',
'buffer_size', 'warm_cache', etc., and starting
elasticsearch with different -Des.index.gateway.snapshot_interval and
-Des.index.engine.robin.refresh_interval arguments. Nothing seems to
help :(

I hope you can point me in the right direction so I can index all
200000 of my documents (both small and big).


(Shay Banon) #2

Hey,

It's curl acting up. It takes about 2-8 milliseconds on my laptop to index
it using a different HTTP client (I used the Mac one called HTTPClient). How
do you plan to load the docs?

-shay.banon


(Eugene-2) #3

Whoa,

I just tried with HTTPClient and it does indeed get my big
document through to ES fast! Thanks a million, mate. I've been banging
my head against the wall all day today.

My original plan was to use the curl PHP extension in a PHP script (it turns
out to be as slow as the command-line curl) and iterate through my
MongoDB documents. I will now look into a different PHP client...


(Shay Banon) #4

Another option is the new REST Thrift client at master:
http://github.com/elasticsearch/elasticsearch/issues/closed#issue/354. There
is a PHP client in the works, maybe add it there?
http://github.com/nervetattoo/elasticsearch (I am not sure if it uses the curl
PHP extension)...

-shay.banon


(Eugene-2) #5

The PHP client does indeed use curl, so I think I'll just rewrite the
import script to use a raw socket (fsockopen()). This will cover my
needs for now.
Thrift, on the other hand, looks really promising. I'll look into it when
the system I am prototyping goes into production.

-- Eugene
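
A rough sketch of the raw-socket approach, for illustration only. The post
plans to use PHP's fsockopen(); Python is used here instead, and the names
build_put_request and put_document, the /docs/doc/1 path, and the JSON
Content-Type header are assumptions, not taken from the thread. The idea is
simply to write the request line, headers, and body to the socket in one go.

```python
import json
import socket


def build_put_request(host: str, port: int, path: str, body: bytes) -> bytes:
    """Build a minimal HTTP/1.1 PUT request containing only the headers below."""
    headers = [
        f"PUT {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Content-Type: application/json",  # assumption: body is JSON
        f"Content-Length: {len(body)}",
        "Connection: close",  # ask the server to close after one response
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + body


def put_document(host: str, port: int, path: str, doc: dict) -> bytes:
    """Send one document over a plain TCP socket and return the raw response."""
    body = json.dumps(doc).encode("utf-8")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_put_request(host, port, path, body))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks)
```

Something like put_document("localhost", 9200, "/docs/doc/1", {"title": "..."})
would then send a single document, assuming a node listening on port 9200.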


(Ludovic Levesque) #6

Hi all,

We faced the same problem recently (indexing slow via curl, fast via
another library) and found this option:
http://curl.haxx.se/docs/faq.html#My_HTTP_POST_or_PUT_requests_are


libcurl makes all POST and PUT requests (except for POST requests with a
very tiny request body) use the "Expect: 100-continue" header. This header
allows the server to deny the operation early so that libcurl can bail out
already before having to send any data. This is useful in authentication
cases and others.

However, many servers don't implement the Expect: stuff properly and if the
server doesn't respond (positively) within 1 second libcurl will continue
and send off the data anyway.

You can disable libcurl's use of the Expect: header the same way you disable
any header, using -H / CURLOPT_HTTPHEADER, or by forcing it to use HTTP 1.0.

Faster now with curl

Ludo
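
A minimal sketch of the workaround quoted above, assuming curl is on the
PATH. It builds a curl command line that blanks the Expect header with -H, so
curl sends the body right away instead of waiting for a 100-continue reply.
The function names, the URL, and the file name are hypothetical. With the PHP
curl extension, the equivalent should be passing array('Expect:') through
CURLOPT_HTTPHEADER.

```python
import subprocess
from typing import List


def curl_put_command(url: str, json_path: str) -> List[str]:
    """Return a curl argv that PUTs a file with the Expect header disabled."""
    return [
        "curl", "-s",
        "-X", "PUT",
        "-H", "Expect:",                   # empty value removes the header
        "--data-binary", f"@{json_path}",  # send the file contents unmodified
        url,
    ]


def index_with_curl(url: str, json_path: str) -> str:
    """Run curl and return the response body as text."""
    result = subprocess.run(
        curl_put_command(url, json_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, index_with_curl("http://localhost:9200/docs/doc/1", "big.json")
would index one document from a local file, given a node on port 9200.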


(Shay Banon) #7

Hey,

Interesting, I will see if I can add support for that. Note that using 1.0 means
that keep-alive is problematic, and you really get that boost when the
client keeps reusing the same socket, though I am not sure it's applicable in
PHP if it execs curl...

-shay.banon
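
To illustrate the point about reusing the same socket, here is a small sketch
using Python's standard http.client, which speaks HTTP/1.1 and keeps the
connection open between requests when the server allows it. The function
name, the /index/type/id URL layout, and the JSON Content-Type header are
assumptions for illustration, not something specified in this thread.

```python
import http.client
import json


def index_documents(host, port, index, doc_type, docs):
    """PUT each (doc_id, doc) pair over one persistent HTTP/1.1 connection."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        for doc_id, doc in docs:
            body = json.dumps(doc).encode("utf-8")
            conn.request(
                "PUT",
                f"/{index}/{doc_type}/{doc_id}",
                body=body,
                headers={"Content-Type": "application/json"},
            )
            response = conn.getresponse()
            response.read()  # drain the body so the socket can be reused
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses
```

Calling index_documents("localhost", 9200, "docs", "doc", [(1, {"a": 1})])
would send every document through the same connection rather than opening a
new one per insert.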

